
Knowledge Graph and Text Jointly Embedding

Zhen Wang†§, Jianwen Zhang†, Jianlin Feng§, Zheng Chen†

† Microsoft Research    § Sun Yat-sen University
† {v-zw,jiazhan,zhengc}@microsoft.com
§ {wangzh56@mail2,fengjlin@mail}.sysu.edu.cn

Abstract

We examine the embedding approach to reason new relational facts from a large-scale knowledge graph and a text corpus. We propose a novel method of jointly embedding entities and words into the same continuous vector space. The embedding process attempts to preserve the relations between entities in the knowledge graph and the concurrences of words in the text corpus. Entity names and Wikipedia anchors are utilized to align the embeddings of entities and words in the same space. Large scale experiments on Freebase and a Wikipedia/NY Times corpus show that jointly embedding brings promising improvement in the accuracy of predicting facts, compared to separately embedding knowledge graphs and text. Particularly, jointly embedding enables the prediction of facts containing entities out of the knowledge graph, which cannot be handled by previous embedding methods. At the same time, concerning the quality of the word embeddings, experiments on the analogical reasoning task show that jointly embedding is comparable to or slightly better than word2vec (Skip-Gram).

1 Introduction

Knowledge graphs such as Freebase (Bollacker et al., 2008) and WordNet (Miller, 1995) have become important resources for many AI & NLP applications such as Q&A. Generally, a knowledge graph is a collection of relational facts that are often represented in the form of a triplet (head entity, relation, tail entity), e.g., "(Obama, Born-in, Honolulu)". An urgent issue for knowledge graphs is coverage: even the largest knowledge graph, Freebase, is still far from complete.

Recently, targeting knowledge graph completion, a promising paradigm of embedding was proposed, which is able to reason new facts only from the knowledge graph (Bordes et al., 2011; Bordes et al., 2013; Socher et al., 2013; Wang et al., 2014). Generally, in this series of methods, each entity is represented as a k-dimensional vector and each relation is characterized by an operation in ℝ^k, so that a candidate fact can be asserted by simple vector operations. The embeddings are usually learnt by minimizing a global loss function over all the entities and relations in the knowledge graph. Thus, the vector of an entity may encode global information from the entire graph, and hence scoring a candidate fact by designed vector operations plays a similar role to long-range "reasoning" in the graph. However, since this requires the vectors of both entities to score a candidate fact, this type of method can only complete missing facts for which both entities already exist in the knowledge graph. Yet a missing fact often contains entities out of the knowledge graph (called out-of-kb for short in this paper), e.g., one or both entities are phrases appearing in web text but not yet included in the knowledge graph. How to deal with such facts is a significant obstacle to widely applying the embedding paradigm.

In addition to knowledge embedding, another interesting approach is the word embedding method word2vec (Mikolov et al., 2013b), which shows that learning word embeddings from an unlabeled text corpus can make the vectors connecting pairs of words of a certain relation almost parallel, e.g., vec("China") − vec("Beijing") ≈ vec("Japan") − vec("Tokyo"). However, it does not know the exact relation between the pairs. Thus, it cannot be directly applied to complete knowledge graphs.

The capabilities and limitations of knowledge embedding and word embedding have inspired us to design a mechanism to mosaic the knowledge graph and the "word graph" together in a vector space so that we can score any candidate relational fact between entities and words¹. Therefore, we propose a novel method to jointly embed entities and words into the same vector space. In our solution, we define a coherent probabilistic model for both knowledge and text, which is composed of three components: the knowledge model, the text model, and the alignment model. Both the knowledge model and the text model use the same core translation assumption for fact modeling: a candidate fact (h, r, t) is scored based on ‖h + r − t‖. The only difference is that in the knowledge model the relation r is explicitly supervised and the goal is to fit the fact triplets, while in the text model we assume that any pair of words h and t that concur in some text window are of a certain relation r, but r is a hidden variable, and the goal is to fit the concurring pairs of words. The alignment model guarantees that the embeddings of entities and words/phrases lie in the same space and impels the two models to enhance each other. Two mechanisms of alignment are introduced in this paper: utilizing names of entities and utilizing Wikipedia anchors. This way of jointly embedding knowledge and text can be considered semi-supervised knowledge embedding: the knowledge graph provides explicit supervision of facts while the text corpus provides many more "relation-unlabeled" pairs of words.

We conduct extensive large scale experiments on Freebase and the Wikipedia corpus, which show that jointly embedding brings promising improvements to the accuracy of predicting facts, compared to separately embedding the knowledge graph and the text corpus. Particularly, jointly embedding enables the prediction of a candidate fact with out-of-kb entities, which cannot be handled by any existing embedding method. We also use embeddings to provide a prior score to help fact extraction on the benchmark dataset of Freebase+NYTimes and again observe very promising improvements. Meanwhile, concerning the quality of word embeddings, experiments on the analogical reasoning task show that jointly embedding is comparable to or slightly better than word2vec (Skip-Gram).

¹ We do not distinguish between "words" and "phrases", i.e., "words" means "words/phrases".

2 Related Work

Knowledge Embedding. A knowledge graph is embedded into a low-dimensional continuous vector space while certain properties of it are preserved (Bordes et al., 2011; Bordes et al., 2013; Socher et al., 2013; Chang et al., 2013; Wang et al., 2014). Generally, each entity is represented as a point in that space while each relation is interpreted as an operation over entity embeddings. For instance, TransE (Bordes et al., 2013) interprets a relation as a translation from the head entity to the tail entity. The embedding representations are usually learnt by minimizing a global loss function involving all entities and relations, so that each entity embedding encodes both local and global connectivity patterns of the original graph. Thus, we can reason new facts from the learnt embeddings.

Word Embedding. Generally, word embeddings are learned from a given text corpus without supervision, by predicting the context of each word or predicting the current word given its context (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013a; Mikolov et al., 2013b). Although relations between words are not explicitly modeled, continuous bag-of-words (CBOW) and Skip-gram (Mikolov et al., 2013a; Mikolov et al., 2013b) learn word embeddings that capture many syntactic and semantic relations between words, where a relation is also represented as a translation between word embeddings.

Relational Facts Extraction. Another pivotal channel for knowledge graph completion is extracting relational facts from external sources such as free text (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zhang et al., 2013; Fan et al., 2014). This series of methods focuses on identifying local text patterns that express a certain relation and making predictions based on them. However, they have not fully utilized the evidence from a knowledge graph; e.g., knowledge embedding is able to reason new facts without any external sources. Actually, knowledge embedding is very complementary to traditional extraction methods, which was first confirmed by (Weston et al., 2013). To estimate the plausibility of a candidate fact, they added scores from embeddings to scores from an extractor, which showed significant improvement. However, as pointed out in the introduction, their knowledge embedding method cannot predict facts involving out-of-kb entities.

3 Jointly Embedding Knowledge and Text

We first describe the notation used in this paper. A knowledge graph ∆ is a set of triplets of the form (h, r, t), h, t ∈ E and r ∈ R, where E is the entity vocabulary and R is a collection of predefined relations. We use bold letters h, r, t to denote the corresponding embedding representations of h, r, t. A text corpus is a sequence of words drawn from the word vocabulary V. Note that we perform some preprocessing to detect phrases in the text, and the vocabulary here already includes the phrases. For simplicity's sake, without special explanation, when we say "word(s)", it means "word(s)/phrase(s)". Since we consider triplets involving not only entities but also words, we denote I = E ∪ V. Additionally, we denote the set of anchors by A.

3.1 Modeling

Our model is composed of three components: the knowledge model, the text model, and the alignment model.

Before defining the component models, we first define the element model for a fact triplet. Inspired by TransE, we also represent a relation r as a vector r ∈ ℝ^k and score a fact triplet (h, r, t) by z(h, r, t) = b − (1/2)‖h + r − t‖², where b is a bias constant designated for adjusting the scale for better numerical stability; b = 7 is a sensible choice. z(h, r, t) is expected to be large if the triplet is true. Based on the same element model of a fact, we define the component models as follows.

3.1.1 Knowledge Model

We define the following conditional probability of a fact (h, r, t) in a knowledge graph:

    Pr(h|r, t) = exp{z(h, r, t)} / Σ_{h̃∈I} exp{z(h̃, r, t)}    (1)

We have named our model pTransE (Probabilistic TransE) to show respect to TransE. We also define Pr(r|h, t) and Pr(t|h, r) in the same way by choosing the corresponding normalization terms. We define the likelihood of observing a fact triplet as:

    L_f(h, r, t) = log Pr(h|r, t) + log Pr(t|h, r) + log Pr(r|h, t)    (2)

The goal of the knowledge model is to maximize the conditional likelihoods of the existing fact triplets in the knowledge graph:

    L_K = Σ_{(h,r,t)∈∆} L_f(h, r, t)    (3)

3.1.2 Text Model

We propose the following key assumption for modeling text, which connects word embedding and knowledge embedding: there are relations between words although we do not know what they are.

Relational Concurrence Assumption. If two words w and v concur in some context, e.g., a window of text, then there is a relation r_wv between the two words. That is, we can state that the triplet (w, r_wv, v) is a fact.

We define the conditional probability Pr(w|r_wv, v) following the same formulation as Eq.(1) to model why two words concur in some context. In contrast to knowledge embedding, here r_wv is a hidden variable rather than being explicitly supervised.

The challenge is to deal with the hidden variable r_wv. Obviously, without any further assumptions, the number of distinct r_wv is around |V| × N̄, where N̄ is the average number of unique words concurring with each word. This number is extremely large, so it is almost impossible to estimate a vector for each r_wv, and the problem is actually ill-posed. We need to constrain the degrees of freedom of r_wv. Here we use auxiliary variables to reduce the number of variables we need to estimate: let w′ = w + r_wv, then

    z(w, r_wv, v) ≜ z(w′, v) = b − (1/2)‖w′ − v‖²    (4)

and

    Pr(w|r_wv, v) ≜ Pr(w|v) = exp{z(w′, v)} / Σ_{w̃∈V} exp{z(w̃′, v)}    (5)

In this way we need to estimate vectors w and w′ for each word w, a total of 2 × |V| vectors.

The goal of the text model is to maximize the likelihood of the concurrences of pairs of words in text windows:

    L_T = Σ_{(w,v)∈C} n_wv log Pr(w|v)    (6)

In the above equation, C is the set of all distinct pairs of words concurring in text windows of a fixed size, and n_wv is the number of concurrences of the pair (w, v). Interestingly, as explained in Sec. 3.3, this text model is almost equivalent to Skip-gram.
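To make the element model above concrete, here is a minimal NumPy sketch of the translation-based score z(·) behind Eqs. (1)–(5) and of the softmax Pr(h|r, t) over a small candidate set. The function names, toy sizes, and random data are our own illustration rather than the authors' implementation, and the full normalizer is computed here only because the candidate set is tiny; training instead relies on the negative sampling of Sec. 3.2.1.

```python
import numpy as np

K = 100   # embedding dimension k
B = 7.0   # bias constant b of the element model

def z(h, r, t, b=B):
    """Element score z(h, r, t) = b - 0.5 * ||h + r - t||^2.
    The text-model score of Eq. (4) has the same form with the relation folded
    into the output vector w' = w + r_wv, i.e. z(w', v) = b - 0.5 * ||w' - v||^2."""
    d = h + r - t
    return b - 0.5 * float(np.dot(d, d))

def prob_head(head_id, r, t, entity_emb):
    """Softmax Pr(h | r, t) over a candidate set (Eq. 1), for illustration only."""
    scores = np.array([z(e, r, t) for e in entity_emb])
    scores -= scores.max()          # numerical stability
    p = np.exp(scores)
    return p[head_id] / p.sum()

# Toy usage with random embeddings (5 entities, 1 relation).
rng = np.random.default_rng(0)
entity_emb = rng.normal(scale=0.1, size=(5, K))
relation = rng.normal(scale=0.1, size=K)
print(prob_head(2, relation, entity_emb[3], entity_emb))
```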

3.1.3 Alignment Model

If we only have the knowledge model and the text model, the entity embeddings and word embeddings will lie in different spaces, and any computation between them is meaningless. Thus we need mechanisms to align the two spaces into the same one. We propose two mechanisms in this paper: utilizing Wikipedia anchors, and utilizing names of entities.

Alignment by Wikipedia Anchors. This model is based on the connection between Wikipedia and Freebase: for most Wikipedia (English) pages, there is a unique corresponding entity in Freebase. As a result, for most of the anchors in Wikipedia, each of which refers to a Wikipedia page, we know that the surface phrase v of an anchor actually refers to the Freebase entity e_v. Thus, we define a likelihood for this portion of the anchors as in Eq.(6), but replace the word pair (w, v) with the word-entity pair (w, e_v), i.e., use the corresponding entity e_v rather than the surface word v in Eq.(5):

    L_AA = Σ_{(w,v)∈C, v∈A} log Pr(w|e_v)    (7)

where A denotes the set of anchors.

In addition to Wikipedia anchors, we could also use an entity linking system with satisfactory performance to produce pseudo anchors.

Alignment by Names of Entities. Another way is to use the names of entities. For a fact triplet (h, r, t) ∈ ∆, if h has a name w_h and w_h ∈ V, then we generate a new triplet (w_h, r, t) and add it to the graph. Similarly, we also add (h, r, w_t) and (w_h, r, w_t) to the graph if the names exist and belong to the word vocabulary. We call this sub-graph containing names the name graph, and define a likelihood for the name graph by observing its triplets:

    L_AN = Σ_{(h,r,t)∈∆} ( I[w_h∈V ∧ w_t∈V] · L_f(w_h, r, w_t) + I[w_h∈V] · L_f(w_h, r, t) + I[w_t∈V] · L_f(h, r, w_t) )    (8)

Both alignment models have advantages and disadvantages. Alignment by names of entities is straightforward and does not rely on additional data sources. The number of triplets generated by the names is also large and can significantly change the results. However, this model is risky. On the one hand, the name of an entity is ambiguous, because different entities sometimes share the same name, so the name graph may contaminate the knowledge embedding. On the other hand, an entity often has several different aliases when mentioned in text, but we do not have the complete set, which will break the semantic balance of the word embedding. For example, for the entity Apple Inc., suppose we only have the standard name "Apple Inc." but do not have the alias "apple"; and for the entity Apple that is the fruit, suppose we do have the name "apple" included in the name graph. Then the vector of the word "apple" will be biased toward the concept of the fruit rather than the company. If no name graph intervenes, the unsupervised word embedding is able to learn a vector that is closer to the concept of the company, owing to the relative popularity of that sense in text. Alignment by anchors relies on the additional data source of Wikipedia anchors. Moreover, the number of matched Wikipedia anchors (∼40M) is relatively small compared to the total number of word pairs (∼2.0B in Wikipedia), and hence the contribution is limited. However, the advantage is that the quality of the data is very high and there are no ambiguity/completeness issues.

Considering the above three component models together, the likelihood we maximize is:

    L = L_K + L_T + L_A    (9)

where L_A could be L_AA, L_AN, or L_AA + L_AN.
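As a small illustration of how the name graph behind Eq. (8) can be materialized, the following sketch expands each knowledge-graph triplet into the extra triplets (w_h, r, t), (h, r, w_t), and (w_h, r, w_t) whenever the entity names fall inside the word vocabulary. The dictionary-based name lookup and the toy identifiers are assumptions made for this example, not the authors' code.

```python
def name_graph_triplets(kg_triplets, entity_name, vocab):
    """Expand (h, r, t) into name-graph triplets as in Eq. (8).

    kg_triplets: iterable of (h, r, t) entity-id triplets
    entity_name: dict mapping an entity id to its name string (e.g. obtained
                 via the Freebase predicate /type/object/name), if known
    vocab:       set of words/phrases V
    """
    out = []
    for h, r, t in kg_triplets:
        wh, wt = entity_name.get(h), entity_name.get(t)
        h_ok = wh is not None and wh in vocab
        t_ok = wt is not None and wt in vocab
        if h_ok:
            out.append((wh, r, t))      # I[w_h in V] term
        if t_ok:
            out.append((h, r, wt))      # I[w_t in V] term
        if h_ok and t_ok:
            out.append((wh, r, wt))     # I[w_h in V and w_t in V] term
    return out

# Toy usage with invented identifiers.
triplets = [("ent_obama", "Born-in", "ent_honolulu")]
names = {"ent_obama": "Barack Obama", "ent_honolulu": "Honolulu"}
print(name_graph_triplets(triplets, names, {"Barack Obama", "Honolulu"}))
```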

3.2 Training

3.2.1 Approximation to the Normalizers

It is difficult to directly compute the normalizers in Pr(h|r, t) (or Pr(t|h, r), Pr(r|h, t)) and Pr(w|v), as the normalizers sum over |I| or |V| terms, where both |I| and |V| reach tens of millions. To avoid exactly calculating the normalizers, we use negative sampling (NEG) (Mikolov et al., 2013b) to transform the original objective, i.e., Eq.(9), into the simpler objective of a binary classification problem: differentiating the observed data from noise.

First, we define (i) the probability that a given triplet (h, r, t) is true (D = 1) and (ii) the probability that a given word pair (w, v) co-occurs (D = 1):

    Pr(D = 1|h, r, t) = σ(z(h, r, t))    (10)
    Pr(D = 1|w, v) = σ(z(w′, v))    (11)

where σ(x) = 1/(1 + exp{−x}) and D ∈ {0, 1}.

Instead of maximizing log Pr(h|r, t) in Eq.(2), we maximize

    log Pr(1|h, r, t) + Σ_{i=1}^{c} E_{h̃_i ∼ Pr_neg(h̃_i)} [log Pr(0|h̃_i, r, t)]    (12)

where c is the number of negative examples to be discriminated for each positive example. NEG guarantees that maximizing Eq.(12) approximately maximizes log Pr(h|r, t). Thus, we also replace log Pr(r|h, t) and log Pr(t|h, r) in Eq.(2), and log Pr(w|v) in Eq.(6), in the same way, by choosing the corresponding negative distributions. As a result, the objectives of both the knowledge model L_K (Eq.(3)) and the text model L_T (Eq.(6)) are free from cumbersome normalizers.

3.2.2 Optimization

We use stochastic gradient descent (SGD) to maximize the simplified objectives.

Knowledge model. ∆ is randomly traversed multiple times. When a positive example (h, r, t) ∈ ∆ is considered, to maximize Eq.(12) we construct c negative triplets by sampling elements from a uniform distribution over I and replacing the head of (h, r, t). The transformed objective of log Pr(r|h, t) is maximized in the same manner, but by sampling from a uniform distribution over R and corrupting the relation of (h, r, t). After a mini-batch, the computed gradients are used to update the involved embeddings.

Text model. The text corpus is traversed one or more times. When the current word v and a context word w are considered, c words are sampled from the unigram distribution raised to the 3/4 power and regarded as negative examples (w̃, v) that never concur. Then we compute and apply the related gradient updates.

Alignment model. L_AA and L_AN are absorbed by the text model and the knowledge model, respectively, since anchors are treated as predicting context given an entity, and the name graph is homogeneous to the original knowledge graph.

Joint. All three component objectives are optimized simultaneously. To deal with large-scale data, we implement a multi-threaded version with shared memory. Each thread is in charge of a portion of the data (either the knowledge graph or the text corpus), traverses it, calculates gradients, and commits updates to the global model, which is stored in a block of shared memory. For the sake of efficiency, no lock is used on the shared memory.
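The following is a minimal sketch of one such SGD step for the head-corruption term of the NEG objective (Eq. (12)). The uniform candidate set, the learning-rate handling, and the toy sizes are our simplifications; the actual system also corrupts relations (and the analogous terms for tails and word pairs), batches updates, and runs lock-free across threads as described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(h, r, t, b=7.0):
    d = h + r - t
    return b - 0.5 * float(np.dot(d, d))

def sgd_step_head(h_id, r_id, t_id, E, R, num_neg=10, lr=0.025, b=7.0):
    """One ascent step on Eq. (12) for a positive triplet, corrupting only the head.
    Negatives are drawn uniformly over the entity table here; a sampled negative
    could coincide with the true head, which a real implementation would resample."""
    cands = [h_id] + list(np.random.randint(0, E.shape[0], size=num_neg))
    labels = [1.0] + [0.0] * num_neg            # D = 1 only for the observed triplet
    for cand, label in zip(cands, labels):
        d = E[cand] + R[r_id] - E[t_id]
        g = label - sigmoid(score(E[cand], R[r_id], E[t_id], b))
        # d/dz of the log-likelihood is (label - sigma(z)); chain rule through z:
        # dz/dh = dz/dr = -(h + r - t), dz/dt = +(h + r - t).
        E[cand] -= lr * g * d
        R[r_id] -= lr * g * d
        E[t_id] += lr * g * d

# Toy usage: 5 entities, 2 relations, dimension 50.
np.random.seed(0)
E = np.random.normal(scale=0.1, size=(5, 50))
R = np.random.normal(scale=0.1, size=(2, 50))
sgd_step_head(0, 1, 3, E, R)
```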

3.3 Connections to Related Models

TransE. (Bordes et al., 2013) proposed to model a relation r as a translation vector r ∈ ℝ^k, which is expected to connect h and t with low error if (h, r, t) ∈ ∆. We follow this idea. However, TransE uses a margin-based ranking loss {‖h + r − t‖² + γ − ‖h̃ + r − t̃‖²}_+. It is not a probabilistic model and hence it needs to restrict the norms of the entity embeddings and/or relation embeddings. Bordes et al. (2013) address this issue intuitively by simply normalizing the entity embeddings to the unit sphere before computing gradients at each iteration. We define pTransE as a probabilistic model, which does not need additional constraints on the norms of the embeddings of entities/words/relations, and thus eliminates the normalization operations.

Skip-gram. (Mikolov et al., 2013a; Mikolov et al., 2013b) defines the probability of the concurrence of two words in a window as:

    Pr(w|v) = exp{w′ᵀ v} / Σ_{w̃∈V} exp{w̃′ᵀ v}    (13)

which is based on the inner product, while our text model (Eqs. (4), (5)) is based on distance. If we constrain ‖w‖ = 1 for each w, then w′ᵀ v = 1 − (1/2)‖w′ − v‖². It is easy to see that our text model is equivalent to Skip-gram in this case. Our distance-based text model is directly derived from the triplet fact model, which clearly explains why it is able to make the pairs of entities of a certain relation parallel in the vector space.

4 Experiments

We empirically evaluate and compare the related models on three tasks: triplet classification (Socher et al., 2013), improving relation extraction (Weston et al., 2013), and the analogical reasoning task (Mikolov et al., 2013a). The related models include: for knowledge embedding alone, TransE (Bordes et al., 2013) and pTransE (proposed in this paper); for word embedding alone, Skip-gram (Mikolov et al., 2013b); for both knowledge and text, we use "respectively" to refer to the embeddings learnt separately by TransE/pTransE and Skip-gram, and "jointly" to refer to our jointly embedding method, in which "anchor" and "name" refer to "Alignment by Wikipedia Anchors" and "Alignment by Names of Entities", respectively.

4.1 Data

To learn the embedding representations of entities and words, we use a knowledge graph, a text corpus, and some connections between them.

Knowledge. We adopt Freebase as our knowledge graph. First, we remove the user profiles, version control, and meta data, leaving 52,124,755 entities, 4,490 relations, and 204,120,782 triplets. We call this graph the main facts. Then we held out 8,331,147 entities from the main facts and regard them as out-of-kb entities. Under this setting, from the main facts we held out all the triplets involving out-of-kb entities, as well as 24,610,400 triplets that do not contain out-of-kb entities. The held-out triplets are used for validation and testing; the remaining triplets are used for training. See Table 1 for the statistics.

Table 1: Data: triplets used in our experiments.
    #R       #E            #Triplet (Train / Valid / Test)
    4,490    43,793,608    123,062,855 / 40,528,963 / 40,528,963

We regard out-of-kb entities as words/phrases and thus divide the held-out triplets into four types: no out-of-kb entity (e−e), the head is an out-of-kb entity but the tail is not (w−e), the tail is an out-of-kb entity but the head is not (e−w), and both the head and tail are out-of-kb entities (w−w). Then we replace the out-of-kb entities among the held-out triplets by their corresponding entity names. The mapping from a Freebase entity identifier to its name is obtained through the Freebase predicate "/type/object/name". Since some entity names are not present in our vocabulary V, we remove triplets involving those names (see Table 2). In this way, besides the missing edges between existing entities, the related models can be evaluated on triplets involving words/phrases as their head and/or tail.

Table 2: Data: the number of e−e, w−e, e−w, and w−w triplets/analogies, where w represents an out-of-kb entity, which is regarded as a word and replaced by its corresponding entity name.
    Type    #Triplet (Valid / Test)       #Analogy
    e−e     12,305,200 / 12,305,200       71,441
    w−e     3,655,164 / 3,654,404         70,878
    e−w     3,643,914 / 3,642,978         70,442
    w−w     460,762 / 451,381             40,980

Text. We adopt the Wikipedia (English) corpus. After removing pages designated for navigation, disambiguation, or discussion purposes, there are 3,469,024 articles left. We apply sentence segmentation, tokenization, part-of-speech (POS) tagging, and named entity recognition (NER) to these articles using the Apache OpenNLP package². Then we conduct some simple chunking to acquire phrases: if several consecutive tokens are identically tagged as "Location"/"Person"/"Organization", or covered by an anchor, we combine them into a chunk. After the preprocessing, our text corpus contains 73,675,188 sentences consisting of 1,522,291,723 chunks. Among them there are around 20 million distinct chunks, including words and phrases. We filter out punctuation and rare words/phrases that occur fewer than three times in the text corpus, reducing |V| to 5,240,003.

² https://opennlp.apache.org

Alignment. One of our alignment models needs Wikipedia anchors. There are around 45 million such anchors in our text corpus, and 41,970,548 of them refer to entities in E. The other mechanism utilizes the name graph constructed from the names of entities. Specifically, for each training triplet (h, r, t), suppose h and t have entity names w_h and w_t, respectively, and w_h, w_t ∈ V; the training triplet then contributes (w_h, r, w_t), (w_h, r, t), and (h, r, w_t) to the name graph. There are 81,753,310 triplets in our name graph. Note that there is no overlap between the name graph and the held-out triplets of the e−w, w−e, and w−w types.

4.2 Triplet Classification

This task judges whether a triplet (h, r, t) is true or false, i.e., binary classification of a triplet.

Evaluation protocol. Following the same protocol as NTN (Socher et al., 2013), for each true triplet we construct a false triplet by randomly sampling an element from I to corrupt its head or tail. Since |E| is significantly larger than |V| in our data, sampling from a uniform distribution over I would let triplets involving no word dominate the false triplets.

To avoid that, when we corrupt the head of (h, r, t): if h ∈ E, the corrupted head h′ is sampled from E, while if h ∈ V, h′ is sampled from V. The same rule is applied when we corrupt the tail of (h, r, t). In this way, for each of the four types of triplets, we ensure the number of true triplets is equal to the number of false ones.

To classify a triplet (h, r, t), we first use the considered methods to score it. TransE scores it by −‖h + r − t‖. Our models score it by Pr(D = 1|h, r, t) (see Eq.(10)). The considered methods then label a triplet (h, r, t) as true if its score is larger than the relation-specific threshold of r, and as false otherwise. The relation-specific thresholds are chosen to maximize the classification accuracy over the validation set.

We report the classification accuracy. Additionally, we rank all the testing triplets by their scores in descending order, draw a precision-recall (PR) curve based on this ranking, and report the area under the PR curve.

Implementation. We implement TransE (Bordes et al., 2013), Skip-gram (Mikolov et al., 2013a), and our models.

First, we train TransE and pTransE over our training triplets with embedding dimension k in {50, 100, 150}. Adhering to (Bordes et al., 2013), we use a fixed learning rate α in {0.005, 0.01, 0.05} for TransE during its 300 epochs. For pTransE, we use a number of negative examples per positive example c in {5, 10} and a learning rate α in {0.01, 0.025}, where α decreases along with its 40 epochs. The optimal configuration of TransE is: k = 100, α = 0.01. The optimal configuration of pTransE is: k = 100, c = 10, and α = 0.025.

Then we train Skip-gram with the embedding dimension k in {50, 100, 150}, the max skip-range s in {5, 10}, the number of negative examples per positive example c in {5, 10}, and learning rate α = 0.025 linearly decreasing along the 6 epochs over our text corpus. Popular words whose frequencies are larger than 10⁻⁵ are subsampled according to the trick proposed in (Mikolov et al., 2013b). The optimal configuration of Skip-gram is: k = 150, s = 5, and c = 10.

Combining the entity embeddings and word embeddings learnt by pTransE and Skip-gram respectively, the "respectively" model can score all types of held-out triplets. For our jointly embedding model, we consider the various alignment mechanisms and use equal numbers of threads for the knowledge model and the text model. The best configuration of the "jointly" model is: k = 150, s = 5, c = 10, and α = 0.025, which linearly decreases along the 6 epochs of traversing the text corpus.

Results. We first illustrate the comparison between TransE and pTransE over e−e type triplets in Table 3.

Table 3: Triplet Classification: comparison between TransE and pTransE over e−e triplets.
    Method     Accuracy (%)    Area under PR curve
    TransE     93.1            0.86
    pTransE    93.4            0.97

Observing the scores assigned to true triplets by TransE, we notice that triplets of popular relations generally receive larger scores than those of rare relations. In contrast, pTransE, as a probabilistic model, assigns comparable scores to true triplets of both popular and rare relations. When we use a threshold to separate true triplets from false triplets of the same relation, there is no obvious difference between the two models. However, when all triplets are ranked together, assigning scores on a more uniform scale is definitely an advantage. Thus, the contrast stems from the different training strategies of the two models and from the use of relation-specific thresholds.

Classification accuracies over the various types of held-out triplets are presented in Table 4.

Table 4: Triplet classification: accuracy (%) over various types of triplets.
    Type                     e−e     w−e     e−w     w−w     all
    respectively             93.4    52.1    51.4    71.0    77.5
    jointly (anchor)         94.4    67.0    66.7    79.8    81.9
    jointly (name)           94.5    80.5    80.0    89.0    87.7
    jointly (anchor+name)    95.0    82.0    81.5    90.0    88.8

The "jointly" model outperforms the "respectively" model no matter which alignment mechanism(s) are used. In fact, for the "respectively" model there is no interaction between entity embeddings and word embeddings during training, and thus its predictions over triplets that involve both an entity and a word at the same time are not much better than random guessing. It is also a natural result that alignment by names is more effective than alignment by anchors. The number of anchors is much smaller than the number of overall chunks in our text corpus. In addition, the number of entities mentioned by anchors is very limited compared with |E|. Thus, the interactions brought in by anchors are not as significant as those of the name graph.
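A small sketch of the relation-specific thresholding used in the classification protocol above: thresholds are chosen on validation scores and then applied to test triplets. The candidate-threshold choice (the observed scores) and the tie-breaking are our simplifications; the scores themselves could come from −‖h + r − t‖ (TransE) or Pr(D = 1|h, r, t) (Eq. (10)).

```python
from collections import defaultdict

def best_thresholds(valid_scored):
    """Pick one decision threshold per relation that maximizes validation accuracy.
    valid_scored: list of (relation, score, label) with label in {True, False}."""
    by_rel = defaultdict(list)
    for rel, score, label in valid_scored:
        by_rel[rel].append((score, label))
    thresholds = {}
    for rel, items in by_rel.items():
        best_acc, best_thr = -1.0, 0.0
        for thr, _ in items:                 # candidate thresholds = observed scores
            acc = sum((s > thr) == y for s, y in items) / len(items)
            if acc > best_acc:
                best_acc, best_thr = acc, thr
        thresholds[rel] = best_thr
    return thresholds

def classify(test_scored, thresholds):
    """Label a test triplet true iff its score exceeds its relation's threshold."""
    return [(rel, s > thresholds.get(rel, 0.0)) for rel, s, _ in test_scored]

# Toy usage with made-up scores.
valid = [("Born-in", 0.9, True), ("Born-in", 0.2, False), ("Born-in", 0.7, True)]
test = [("Born-in", 0.8, None), ("Born-in", 0.1, None)]
print(classify(test, best_thresholds(valid)))
```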

Figure 1: Improving Relation Extraction: PR curves (precision vs. recall) of Mintz alone or combined with the knowledge (pTransE) / jointly model over (a) e−e, (b) w−e, (c) e−w, (d) w−w, and (e) all triplets. [Plots omitted; areas under the PR curves from the legends: (a) Mintz 0.865, +Jointly 0.891, +Knowledge 0.917; (b) 0.513 / 0.636; (c) 0.663 / 0.695; (d) 0.091 / 0.099; (e) 0.364 / 0.481 / 0.419.]

4.3 Improving Relation Extraction

It has been shown that embedding models are very complementary to extractors (Weston et al., 2013). However, some entities detected from text are out-of-kb entities. In such cases, triplets involving these entities cannot be handled by any existing knowledge embedding method, but our jointly embedding model can score them. As our model can cover more of the candidate triplets provided by extractors, it is expected to provide more significant improvements to extractors than any other embedding model. We confirm this point as follows.

Evaluation protocol. For relation extraction, we use a public dataset, NYT+FB (Riedel et al., 2010)³, which distantly labels the NYT corpus with Freebase facts. We consider (Mintz et al., 2009) and Sm2r (Weston et al., 2013) as our extractors, which provide candidate triplets as well as their plausibilities estimated from text features.

³ http://iesl.cs.umass.edu/riedel/ecml/

For embedding, we first held out the triplets from our training set that appear in the test set of NYT+FB. Then we train TransE, pTransE, and the "jointly" model on the remaining training triplets as well as on our text corpus. We then use these models to score each candidate triplet in the same way as in the previous triplet classification experiment.

For combination, we first divide each candidate triplet into one of these categories: e−e, e−w, w−e, w−w, and "out-of-vocabulary". Because there is no embedding model that can score triplets involving out-of-vocabulary words/phrases, we simply ignore those triplets. Please note that, for our jointly embedding model, there are no "out-of-vocabulary" triplets if we include the NYT corpus for training. We use the embedding models to score the candidate triplets and combine the scores given by the embedding model with the scores given by the extractors. For each type e−e, e−w, w−e, w−w and their union (i.e., all), we rank the candidate triplets by their combined scores and draw a PR curve to observe which embedding method provides the most significant improvements to the extractors.

Implementation. For (Mintz et al., 2009), we use the implementation from (Surdeanu et al., 2012)⁴. We implement Sm2r ourselves with the best hyperparameters introduced in (Weston et al., 2013). For TransE, pTransE, and the "jointly" model, we use the same implementations, scoring schemes, and optimal configurations as in the triplet classification experiment.

⁴ http://nlp.stanford.edu/software/mimlre.shtml

To combine extractors with embedding models, we consider two schemes. Since Mintz scores candidate triplets in a probabilistic manner, we linearly combine its scores with the scores given by pTransE or the "jointly" model: β·Pr_Mintz + (1 − β)·Pr_pTransE/Jointly, where β is enumerated from 0 to 1 with 0.025 as the search step. On the other hand, neither Sm2r nor TransE is a probabilistic model. Thus, we combine Sm2r with TransE or the "jointly" model according to the scheme proposed in (Weston et al., 2013): for each candidate (h, r, t), if Σ_{r′≠r} δ(Score(h, r, t) < Score(h, r′, t)) is less than τ, we increase Score_Sm2r(h, r, t) by p. We search for the best β, τ, and p on another dataset: the Wikipedia corpus distantly labeled by Freebase.
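The two combination schemes can be sketched as follows; the function names and toy numbers are ours, and the thresholding rule is a simplified reading of the scheme in (Weston et al., 2013).

```python
def combine_linear(p_extractor, p_embedding, beta):
    """Probabilistic combination used with Mintz: beta*Pr_Mintz + (1-beta)*Pr_embedding."""
    return beta * p_extractor + (1.0 - beta) * p_embedding

def boost_sm2r(sm2r_score, embed_scores, rel, tau, p):
    """Sm2r-style combination: count how many other relations the embedding ranks
    above `rel` for the same (h, t); if that count is below tau, add a bonus p."""
    worse = sum(1 for r2, s in embed_scores.items()
                if r2 != rel and embed_scores[rel] < s)
    return sm2r_score + (p if worse < tau else 0.0)

# Toy usage; beta, tau, and p would be tuned on a held-out distantly labeled corpus.
print(combine_linear(0.6, 0.9, beta=0.4))
embed_scores = {"Born-in": 2.5, "Lives-in": 1.0, "Works-for": 3.0}
print(boost_sm2r(0.7, embed_scores, rel="Born-in", tau=2, p=0.5))
```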

Figure 2: Improving Relation Extraction: PR curves (precision vs. recall) of Sm2r alone or combined with the knowledge (TransE) / jointly model over (a) e−e, (b) w−e, (c) e−w, (d) w−w, and (e) all triplets. [Plots omitted; areas under the PR curves from the legends: (a) Sm2r 0.773, +Jointly 0.858, +Knowledge 0.858; (b) 0.909 / 0.967; (c) 0.876 / 0.967; (d) 0.431 / 0.536; (e) 0.674 / 0.802 / 0.698.]

Results. We present the PR curves in Figures 1 and 2. Over candidate triplets provided by either Mintz or Sm2r, the "jointly" model is consistently comparable with the "knowledge" model (TransE/pTransE) over e−e triplets, while it outperforms the "knowledge" model by a considerable margin over triplets of the other types. These results confirm the advantage of jointly embedding and are actually a straightforward consequence of our triplet classification experiment, because the only difference is that the triplets here are provided by the extractor.

4.4 Analogical Reasoning Task

We compare our method with Skip-gram on this task to observe and study the influences of both knowledge embedding and the alignment mechanisms on the quality of word embeddings.

Table 6: Phrases Analogical Reasoning Task.
    Method                   Accuracy (%)    Hits@10 (%)
    Skip-gram                18.0            56.1
    Jointly (anchor)         27.6            65.0
    Jointly (name)           11.3            40.6
    Jointly (anchor+name)    18.3            54.0

Table 7: Constructed Analogical Reasoning Task.
    Method                   Accuracy (%)    Hits@10 (%)
    Skip-gram                10.5            14.1
    Jointly (anchor)         10.5            14.3
    Jointly (name)           11.5            16.2
    Jointly (anchor+name)    11.6            16.5

Evaluation protocol. We use the same public datasets as in (Mikolov et al., 2013b): 19,544 word analogies⁵ and 3,218 phrase analogies⁶.

⁵ code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
⁶ code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt

We also construct analogies from our held-out triplets (see Table 2) by first concatenating two entity pairs of the same relation to form an analogy and then replacing the entities by their corresponding entity names, e.g., "(Obama, Honolulu, David Beckham, London)", where the relation is "Born-in".

Following (Mikolov et al., 2013b), we only consider analogies that consist of the top-K most frequent words/phrases in the vocabulary. For each analogy, denoted by (h₁, t₁, h₂, t₂), we enumerate all the top-K most frequent words/phrases w except for h₁, t₁, and h₂, and calculate the distance (cosine or Euclidean, according to the specific model) between h₂ + (t₁ − h₁) and w. Ordering all these words/phrases by their distances in ascending order, we obtain the rank of the correct answer t₂. Finally, we report Hits@10 (i.e., the proportion of correct answers whose ranks are not larger than 10) and accuracy (i.e., Hits@1). For word analogies and constructed analogies, we set K = 200,000, while for phrase analogies we set K = 1,000,000 to recall sufficient analogies.

Implementation. For Skip-gram and the "Jointly" (anchor/name/anchor+name) models, we use the same implementations and optimal configurations as in the triplet classification experiment.

Table 5: Words Analogical Reasoning Task.
                             Accuracy (%)                      Hits@10 (%)
    Method                   Semantic   Syntactic   Total      Semantic   Syntactic   Total
    Skip-gram                71.4       69.0        70.0       90.4       89.3        89.8
    Jointly (anchor)         75.3       68.3        71.2       91.5       88.9        89.9
    Jointly (name)           54.5       54.2        59.0       75.8       86.5        82.1
    Jointly (anchor+name)    56.5       65.7        61.9       78.1       87.6        83.6

Results. Jointly embedding using Wikipedia anchors for alignment consistently outperforms Skip-gram (Tables 5, 6, 7), showing that the influence of knowledge embedding, injected into word embedding through Wikipedia anchors, is beneficial. The vector of an ambiguous word is often a mixture of its several meanings, but in a specific context the word is disambiguated and refers to one specific meaning. Using such a global word embedding to predict words within a specific context may pollute the embeddings of the surrounding words. Alignment by anchors enables entity embeddings to alleviate the propagation of ambiguities and thus improves the quality of the word embeddings.

Using entity names for alignment hurts the performance on word and phrase analogies (Tables 5, 6). The main reason is that these analogies are popular facts frequently mentioned in text, while a name graph forces the word embeddings to satisfy both popular and rare facts. Another reason stems from the versatility of how an entity is mentioned. Consider "(Japan, yen, Europe, euro)" for example. Knowledge embedding is supposed to give significant help in completing this analogy, as "/location/country/currency" ∈ R. However, the entity for the Japanese currency is named "Japanese yen" rather than "yen", and thus the explicit translation learnt from knowledge embedding is not directly imposed on the word embedding of "yen". In contrast, using entity names for alignment improves the performance on constructed analogies (Table 7). Since there is a relation r ∈ R for each constructed analogy (w_h1, w_t1, w_h2, w_t2), although neither (w_h1, r, w_t1) nor (w_h2, r, w_t2) is present in the name graph, other facts involving these words act on their vectors, in the same manner as traditional knowledge embedding.

Overall, any high-quality entity linking system can be used to further improve the performance.
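As an illustration of the ranking protocol of Sec. 4.4, the following sketch scores candidates by cosine distance to h₂ + (t₁ − h₁) and reports accuracy (Hits@1) and Hits@10. The cosine choice, the candidate handling, and the toy vocabulary are our assumptions; the actual evaluation uses the top-K frequent words/phrases and the distance appropriate to each model.

```python
import numpy as np

def analogy_rank(emb, h1, t1, h2, t2, candidates):
    """Rank candidates by cosine distance to h2 + (t1 - h1); return 1-based rank of t2."""
    target = emb[h2] + (emb[t1] - emb[h1])
    target /= np.linalg.norm(target)
    def dist(w):
        v = emb[w]
        return 1.0 - float(np.dot(target, v) / np.linalg.norm(v))
    ranked = sorted((w for w in candidates if w not in (h1, t1, h2)), key=dist)
    return ranked.index(t2) + 1

def evaluate(emb, analogies, candidates):
    ranks = [analogy_rank(emb, *a, candidates) for a in analogies]
    acc = sum(r == 1 for r in ranks) / len(ranks)     # accuracy = Hits@1
    hits10 = sum(r <= 10 for r in ranks) / len(ranks)
    return acc, hits10

# Toy usage with random vectors.
rng = np.random.default_rng(0)
vocab = ["China", "Beijing", "Japan", "Tokyo", "Paris"]
emb = {w: rng.normal(size=20) for w in vocab}
print(evaluate(emb, [("China", "Beijing", "Japan", "Tokyo")], vocab))
```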

5 Conclusions

In this paper, we introduced a novel method of jointly embedding knowledge graphs and a text corpus so that entities and words/phrases are represented in the same vector space. In this way, our method can perform prediction on any candidate fact between entities/words/phrases, going beyond previous knowledge embedding methods, which can only predict facts whose entities exist in the knowledge graph. Extensive, large-scale experiments show that the proposed method is very effective at reasoning new facts. In addition, we also provide insights into word embedding, especially on the capability of analogical reasoning. In this aspect, we empirically observed some hints that jointly embedding also helps word embedding.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 301–306.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. 2013. Multi-relational latent semantic analysis. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1602–1612, Seattle, Washington, USA, October. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Miao Fan, Deli Zhao, Qiang Zhou, Zhiyuan Liu, Thomas Fang Zheng, and Edward Y. Chang. 2014. Distant supervision for relation extraction with matrix completion. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 839–849, Baltimore, Maryland, June. Association for Computational Linguistics.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1112–1119.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1366–1371, Seattle, Washington, USA, October. Association for Computational Linguistics.

Xingxing Zhang, Jianwen Zhang, Junyu Zeng, Jun Yan, Zheng Chen, and Zhifang Sui. 2013. Towards accurate distant supervision for relational facts extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 810–815, Sofia, Bulgaria, August. Association for Computational Linguistics.
