Shared Embedding Based Neural Networks for Knowledge Graph Completion
Saiping Guan, Xiaolong Jin, Yuanzhuo Wang and Xueqi Cheng
CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of
Sciences; School of Computer and Control Engineering, University of Chinese Academy of Sciences
guansaiping@software.ict.ac.cn;{jinxiaolong,wangyuanzhuo,cxq}@ict.ac.cn

ABSTRACT

Knowledge Graphs (KGs) have facilitated many real-world applications (e.g., vertical search and intelligent question answering). However, they are usually incomplete, which affects the performance of such KG based applications. To alleviate this problem, a number of Knowledge Graph Completion (KGC) methods have been developed to predict those implicit triples. Tensor/matrix based methods and translation based methods have attracted great attention for a long time. Recently, neural networks have been introduced into KGC due to their extensive superiority in many fields (e.g., natural language processing and computer vision), and achieve promising results. In this paper, we propose a Shared Embedding based Neural Network (SENN) model for KGC. It integrates the prediction tasks of head entities, relations and tail entities into a neural network based framework with shared embeddings of entities and relations, while explicitly considering the differences among these prediction tasks. Moreover, we propose an adaptively weighted loss mechanism, which dynamically adjusts the weights of losses according to the mapping properties of relations and the prediction tasks. Since relation prediction usually performs better than head and tail entity predictions, we further extend SENN to SENN+ by employing relation prediction to assist head and tail entity predictions. Experiments on benchmark datasets validate the effectiveness and merits of the proposed SENN and SENN+ methods. The shared embeddings and the adaptively weighted loss mechanism are also shown to be effective.

CCS CONCEPTS

• Computing methodologies → Reasoning about belief and knowledge;

KEYWORDS

Knowledge graph completion; shared embedding; neural network

ACM Reference Format:
Saiping Guan, Xiaolong Jin, Yuanzhuo Wang and Xueqi Cheng. 2018. Shared Embedding Based Neural Networks for Knowledge Graph Completion. In The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3269206.3271704

1 INTRODUCTION

The rapid development of Internet technologies and applications has witnessed the popularity of Knowledge Graphs (KGs), which play an increasingly important role in machine learning and artificial intelligence applications, like vertical search and intelligent question answering [4, 19, 37, 38]. KGs are a kind of labelled information network, consisting of triples of the form (head entity, relation, tail entity). Usually, KGs are incomplete, with some triples missing. Knowledge Graph Completion (KGC) [17, 33] has thus been extensively studied to fill in as many missing triples as possible, so as to promote the completeness of a KG and further improve the performance of KG based applications. Specifically, KGC includes the prediction tasks of head entities, relations and tail entities. Knowledge Graph Embedding (KGE) is currently the dominant type of approach to these prediction tasks. It represents entities and relations in a KG as vectors in a low-dimensional space; KGC is then conducted via simple vector operations.

KGE based KGC has attracted increasing attention in recent years, and a number of tensor/matrix based and translation based KGC methods have been proposed. Tensor/matrix based methods (e.g., RESCAL [23] and ComplEx [30]) regard triples in a KG as entries of a tensor/matrix, and learn the embeddings of entities and relations by minimizing the reconstruction error of the tensor/matrix. KGC is then conducted via the reconstructed tensor/matrix: triples corresponding to entries with high values are treated as valid ones. Translation based methods (e.g., TransE [3] and TransR [17]) view valid triples as translation operations of entities via relations and define a score function accordingly. They then learn the embeddings of entities and relations by minimizing score-function based losses, and new triples with high scores are used to complete the corresponding KG. Recently, motivated by the excellent performance of neural networks in natural language processing [2, 26] and computer vision [9, 29], many researchers have turned to neural network based KGC and achieved very promising results (e.g., ProjE [27] and R-GCN [25]). Neural network based methods also use scores to evaluate the validity of triples. These methods can be divided into two categories. The first category (e.g., ER-MLP [5] and ProjE [27]) produces a score function by neural networks and, similar to other methods, the embeddings of entities and relations used in the score function are randomly initialized; the second category (e.g., R-GCN [25]) uses an existing effective score function, but the embeddings of entities are acquired from neural networks rather than being randomly initialized.

Despite the effectiveness of the above various KGC methods, none of them differentiates the three types of prediction tasks directly in the training process. However, these prediction tasks can perform quite differently even on the same dataset. Take TransE [3] for example: on the widely used dataset FB15k, it achieves 0.275/0.319 on head/tail entity prediction and 0.872 on relation prediction in terms of Hits@1, which indicates the proportion of test entities/relations ranked in the top-1 position. For this reason, it is reasonable to explicitly distinguish them in a certain way. Actually, Ref. [13] has demonstrated the benefit of using different projection matrices for head and tail entities, even though it does not clearly differentiate the prediction tasks. On the other hand, prediction upon reasoning is, intuitively, a process that gradually approaches the target. Fully-connected neural networks with decreasing layer dimensions can elegantly imitate such a process [1, 24].

Motivated by the above observations, in this paper, we use fully-connected neural networks to model the prediction tasks of head entities, relations and tail entities of triples directly. In general, the main contributions of this paper can be summarized as follows:

• We propose a Shared Embedding based Neural Network (SENN) model, which explicitly distinguishes the prediction tasks of head entities, relations and tail entities, and integrates them into a fully-connected neural network based framework with shared embeddings of entities and relations.
• An adaptively weighted loss mechanism is proposed in SENN, which can well handle triples with different mapping properties and deal with different prediction tasks.
• Since relation prediction usually has better performance than head and tail entity predictions, we employ it to improve head and tail entity predictions and thus extend SENN to SENN+.

2 RELATED WORKS

Many researchers have dug into KGE based KGC, which can be divided into three categories, i.e., tensor/matrix based, translation based, and neural network based. Before introducing each category, let's formulate KGC first.

2.1 Formulation of KGC

Let G = (E, R, T) be a KG, where E is the entity set, R is the relation set and T ⊂ E × R × E is the set of triples. A triple is usually denoted as (h, r, t), where h, r, and t are its head entity, relation, and tail entity, respectively.

Since G is usually incomplete, KGC aims to find missing but valid triples, denoted as T′ (T′ ⊂ E × R × E and T′ ∩ T = ∅), of G to mitigate its inherent incompleteness. KGC contains three types of tasks, i.e., head entity prediction (given r and t, predict h, i.e., (?, r, t)), relation prediction (given h and t, predict r, i.e., (h, ?, t)), and tail entity prediction (given h and r, predict t, i.e., (h, r, ?)). Head entity prediction and tail entity prediction are together called entity prediction. These prediction tasks can be reduced to scoring triples and treating those with relatively high scores as valid ones. In this way, the crux of existing KGC methods is usually to propose an effective score function, denoted as s(h, r, t), which quantifies the strength of a triple (h, r, t) being valid. Since valid triples precede invalid ones in terms of their scores and thus have different labels from invalid ones, KGC can then be viewed as a ranking or classification task.

In KGE based KGC methods, entities and relations in a KG are often represented as real-valued vectors of size k or matrices of size k × k in a low-dimensional embedding space, where k is the latent dimension, and KGC is then transformed into simple vector operations. Following convention, in what follows, we denote the embeddings of entities and relations with the same letters but in boldface.

2.2 Tensor/Matrix Based Methods

A quintessential method of this category is RESCAL [23], which is a three-way tensor factorization based method. In RESCAL, if a triple (h, r, t) is valid, the corresponding tensor entry is set to 1; otherwise, it is set to 0. The score function of (h, r, t) is defined as follows:

s(h, r, t) = h M_r t^T,   (1)

where M_r is the relation matrix of r of size k × k and t^T is the transpose of t. The embeddings of entities and relations are then learned by minimizing the reconstruction error of the tensor. Triples corresponding to the highly scored entries of the reconstructed tensor are then used to complete the KG.

Recently, a matrix factorization based approach, called ComplEx [30], was proposed, which uses complex values to define the embeddings of entities and relations. It relates each relation with a matrix. If a head entity holds that relation with a tail entity, the corresponding entry is set to 1; otherwise, it is set to -1. In ComplEx, the score function of a triple (h, r, t) has the form:

s(h, r, t) = ⟨h, r, t̄⟩ = Re( Σ_{i=1}^{k} [h]_i [r]_i [t̄]_i ),   (2)

where h, r, t ∈ C^k are the complex embeddings of h, r and t, respectively; Re(x) returns the real part of x; x̄ = Re(x) − i Im(x) is the conjugate of x = Re(x) + i Im(x), where Im(x) returns the imaginary part of x; [x]_i is the i-th entry of vector x. ComplEx then minimizes the reconstruction error of the matrices to learn the embeddings of entities and relations. The matrices reconstructed from the learned embeddings are used to conduct KGC, similar to RESCAL [23].

2.3 Translation Based Methods

The seminal study of translation based methods, TransE [3], regards valid triples as relational translation operations between head and tail entity pairs. Specifically, it assumes that if a triple (h, r, t) is valid, the embedding h of head entity h plus the embedding r of relation r should be close to the embedding t of tail entity t, and otherwise far away. Thus, it defines the score function of (h, r, t) as:

s(h, r, t) = −‖h + r − t‖_{L1/L2}.   (3)

The embeddings of entities and relations are then learned through score-function based losses. The learned embeddings and score function are used to conduct KGC, with high-scoring triples taken as valid ones.
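To make the two score functions above concrete, the following is a minimal NumPy sketch (our own illustration, not code from any of the cited systems) that evaluates Equation (3) for TransE and Equation (2) for ComplEx on toy random embeddings:

```python
import numpy as np

k = 8                                   # toy embedding dimension
rng = np.random.default_rng(0)

# TransE, Eq. (3): real-valued embeddings; the score is the negative L1 (or L2)
# distance between h + r and t, so valid triples should score close to 0.
h, r, t = rng.normal(size=(3, k))
transe_score = -np.linalg.norm(h + r - t, ord=1)

# ComplEx, Eq. (2): complex-valued embeddings; the score is the real part of
# sum_i [h]_i [r]_i [conj(t)]_i.
hc, rc, tc = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))
complex_score = np.real(np.sum(hc * rc * np.conj(tc)))

print(transe_score, complex_score)
```

In both families, completion then amounts to ranking candidate triples by such scores.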

The above translation assumption triggered the popular Trans series, and a multitude of TransE-inspired methods have thus sprung up (e.g., TransH [32], TransR [17], TransD [12], TranSparse [13], TransA [14], STransE [21], PTransE [16], RTransE [6], DKRL [34], TKRL [35] and TEKE [31]). They differ from each other in the modification of the translation assumption or in the use of additional information and constraints, such as path features and entity descriptions. In particular, many recent studies (e.g., KG2E [10], HolE [22], TATEC [7], DistMult [36] and ANALOGY [18]) further focus on using more expressive score functions. For example, HolE applies a special compositional operator, i.e., circular correlation, denoted as ⋆, to define the score function of (h, r, t) as follows:

s(h, r, t) = r (h ⋆ t)^T.   (4)

DistMult adopts a multiplication operator among the head entity vector, a relation diagonal matrix and the tail entity vector to define the score function. ANALOGY further models the analogical properties of the embedded entities and relations.

2.4 Neural Network Based Methods

Neural network based methods are newly developed ones, as compared to tensor/matrix based and translation based methods. Many researchers have paid increasing attention to this category of methods.

ER-MLP [5] uses a multi-layer perceptron to capture the implicit interactions among head entity, relation and tail entity. Then, a sigmoid function is applied to get a score indicating the validity of the corresponding triple. In ER-MLP, the score function of a triple (h, r, t) is defined as:

s(h, r, t) = sigmoid(f([h; t; r] W) v),   (5)

where f(·) is a nonlinear function, such as tanh; W ∈ R^{3k×ξ} and v ∈ R^{ξ×1} are the parameters of the network, and ξ is related to the size of the hidden layer.

Recently, ProjE [27] models head and tail entity predictions using neural networks with a combination layer and a projection layer. The former combines the known head/tail entity and relation via a combination operator to create a target vector, and the latter projects each of the candidate entities onto this target vector to get a validity score. Although ProjE performs well, it does not model relation prediction.

Differently, the newly developed R-GCN method [25] introduces Relational Graph Convolutional Networks (R-GCN) to model triples in a pipeline manner, with R-GCN as its encoder and DistMult [36] as its decoder. The R-GCN encoder propagates information across the edges of the KG and uses the hidden states of the last neural network layer as the embeddings of entities. The DistMult decoder then takes the embeddings of head and tail entities from the encoder, and computes scores with the embeddings of relations, which are randomly initialized, following the score function of DistMult. However, in the R-GCN method, the implicit interactions of the embeddings of entities and relations are not fully modelled, as they are captured only at the later decoder phase, which may affect the performance of prediction.

3 THE SENN METHOD

In this paper, KGC is cast into a triple classification task, and three fully-connected neural networks are employed to produce prediction-specific score functions, which are further used to evaluate the corresponding predictions directly. These prediction-specific score functions differentiate the SENN method from other existing ones, which usually adopt only one score function to evaluate whether or not a triple is valid.

In this section, we first outline the framework of the proposed SENN method, and then elaborate the three substructures corresponding to the prediction tasks of head entities, relations and tail entities, respectively. The details of model training are introduced subsequently.

3.1 The Framework of SENN

The proposed SENN method explicitly models the tasks of head entity prediction, relation prediction and tail entity prediction as three different substructures, denoted as head_pred, rel._pred and tail_pred, respectively, but integrates them into a unified framework, as illustrated in Figure 1. Note that, to save space, the substructures corresponding to rel._pred and tail_pred are only presented in a simplified manner, as they are similar to that of head_pred.

Figure 1: The framework of the proposed SENN method (LS indicates label smoothing).
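The skeleton of this framework can be sketched as follows. This is our own minimal PyTorch rendering under stated assumptions (the layer sizes, module names and the choice of PyTorch are ours, and each substructure is reduced to a single hidden layer), not the authors' implementation:

```python
import torch
import torch.nn as nn

class SENNSkeleton(nn.Module):
    """Shared embedding matrices A_E, A_R feeding three prediction substructures."""
    def __init__(self, num_entities: int, num_relations: int, k: int = 200):
        super().__init__()
        self.A_E = nn.Embedding(num_entities, k)    # shared entity embeddings
        self.A_R = nn.Embedding(num_relations, k)   # shared relation embeddings
        # One simplified substructure per task; each maps a 2k input to a k vector.
        self.head_pred = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(), nn.Linear(k, k))
        self.rel_pred = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(), nn.Linear(k, k))
        self.tail_pred = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(), nn.Linear(k, k))

    def forward(self, h_idx, r_idx, t_idx):
        h, r, t = self.A_E(h_idx), self.A_R(r_idx), self.A_E(t_idx)
        # Each substructure scores its own target against all entities/relations.
        p_h = torch.sigmoid(self.head_pred(torch.cat([r, t], dim=-1)) @ self.A_E.weight.T)
        p_r = torch.sigmoid(self.rel_pred(torch.cat([h, t], dim=-1)) @ self.A_R.weight.T)
        p_t = torch.sigmoid(self.tail_pred(torch.cat([h, r], dim=-1)) @ self.A_E.weight.T)
        return p_h, p_r, p_t            # the three prediction label vectors of Figure 1
```

Because all three substructures read from the same A_E and A_R, gradients from every prediction task update the same embedding matrices, which is exactly the sharing this framework relies on.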

As presented in the figure, SENN takes triple batches from the training set T as input. To be clear, in what follows, we illustrate the learning process of a triple (h, r, t). The embeddings h, t and r of h, t and r are looked up from the embedding matrix A_E ∈ R^{|E|×k} of entities and A_R ∈ R^{|R|×k} of relations, respectively. These two embedding matrices are shared among the three prediction tasks and are thus called shared embeddings.

Next, the embeddings h, r and t are passed to the three prediction tasks, with one of the three elements in the triple (h, r, t) as the prediction goal, i.e., (?, r, t), (h, ?, t) and (h, r, ?). These tasks are modelled individually, resulting in three similar substructures, which enable prediction-specific inference using different given information to target different goals (see Section 3.2). The prediction label vectors p_h of size |E|, p_r of size |R| and p_t of size |E|, as the outputs of the substructures head_pred, rel._pred and tail_pred, respectively, are used to generate losses with the corresponding smoothed target label vectors l̃_h of size |E|, l̃_r of size |R| and l̃_t of size |E|. The final loss of (h, r, t) is the weighted sum of these losses (see Section 3.3). Here, l̃_h, l̃_r and l̃_t are obtained by smoothing the target label vectors l_h of size |E|, l_r of size |R| and l_t of size |E|, respectively, which are derived from the training set T. For example, assume (h, r, t), (h′, r, t) and (h″, r, t) are all the valid triples in the training triple set with r and t as the common relation and tail entity. Then, for the head entity prediction task, the entries of the target label vector l_h corresponding to h, h′ and h″ are set to 1, and the other entries are set to 0.

Note that the weights of the losses corresponding to different predictions are adjusted adaptively to well handle triples with different mapping properties and to cope with different types of prediction tasks (see Section 3.3).

3.2 The Three Substructures

Since the three substructures are similar, in the following we present only the detailed architecture of the substructure head_pred corresponding to head entity prediction, as illustrated in Figure 2.

Figure 2: The architecture of the substructure head_pred corresponding to head entity prediction.

Head_pred concatenates the known r and t to get a combination vector of size 2k, i.e., double the embedding dimension. This vector is then used as the input of the first neural network layer. After layer-by-layer learning, the last neural network layer generates the prediction vector v_h of size k. Subsequently, to match against all the entities, v_h is multiplied with the embeddings of all the entities in A_E, and the result is further passed through sigmoid or softmax to get the prediction label vector p_h. This process is denoted as g(v_h A_E^T) in Figure 2, where g(·) is the sigmoid or softmax function.

Formally, the score function corresponding to head_pred(r, t) is defined as:

s(r, t) = v_h A_E^T = f(f(· · · f([r; t] W_{h,1} + b_{h,1}) · · ·) W_{h,n} + b_{h,n}) A_E^T,   (6)

where f(·) is a non-linear function, and the widely used Rectified Linear Unit (ReLU) [20], i.e., f(x) = max(0, x), is adopted in this paper; n is the number of neural network layers (hidden layers); {W_{h,1}, W_{h,2}, . . . , W_{h,n}} and {b_{h,1}, b_{h,2}, . . . , b_{h,n}} are the weight matrices and bias vectors of the neural network layers, respectively. These neural network layers imitate the prediction process of approaching the target h step by step. Here, we use the same difference, d, to gradually reduce the dimensions of the neural network layers. Since the input of the first neural network layer is a combination vector of size 2k, and the output of the last neural network layer is a prediction vector of size k, we get

d = ⌊(2k − k)/n⌋ = ⌊k/n⌋,   (7)

where ⌊·⌋ is the floor function. Therefore, the size of W_{h,1} is 2k × (2k − d), the size of W_{h,i} is (2k − (i − 1) × d) × (2k − i × d), and the size of W_{h,n} is (2k − (n − 1) × d) × k. Similarly, the sizes of b_{h,1}, b_{h,i} and b_{h,n} are (2k − d), (2k − i × d) and k, respectively.

The score function s(r, t) returns a score vector of size |E|, each entry of which indicates the degree to which the corresponding entity is a valid head entity. To normalize the entries, we apply sigmoid or softmax to s(r, t), as in ProjE [27], and thus obtain the following prediction label vector p_h:

p_h = g(s(r, t)).   (8)

In the task of predicting r given h and t, the process is similar to that of predicting h, but the concatenation layer is replaced by the combination of h and t. Similarly, in the task of predicting t given h and r, the concatenation layer is replaced by the combination of h and r. Therefore, following Equation (6), the score functions corresponding to these two prediction tasks/substructures, i.e., rel._pred(h, t) and tail_pred(h, r), are as follows, respectively:

s(h, t) = f(f(· · · f([h; t] W_{r,1} + b_{r,1}) · · ·) W_{r,n} + b_{r,n}) A_R^T,   (9)

s(h, r) = f(f(· · · f([h; r] W_{t,1} + b_{t,1}) · · ·) W_{t,n} + b_{t,n}) A_E^T,   (10)

where {W_{r,1}, W_{r,2}, . . . , W_{r,n}}, {W_{t,1}, W_{t,2}, . . . , W_{t,n}} and {b_{r,1}, b_{r,2}, . . . , b_{r,n}}, {b_{t,1}, b_{t,2}, . . . , b_{t,n}} are the weight matrices and bias vectors of the corresponding neural network layers, respectively; s(h, t) returns a score vector of size |R|, with its entries implying the degree to which the corresponding relations are valid; s(h, r) returns a score vector of size |E|, with its entries indicating the validity of the corresponding entities.

Similar to Equation (8), the prediction label vectors of these two prediction tasks/substructures are defined as follows, respectively:

p_r = g(s(h, t)),   (11)

p_t = g(s(h, r)).   (12)
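To make the layer-size schedule of Equation (7) concrete, here is a small helper (our own illustration, not the authors' code) that derives the shapes of W_{h,1}, . . . , W_{h,n} for a given embedding size k and depth n:

```python
def head_pred_layer_shapes(k: int, n: int):
    """Shapes (in_dim, out_dim) of the n fully-connected layers in head_pred."""
    d = k // n                                     # Eq. (7): d = floor(k / n)
    in_dims = [2 * k - i * d for i in range(n)]    # first layer takes [r; t] of size 2k
    out_dims = [2 * k - (i + 1) * d for i in range(n - 1)] + [k]  # last layer outputs v_h
    return list(zip(in_dims, out_dims))

# With the FB15k setting k = 200, n = 2 this gives [(400, 300), (300, 200)]:
# W_{h,1} is 400 x 300 and W_{h,2} is 300 x 200, matching the sizes stated after Eq. (7).
print(head_pred_layer_shapes(200, 2))
```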

3.3 Model Training

In the following, we present the loss function and the training process in detail.

3.3.1 The General Loss Function. As described above, given a triple (h, r, t), we get three prediction label vectors, i.e., p_h, p_r and p_t, for head entity prediction, relation prediction and tail entity prediction, respectively. On the other hand, these prediction tasks have their target label vectors l_h, l_r and l_t for (h, r, t), whose entries are defined as follows, respectively:

[l_h]_i = 1 if e_i ∈ I_h, and 0 otherwise,   (13)

[l_r]_i = 1 if r_i ∈ I_r, and 0 otherwise,   (14)

[l_t]_i = 1 if e_i ∈ I_t, and 0 otherwise,   (15)

where I_h is the set of valid head entities in the training set, given r and t; I_r and I_t are defined similarly; e_i is the entity in E corresponding to the i-th entry of l_h or l_t; r_i is the relation in R corresponding to the i-th entry of l_r.

Since the entries of the prediction label vectors are real values in [0, 1], we employ label smoothing [29] to regularize the target label vectors and thus get the following smoothed target label vectors:

l̃_h = (1 − α_h) l_h + α_h / |E|,   (16)

l̃_r = (1 − α_r) l_r + α_r / |R|,   (17)

l̃_t = (1 − α_t) l_t + α_t / |E|,   (18)

where α_h, α_r and α_t are the label smoothing parameters. In this paper we use the same label smoothing parameter α_e for head and tail entity predictions, i.e., α_h = α_t = α_e.

By comparing the prediction label vectors with the smoothed target label vectors, we get the following losses for the three prediction tasks/substructures in terms of binary cross-entropy:

L(p_h, l̃_h) = −(1/|E|) Σ_{i=1}^{|E|} ( [l̃_h]_i log[p_h]_i + (1 − [l̃_h]_i) log(1 − [p_h]_i) ),   (19)

L(p_r, l̃_r) = −(1/|R|) Σ_{i=1}^{|R|} ( [l̃_r]_i log[p_r]_i + (1 − [l̃_r]_i) log(1 − [p_r]_i) ),   (20)

L(p_t, l̃_t) = −(1/|E|) Σ_{i=1}^{|E|} ( [l̃_t]_i log[p_t]_i + (1 − [l̃_t]_i) log(1 − [p_t]_i) ).   (21)

Finally, the general loss function for the given triple (h, r, t) is defined as:

Loss(h, r, t) = L(p_h, l̃_h) + L(p_r, l̃_r) + L(p_t, l̃_t).   (22)

3.3.2 The Adaptively Weighted Loss Mechanism. The adaptively weighted loss mechanism is designed based on two considerations.

First, the triples in a KG can be divided into four categories, i.e., 1-TO-1, 1-TO-M, M-TO-1 and M-TO-M, according to the mapping properties of their relations [3]. Here, for example, 1-TO-M means that for this type of relation, one head entity may correspond to multiple tail entities. Similarly, given a pair of head entity and tail entity, there may also exist multiple relations. It is obvious that a prediction with only one valid entity/relation as its answer is more deterministic than a prediction with multiple valid entities/relations. In other words, the more valid entities/relations a prediction has in the training set, the less deterministic it is. Therefore, in the training process, we punish the model more severely if a relatively deterministic prediction is predicted wrongly; its loss should thus be weighted higher than that of a less deterministic one. Specifically, this idea is implemented in this paper as follows: for each triple (h, r, t), we relate the weights of the losses corresponding to head entity prediction, relation prediction and tail entity prediction to the numbers of valid entities/relations of the predictions, i.e., |I_h|, |I_r| and |I_t|. More precisely, the reciprocals of |I_h|, |I_r| and |I_t| are used to weight the corresponding losses.

Second, as aforementioned, it has been well verified that relation prediction performs better than entity prediction [3, 16]. To learn the relatively hard head and tail entity predictions better, we punish wrong predictions on head/tail entities more severely than those on relations. Specifically, we multiply the losses of head and tail entity predictions (Equations (19) and (21)) with an additional weight factor w (> 1).

Taking the above observations into consideration, the adaptively weighted loss mechanism yields the following updated losses for the three prediction tasks/substructures:

L̃(p_h, l̃_h) = (w / |I_h|) L(p_h, l̃_h),   (23)

L̃(p_r, l̃_r) = (1 / |I_r|) L(p_r, l̃_r),   (24)

L̃(p_t, l̃_t) = (w / |I_t|) L(p_t, l̃_t).   (25)

The final loss function for the triple (h, r, t) is consequently defined as follows:

L̃oss(h, r, t) = L̃(p_h, l̃_h) + L̃(p_r, l̃_r) + L̃(p_t, l̃_t) = (w / |I_h|) L(p_h, l̃_h) + (1 / |I_r|) L(p_r, l̃_r) + (w / |I_t|) L(p_t, l̃_t).   (26)

3.3.3 The Training Process. Algorithm 1 presents the training process of the SENN method.

Before the training process, the shared embedding matrices A_E and A_R are initialized in terms of the normal distribution N(0, std) [8] (Line 1), which is a widely used initialization scheme in neural networks.

During the training process, Lines 3-18 are iterated for nepoch times. In this paper, nepoch is set to 1000. In each epoch, |T|/β batches of triples are sampled from the training set. For each triple (h, r, t) in a batch, we get the embeddings of h, t and r from A_E and A_R, respectively, via look-up operations (Lines 7 and 8). They are then passed to the corresponding substructures to get three prediction vectors (Lines 9-11). Meanwhile, three target label vectors for these predictions are constructed. After label smoothing on them (Line 13), the loss is computed and added to the loss of the current batch (Lines 14 and 15). Finally, the parameter update is conducted in batch mode (Line 17).
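A compact NumPy sketch of this weighting scheme (our own illustration following Equations (16)-(26); the default values of α_e, α_r and w below are the WN18 settings reported in Section 5.2) is:

```python
import numpy as np

def smooth(l, alpha):
    """Label smoothing, Eqs. (16)-(18): l is a 0/1 target vector over |E| or |R|."""
    return (1 - alpha) * l + alpha / l.size

def bce(p, l_smooth):
    """Binary cross-entropy averaged over the vector, Eqs. (19)-(21)."""
    return -np.mean(l_smooth * np.log(p) + (1 - l_smooth) * np.log(1 - p))

def senn_loss(p_h, p_r, p_t, l_h, l_r, l_t, alpha_e=0.6, alpha_r=0.6, w=5.0):
    """Adaptively weighted loss of one triple, Eq. (26)."""
    n_h, n_r, n_t = l_h.sum(), l_r.sum(), l_t.sum()   # |I_h|, |I_r|, |I_t|
    return ((w / n_h) * bce(p_h, smooth(l_h, alpha_e))
            + (1 / n_r) * bce(p_r, smooth(l_r, alpha_r))
            + (w / n_t) * bce(p_t, smooth(l_t, alpha_e)))
```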

In order to stabilize and accelerate the training process, we apply batch normalization [11] after each neural network layer of the three substructures. In addition, the dropout mechanism [28] is adopted to alleviate the potential overfitting problem by dropping out concatenated embeddings and hidden layer units after each neural network layer.

Algorithm 1 The training process of the SENN method.
Input: Training triple set T, entity set E, relation set R, embedding dimension k, number of epochs nepoch, batch size β.
Output: A_E and A_R, as well as the three substructures.
1: Initialize A_E ← N(0, std); //std = 1/√(|E| + k)
   Initialize A_R ← N(0, std); //std = 1/√(|R| + k)
2: for i = 1 to nepoch do
3:   for j = 1 to |T|/β do
4:     S ← sample(T, β); //sample β training triples
5:     L̃oss ← 0;
6:     for ∀(h, r, t) ∈ S do
7:       h and t ← look-up(A_E, h, t); //look up h and t
8:       r ← look-up(A_R, r); //look up r
9:       p_h ← head_pred(r, t);
10:      p_r ← rel._pred(h, t);
11:      p_t ← tail_pred(h, r);
12:      l_h, l_r and l_t ← construct target label vectors, following Equations (13), (14) and (15);
13:      l̃_h, l̃_r and l̃_t ← smooth l_h, l_r and l_t, following Equations (16), (17) and (18);
14:      L̃oss(h, r, t) ← compare p_h, p_r and p_t with l̃_h, l̃_r and l̃_t respectively, following Equation (26);
15:      L̃oss ← L̃oss + L̃oss(h, r, t);
16:    end for
17:    Update the embeddings of entities and relations in S (i.e., the related rows of A_E and A_R), as well as the weight matrices and bias vectors of the substructures via ∇L̃oss;
18:  end for
19: end for

4 THE SENN+ METHOD

In the previous section, we biased SENN towards the relatively hard head and tail entity predictions, against relation prediction, by assigning higher weights to the corresponding losses in the training process. We believe that the rather good performance of relation prediction can be further utilized to assist head and tail entity predictions in the test process.

The test process of head/tail entity prediction of a triple in SENN simply sorts the entries of the prediction label vector from the head_pred/tail_pred substructure in descending order and checks whether the valid head/tail entity is ranked in the top region of the ranking list. In the following, we extend SENN to SENN+ by proposing a relation-aided test mechanism, which replaces the above prediction label vectors for head and tail entity predictions with more informative ones before sorting. Note that SENN+ shares the same training process with SENN; they differ only in the test process.

4.1 The Relation-Aided Test Mechanism

First of all, let's present the principle behind this mechanism. The mechanism holds for both head and tail entity predictions in the test process. In what follows, we take head entity prediction as an example for illustration. Given a head entity prediction task (?, r, t), assume that h is a valid head entity. Since the performance of relation prediction is better than that of entity prediction, if we employ the SENN method to predict the relation between h and t, i.e., carry out the relation prediction task (h, ?, t), relation r should most probably have a prediction label higher than other relations and should thus be ranked higher than the others. Otherwise, h is probably not a valid head entity of the task (?, r, t). With the relation-aided test mechanism, we adopt the prediction label and the rank of relation r to help verify the validity of the head entity h in the task (?, r, t).

Next, we present the relation-aided test mechanism more formally. Given the head entity prediction task (?, r, t), we obtain the prediction label vector p_h through the head_pred substructure. Besides p_h, we compute two additional relation-aided vectors, i.e., q_h and q̂_h, corresponding to p_h as follows.

Let e_i be the entity in E corresponding to the i-th entry of p_h. Taking e_i as a head entity and t as the given tail entity, we predict their relation through the rel._pred substructure and get the relation prediction label vector g(s(e_i, t)). Then, [q_h]_i and [q̂_h]_i are defined as:

[q_h]_i = Value(g(s(e_i, t)), r),   (27)

[q̂_h]_i = 1 / Rank(g(s(e_i, t)), r),   (28)

where Value(x, r) returns the entry of vector x corresponding to relation r; Rank(x, r) returns the rank of the entry of vector x corresponding to relation r in descending order. These two relation-aided values indicate the degree to which e_i, as a head entity, has relation r with tail entity t, and a valid head entity is expected to have high values.

Similarly, we also compute such relation-aided vectors q_t and q̂_t, corresponding to the prediction label vector p_t, for the test of each tail entity prediction.

4.2 The Final Prediction Vectors

Since for a valid head/tail entity, the prediction label from the corresponding substructure, its reciprocal rank, and the two relation-aided values are all expected to be large, we define the final prediction vector p̃_h for head entity prediction and p̃_t for tail entity prediction in the test process of SENN+ as follows:

p̃_h = p_h ⊙ p̂_h ⊙ q_h ⊙ q̂_h,   (29)

p̃_t = p_t ⊙ p̂_t ⊙ q_t ⊙ q̂_t,   (30)

where ⊙ is the element-wise multiplication operator; p̂_h and p̂_t are the reciprocal rank vectors corresponding to p_h and p_t, each entry of which is defined as follows, respectively:

[p̂_h]_i = 1 / Rank(p_h, e_i),   (31)

[p̂_t]_i = 1 / Rank(p_t, e_i).   (32)
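The following NumPy sketch is our own reading of Equations (27)-(31) for a single head entity prediction task (?, r, t); rel_pred stands in for the trained rel._pred substructure and is a hypothetical callable here:

```python
import numpy as np

def rank_desc(scores, idx):
    """1-based rank of entry idx when scores are sorted in descending order."""
    return int((scores > scores[idx]).sum()) + 1

def relation_aided_head_scores(p_h, r, t, rel_pred):
    """Final prediction vector of Eq. (29) for the head prediction task (?, r, t)."""
    q_h = np.empty_like(p_h)
    q_hat_h = np.empty_like(p_h)
    for i in range(p_h.size):                 # every candidate head entity e_i
        p_r = rel_pred(i, t)                  # label vector over all relations
        q_h[i] = p_r[r]                       # Eq. (27): Value(g(s(e_i, t)), r)
        q_hat_h[i] = 1.0 / rank_desc(p_r, r)  # Eq. (28): reciprocal rank of r
    p_hat_h = np.array([1.0 / rank_desc(p_h, i) for i in range(p_h.size)])  # Eq. (31)
    return p_h * p_hat_h * q_h * q_hat_h      # Eq. (29): element-wise product
```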

5 EXPERIMENTS

In this section, we evaluate the proposed SENN and SENN+ methods on entity prediction and relation prediction. We also verify the effectiveness of the shared embeddings and the adaptively weighted loss mechanism.

5.1 Datasets and Metrics

5.1.1 Datasets. In our experiments, we choose two widely used benchmark datasets, WN18 and FB15k [3], which are subsets of WordNet and Freebase, respectively. The detailed statistics of these two datasets are shown in Table 1, where #Train, #Valid and #Test are the sizes of the training set, the validation set and the test set, respectively.

Table 1: The statistics of the two datasets, WN18 and FB15k.

Dataset   |R|     |E|      #Train    #Valid   #Test
WN18      18      40,943   141,442   5,000    5,000
FB15k     1,345   14,951   483,142   50,000   59,071

5.1.2 Metrics. The popular Mean Reciprocal Rank (MRR, the average of the reciprocal ranks of the test entities/relations over all test triples) and Hits@N (the proportion of the test entities/relations over all test triples ranked in the top-N positions) are used as metrics for comparison. We do not adopt mean rank (the average of the ranks of the test entities/relations over all test triples) as a metric, since it is sensitive to outliers [22]. For the chosen two types of metrics, the higher the value of MRR (respectively, Hits@N), the better the prediction performance. In the test process, apart from the test head entity/relation/tail entity itself, other entities/relations that can form valid triples (existing in the training/validation/test set) with the two given elements of the test triple are discarded when ranking all the entities/relations in E/R.

5.2 Baselines and Experiment Settings

5.2.1 Baselines. We adopt three categories of baselines, i.e., tensor/matrix based methods: RESCAL [23] and ComplEx [30]; translation based methods: TransE [3], TransR [17], DistMult [36], HolE [22], ANALOGY [18], DKRL [34], TKRL [35] and PTransE [16]; and neural network based methods¹: ER-MLP [5] and R-GCN [25]. They are all representative or state-of-the-art.

5.2.2 Experiment Settings. We select the parameters of the developed methods via grid search. Their ranges are as follows: the embedding size k in {100, 200}, the number of hidden layers n in {2, 4, 5} (the three substructures share the same n), the batch size β in {64, 128, 256}, the dropout rate λ of concatenated embeddings in {0.0, 0.1, 0.2}, the dropout rates of hidden layer units γ_e for head and tail entity predictions (we use the same parameters for head and tail entity predictions) and γ_r for relation prediction in {0.0, 0.1, 0.3, 0.5}, the label smoothing parameters α_e and α_r in {0.0, 0.1, . . . , 0.7}, the additional weight factor of losses w in {1, 5, 10, . . . , 30}, and the function g in {sigmoid, softmax}. Adam [15] is used as the optimizer with the default learning rate of 0.001. The finally adopted optimal settings are: k = 200, n = 4, β = 128, λ = 0.0, γ_e = 0.0, γ_r = 0.0, α_e = 0.6, α_r = 0.6, w = 5, g = sigmoid for WN18, and k = 200, n = 2, β = 128, λ = 0.1, γ_e = 0.0, γ_r = 0.3, α_e = 0.6, α_r = 0.6, w = 25, g = softmax for FB15k.

5.3 KGC Experiment

5.3.1 Entity Prediction. In order to demonstrate the performance of the developed methods on entity prediction, we report in Table 2 the experimental results compared with the representative baseline methods on fine-grained evaluation metrics, i.e., MRR, Hits@1, Hits@3 and Hits@10. Note that in Table 2, the experimental results of the baselines are taken from the corresponding references.

It can be seen from Table 2 that on both datasets, SENN consistently outperforms all the baselines in terms of all four metrics, which indicates the effectiveness and superiority of SENN. SENN+ further improves the performance, especially in terms of Hits@1, and achieves the best results. In more detail, on WN18, although the baselines already perform excellently, SENN performs better than all of them, and SENN+ is slightly better than SENN. This is because the relation set of WN18 is small, so the superiority of the relation-aided mechanism cannot be fully demonstrated and plays a limited role in the performance increase. On FB15k, the superiority of both SENN and SENN+ is relatively significant: SENN+ increases the performance by 2.0% on MRR and 2.5% on Hits@1, as compared to the best baseline ANALOGY [18]. In general, the effectiveness and superiority of the proposed methods comes from two sources. First, they unify the three types of prediction tasks into the neural network based framework with shared embeddings, through which the three prediction tasks can share information and interact implicitly with each other; second, they consider the differences among these tasks more explicitly than other methods, via the three separate substructures. In addition, as SENN+ further uses relation prediction to assist head and tail entity predictions, it is not surprising that SENN+ beats all the baselines on both datasets.

In order to evaluate the performance of SENN and SENN+ on entity prediction in more detail, we select from Table 2 the top-3 baselines on FB15k in terms of all the metrics, i.e., ANALOGY [18], ComplEx [30] and DistMult [36], to conduct further experiments. The triples of FB15k are divided into four categories, i.e., 1-TO-1, 1-TO-M, M-TO-1 and M-TO-M, and Hits@10 is adopted as the metric. We use the source codes of ANALOGY, ComplEx and DistMult downloaded from GitHub² to carry out the experiments. The experimental results corresponding to the four relation categories are presented in Table 3.

It can be seen that both SENN and SENN+ demonstrate their strength on the entity prediction task for all four triple categories. Besides the two reasons discussed above, this is also because the adaptively weighted loss mechanism in SENN and SENN+ can well distinguish and learn the predictions of different mapping properties.

¹ We do not use ProjE [27] as a baseline, since the results therein are based on erroneous code, as pointed out by the authors at https://github.com/bxshi/ProjE/issues/3.
² The source code of ANALOGY is from https://github.com/quark0/ANALOGY. The source codes of ComplEx and DistMult are from https://github.com/ttrouill/complex.
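The filtered ranking protocol described in Section 5.1.2 can be sketched as follows (our own illustration, not the authors' evaluation script): candidates that also form valid triples in the training/validation/test sets are removed before the rank of the gold answer is read off, and MRR and Hits@N are averaged over all test queries.

```python
import numpy as np

def filtered_rank(scores, gold, other_valid):
    """1-based rank of the gold candidate after discarding other valid answers.
    scores: score/label vector over all candidate entities or relations;
    other_valid: indices of competing candidates that are also correct."""
    s = scores.copy()
    s[list(other_valid)] = -np.inf
    return int((s > s[gold]).sum()) + 1

def mrr_and_hits(ranks, n=10):
    """MRR and Hits@N over a list of per-query ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return (1.0 / ranks).mean(), (ranks <= n).mean()
```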

Table 2: Experimental results of entity prediction in terms of MRR and Hits@{1, 3, 10}.

Method     WN18: MRR  Hits@1  Hits@3  Hits@10    FB15k: MRR  Hits@1  Hits@3  Hits@10
RESCAL     0.890  0.842  0.904  0.928            0.354  0.235  0.409  0.587
TransE     0.495  0.113  0.888  0.943            0.463  0.297  0.578  0.749
TransR     0.605  0.335  0.876  0.940            0.346  0.218  0.404  0.582
ER-MLP     0.712  0.626  0.775  0.863            0.288  0.173  0.317  0.501
DistMult   0.822  0.728  0.914  0.936            0.654  0.546  0.733  0.824
HolE       0.938  0.930  0.945  0.949            0.524  0.402  0.613  0.739
ComplEx    0.941  0.936  0.945  0.947            0.692  0.599  0.759  0.840
R-GCN      0.814  0.686  0.928  0.955            0.651  0.541  0.736  0.825
ANALOGY    0.942  0.939  0.944  0.947            0.725  0.646  0.785  0.854
SENN       0.947  0.941  0.950  0.956            0.737  0.659  0.792  0.870
SENN+      0.947  0.942  0.950  0.956            0.745  0.671  0.796  0.871

Table 3: Experimental results of entity prediction on FB15k by mapping properties of the relations (Hits@10).

Method     Head entity prediction: 1-TO-1  1-TO-M  M-TO-1  M-TO-M    Tail entity prediction: 1-TO-1  1-TO-M  M-TO-1  M-TO-M
DistMult   0.829  0.880  0.567  0.852                                0.832  0.672  0.817  0.866
ComplEx    0.792  0.912  0.587  0.864                                0.786  0.697  0.839  0.881
ANALOGY    0.863  0.905  0.577  0.867                                0.857  0.704  0.850  0.886
SENN       0.899  0.961  0.634  0.876                                0.890  0.717  0.954  0.905
SENN+      0.910  0.961  0.638  0.877                                0.892  0.722  0.953  0.906

Specifically, in the training process, with the adaptively weighted loss mechanism, the losses of head and tail entity predictions are related to the numbers of valid entities of the predictions. Therefore, SENN and SENN+ are able to handle the predictions on both the 1-side and the M-side well. Moreover, SENN+ performs better on the predictions on the M-side than on the 1-side, as compared to SENN. That is, the use of relation prediction to assist head and tail entity predictions, i.e., the relation-aided test mechanism in SENN+, is more effective at picking out more valid entities.

5.3.2 Relation Prediction. As shown in Table 1, the number of relation types on WN18 is only 18. Thus, the relation prediction task on WN18 is relatively simple, as compared to that on FB15k. Therefore, as in the recent works TKRL [35] and PTransE [16], we report the experimental results only on FB15k. Since Hits@10 of this task usually reaches 1, we adopt only Hits@1 as the metric. Table 4 presents the corresponding experimental results. Since many methods do not conduct the relation prediction task, Table 4 contains fewer methods, as compared to Table 2.

Table 4: Experimental results of relation prediction on FB15k.

Method                  Hits@1
TransE                  0.872
TransR                  0.916
DKRL(CBOW)              0.827
DKRL(CNN)               0.890
TKRL(WHE+STC)           0.906
TKRL(RHE+STC)           0.907
TKRL(WHE)               0.925
TKRL(RHE)               0.928
PTransE(ADD, 2-step)    0.936
PTransE(RNN, 2-step)    0.932
PTransE(ADD, 3-step)    0.940
SENN/SENN+              0.947

Again, the experimental results in Table 4 prove the superiority of SENN and SENN+. This indicates that SENN and SENN+ can capture the implicit information interaction among the different types of prediction tasks, as well as prediction-specific information, to obtain better prediction performance.

5.4 The Effectiveness of the Shared Embeddings

In order to evaluate the effectiveness of the shared embeddings in the proposed methods, we compare SENN with its variant without shared embeddings, denoted as SENN(no SE). Without shared embeddings, SENN(no SE) conducts the three types of prediction tasks independently, resulting in three separate models, whose parameters are retuned, respectively. The experimental results of entity prediction on both WN18 and FB15k, and relation prediction on FB15k, are presented in Figure 3, where MRR, Hits@1, Hits@3 and Hits@10 marked on the x-axes denote the experimental results on entity prediction in terms of these metrics, respectively, while rel_Hits@1 denotes the experimental results on relation prediction in terms of Hits@1.

Figure 3: Experimental comparison between SENN and SENN(no SE) on WN18 and FB15k.

Figure 4: Experimental comparison between SENN and SENN(no AW) on WN18 and FB15k.

In Figure 3, it is obvious that SENN performs much better than SENN(no SE) on both benchmark datasets, which indicates the effectiveness and necessity of the shared embeddings in SENN. On the one hand, the shared embeddings enable the three types of prediction tasks to share information and implicitly interact with each other, so as to capture more information for better performance; on the other hand, the shared embeddings enable the embeddings of entities and relations in SENN to receive more updates than in the separate models. Thus, the embeddings of entities and relations in SENN are learned more sufficiently than those in SENN(no SE), which is beneficial to the predictions.

5.5 The Effectiveness of the Adaptively Weighted Loss Mechanism

To evaluate the impact of the adaptively weighted loss mechanism on the performance of SENN and SENN+, we compare SENN with its variant without the adaptively weighted loss mechanism, denoted as SENN(no AW). In SENN(no AW), the weights of head entity prediction, relation prediction and tail entity prediction are all set to 1, and it uses the general loss function of Equation (22). To make the comparison fair, we keep all the other parameters of SENN(no AW) the same as those in SENN. The experimental results of entity prediction on both WN18 and FB15k, and relation prediction on FB15k, are presented in Figure 4. As in Figure 3, Hits@1, Hits@3 and Hits@10 in Figure 4 denote the experimental results on entity prediction in terms of these metrics, whilst rel_Hits@1 denotes the experimental results on relation prediction in terms of Hits@1.

We can see from Figure 4 that SENN also outperforms SENN(no AW), which verifies the effectiveness of the adaptively weighted loss mechanism in SENN. In more detail, on WN18, SENN is slightly better than SENN(no AW). This is because SENN(no AW) already performs very well and there is little room for improvement, so it is difficult to further boost the performance.

This is also consistent with the small optimal value of the additional weight factor of losses on WN18. However, on FB15k, the performance increase is relatively significant, especially in terms of MRR and Hits@1. This is because entity prediction is much harder than relation prediction on FB15k (see the experimental results in Section 5.3 in terms of Hits@1). Thus, biasing the model towards entity prediction makes a difference; it even improves the performance of relation prediction in turn.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a novel neural network based Knowledge Graph Completion (KGC) method, called SENN. By adopting shared embeddings of entities and relations, SENN unifies the prediction tasks of head entities, relations and tail entities into a neural network based framework. The shared embeddings encourage implicit interactions among these three types of prediction tasks, and the three substructures corresponding to these tasks enable SENN to capture different information for different tasks. The adaptively weighted loss mechanism further equips it with the ability to well handle triples with different mapping properties and to deal with different prediction tasks. Furthermore, SENN is extended to SENN+ by utilizing relation prediction to promote entity prediction. Experimental results demonstrate the merits and superiority of the proposed SENN and SENN+ methods. The shared embeddings and the adaptively weighted loss mechanism are also shown to be effective.

For future work, we will adopt more expressive neural networks, such as RNNs and LSTMs, to learn more favorable information and achieve even better prediction performance. Moreover, since in this paper we use only the triples in a KG to conduct prediction, in the future we will explore the use of additional information, such as path features and entity descriptions, to further improve the developed models.

ACKNOWLEDGMENTS

The work is supported by the National Key Research and Development Program of China under grants 2016YFB1000902, 2017YFC0820404, and 2017YFB1002302, and the National Natural Science Foundation of China under grants 61772501, 61572473, 61572469, and 91646120.

REFERENCES
[1] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. 2016. Interaction networks for learning about objects, relations and physics. In Proceedings of NIPS. 4502–4510.
[2] Phil Blunsom, Edward Grefenstette, and Nal Kalchbrenner. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.
[3] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS. 2787–2795.
[4] Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, and Wei Wang. 2017. KBQA: Learning question answering over QA corpora and knowledge bases. Proceedings of the VLDB Endowment 10, 5 (2017), 565–576.
[5] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of KDD. ACM, 601–610.
[6] Alberto García-Durán, Antoine Bordes, and Nicolas Usunier. 2015. Composing relationships with translations. In Proceedings of EMNLP. 286–290. https://doi.org/10.18653/v1/D15-1034
[7] Alberto Garcia-Duran, Antoine Bordes, Nicolas Usunier, and Yves Grandvalet. 2016. Combining two and three-way embedding models for link prediction in knowledge bases. Journal of Artificial Intelligence Research 55 (2016), 715–742.
[8] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS. 249–256.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR. 770–778.
[10] Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. 2015. Learning to represent knowledge graphs with gaussian embedding. In Proceedings of CIKM. 623–632.
[11] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML. 448–456.
[12] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of ACL. 687–696.
[13] Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2016. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of AAAI. 985–991.
[14] Yantao Jia, Yuanzhuo Wang, Hailun Lin, Xiaolong Jin, and Xueqi Cheng. 2016. Locally adaptive translation for knowledge graph embedding. In Proceedings of AAAI. 992–998.
[15] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[16] Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015. Modeling relation paths for representation learning of knowledge bases. In Proceedings of EMNLP. 705–714.
[17] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI. 2181–2187.
[18] Hanxiao Liu, Yuexin Wu, and Yiming Yang. 2017. Analogical inference for multi-relational embeddings. In Proceedings of ICML, Vol. 70. 2168–2178.
[19] Denis Lukovnikov, Asja Fischer, Jens Lehmann, and Sören Auer. 2017. Neural network-based question answering over knowledge graphs on word and character level. In Proceedings of WWW. 1211–1220.
[20] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of ICML. 807–814.
[21] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. 2016. STransE: A novel embedding model of entities and relationships in knowledge bases. In Proceedings of NAACL-HLT. 460–466.
[22] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. In Proceedings of AAAI. 1955–1961.
[23] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of ICML. 809–816.
[24] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Proceedings of NIPS. 4967–4976.
[25] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103 (2017).
[26] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of WWW. 373–374.
[27] Baoxu Shi and Tim Weninger. 2017. ProjE: Embedding projection for knowledge graph completion. In Proceedings of AAAI. 1236–1242.
[28] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[29] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of CVPR. 2818–2826.
[30] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of ICML, Vol. 48. 2071–2080.
[31] Zhigang Wang and Juanzi Li. 2016. Text-enhanced representation learning for knowledge graph. In Proceedings of IJCAI. 1293–1299.
[32] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of AAAI. 1112–1119.
[33] Zhuoyu Wei, Jun Zhao, Kang Liu, Zhenyu Qi, Zhengya Sun, and Guanhua Tian. 2015. Large-scale knowledge base completion: Inferring via grounding network sampling over selected instances. In Proceedings of CIKM. 1331–1340.
[34] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI. 2659–2665.
[35] Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016. Representation learning of knowledge graphs with hierarchical types. In Proceedings of IJCAI. 2965–2971.
[36] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of ICLR.
[37] Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of ACL. 956–966.
[38] Yuanzhe Zhang, Shizhu He, Kang Liu, and Jun Zhao. 2016. A joint model for question answering over multiple knowledge bases. In Proceedings of AAAI. 3094–3100.
