
Data Mining and Knowledge Discovery
https://doi.org/10.1007/s10618-019-00653-z

A semi-supervised model for knowledge graph embedding

Jia Zhu1,2 · Zetao Zheng1 · Min Yang3 · Gabriel Pui Cheong Fung4 · Yong Tang1

Received: 13 October 2018 / Accepted: 5 September 2019

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer
Nature 2019

Abstract

Knowledge graphs have shown increasing importance in broad applications such as question answering, web search, and recommendation systems. The objective of knowledge graph embedding is to encode both entities and relations of knowledge graphs into continuous low-dimensional vector spaces to perform various machine learning tasks. Most of the existing works only focus on the local structure of knowledge graphs when utilizing the structural information of entities, which may not faithfully preserve the global structure of knowledge graphs. In this paper, we propose a semi-supervised model that adopts graph convolutional networks to utilize both the local and global structural information of entities. Specifically, our model takes the textual information of each entity into consideration as entity attributes in the process of learning. We show the effectiveness of our model by applying it to two traditional tasks for knowledge graphs: entity classification and link prediction. Experimental results on two well-known corpora reveal the advantages of this model compared to state-of-the-art methods on both tasks. Moreover, the results show that even with only 1% labeled data for training, our model can still achieve good performance.

Keywords Knowledge graph · Deep learning · Graph convolutional networks

Responsible editor: Shuiwang Ji.

Corresponding authors: Jia Zhu (jzhu@m.scnu.edu.cn), Yong Tang (ytang@m.scnu.edu.cn)

1 School of Computer Science, South China Normal University, Guangzhou, China
2 Guangzhou Key Laboratory of Big Data and Intelligent Education, Guangzhou, China
3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
4 Department of SEEM, The Chinese University of Hong Kong, Hong Kong, China


Fig. 1 A sample of knowledge graph (original figure from Google)

1 Introduction

Knowledge graphs (KGs) integrate the structural information of concepts across multiple information sources and link these concepts together, which provides crucial resources for broad applications such as question answering, web search, and recommendation systems (Szumlanski and Gomez 2010; Wang and Li 2016; Xiao et al. 2016; Zhu et al. 2015, 2016). A typical KG usually describes knowledge as multi-relational data represented as triple facts (head entity, relation, tail entity), denoted as (h, r, t), indicating the relation between two entities.

Graph embedding has been widely studied and proved to improve the performance of many graph-based applications (Perozzi et al. 2014; Tang et al. 2015; Wang et al. 2016; Huang et al. 2018). It learns low-dimensional representations of vertices in a graph to capture and preserve the graph structure. Though a few graph embedding methods have been proposed for KG embedding recently, it remains challenging for entities with few or no facts (Ji et al. 2016; Wang et al. 2017; Xie et al. 2016). The main problem of existing methods is that the structural information they focus on is, in fact, the structure-based representation of head and tail, which means they only consider local structure but not global structure. However, a good structure-based representation of an entity should jointly encode the information of both local and global structure. Figure 1 gives an illustration of the problem. Each entity has associated textual information, e.g., “Tom Hanks” and


“Meg Ryan”. The entity “Tom Hanks” acted in the movie “Sleepless in Seattle”, and the entity “Meg Ryan” is in fact also an actor in this movie. However, it is difficult to predict whether there is a link between “Meg Ryan” and “Sleepless in Seattle” because the global structural information of “Tom Hanks” and “Meg Ryan” in this KG cannot be fully exploited if we only consider the neighbor entities of “Tom Hanks” and “Meg Ryan”. According to this example, we can conclude that using only information from an entity’s neighbors is not sufficient to learn the embedding of a KG well enough to achieve good performance on specific tasks, e.g., link prediction.

In this paper, we propose a semi-supervised model by adopting graph convolutional networks (GCNs) (Defferrard et al. 2016; Kipf and Welling 2016, 2017) to utilize both the textual information and structural information of entities. The model not only takes the textual information of each entity into consideration to compute the first-order proximity between two vertices in a KG, but also computes the second-order proximity according to two vertices’ neighborhood structure, with a joint optimization mechanism to simultaneously preserve the local and global structure so that the embedding of a KG can be more deeply and properly learned.

We have conducted extensive experiments on two popular corpora. Experimental results on two classical tasks, entity classification and link prediction, show that our model outperforms state-of-the-art methods on all evaluation metrics with an impressive margin. The main technical contributions of this work are threefold:

1. We propose a novel semi-supervised model that integrates the textual information and structural information of entities by exploiting the first-order and the second-order proximity with a joint optimization mechanism to effectively learn the embedding of KGs. We design corresponding algorithms for the computation of the first-order and the second-order proximity to suit our needs.

2. For the computation of the first-order proximity, we first design a neighbor vertices selection algorithm by adopting DeepWalk (Perozzi et al. 2014) to select neighbor entities as attributes for each entity. We then combine these attributes with each entity’s textual information to obtain the information of local structure and description-based representations for head and tail. Besides, we utilize the results of the neighbor vertices selection algorithm to provide supervised information for the computation of the second-order proximity, because the selected neighbor vertices can act as labels for each entity.

3. For the computation of the second-order proximity, we design a GCNs based auto-encoder that can encode the global graph structure according to two vertices’ neighborhood structure. The novelty of our model is that we skillfully use the supervised information obtained from the computation of the first-order proximity to make the GCNs based auto-encoder learn the latent representations of the global graph structure more properly. Moreover, we propose a joint optimization mechanism to simultaneously preserve the local and global structure by adopting the common graph Laplacian regularization term loss function.

The rest of this paper is organized as follows. Section 2 introduces recent related work on knowledge graph embedding. Section 3 describes the technical details of our proposed model. Section 4 reports and discusses the experimental results based on two well-known corpora. Section 5 concludes this paper and discusses our future work.
2 Related work

In recent years, many researchers have shown great interest in learning the embeddings of KGs. For example, Bordes et al. (2013) interpreted relations as translating operations between head and tail entities in a low-dimensional vector space. Regarding the textual information of KGs, Socher et al. (2013) treated an entity as the average of its word embeddings in the entity name, allowing the sharing of textual information located in similar entity names. Wang et al. (2014) showed promising improvements in the accuracy of predicting facts by jointly embedding knowledge and text into the same space. Tang et al. (2015) studied the problem of embedding large-scale information networks into low-dimensional vector spaces. They first introduced the concept of the first-order and the second-order proximity with an objective function that preserves both the local and global network structures. We adopt their idea for KGs in our work. Zhong et al. (2015) extended their joint model and aligned knowledge and words in the entity textual information. However, these two works only align the two kinds of embeddings at the word level, which can lose some semantic information at the phrase or sentence level. Ji et al. (2019) proposed a fine-grained model called TransD, which uses two vectors to represent each entity and relation. TransD not only considers the diversity of relations but also of entities, with fewer parameters and no matrix-vector multiplication operations. Later, Xie et al. (2016) used a continuous bag-of-words model and a convolutional neural network to encode the semantics of entity textual information. Xu et al. (2017) proposed a deep architecture to learn a joint representation of both the structure and textual information of entities, which benefits the modeling of the meaning of an entity.

Besides entity representation, there are also a few works, e.g., Toutanova et al. (2015), Lao et al. (2012) and Lin et al. (2015), that map textual relations and knowledge base relations to the same vector space and obtain substantial improvements. Nickel et al. (2016) proposed a method called holographic embeddings (HOLE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory to create compositional representations. By using correlation as the compositional operator, HOLE can capture rich interactions while remaining highly efficient. Shi and Weninger (2018) proposed a knowledge completion model called ConMask, which can learn embeddings of an entity’s name and parts of its text description to connect unseen entities to the KG. They used relationship-dependent content masking and fully convolutional neural networks to extract relationship-dependent embeddings from the textual features of entities and relationships in KGs, apart from common entity embedding.

Moreover, a few GCNs based approaches have been proposed recently. Kipf and Welling (2017) introduced a model for semi-supervised classification on graph-structured data. The model uses an effective layer-wise propagation rule based on a first-order approximation of spectral convolutions on graphs, which provides theoretical support to our work. Wang et al. (2018) proposed an approach for cross-lingual KG alignment via GCNs. Though the goal of their work is KG alignment, their idea of using GCNs to learn embeddings from both the structural and attribute information of entities is similar to ours. Li et al. (2018) developed deeper insights into the GCN model and addressed its fundamental limits. The authors showed that a limitation of GCNs is the potential concern of oversmoothing when many convolutional layers are used. They then proposed both co-training and self-training approaches to train GCNs to overcome the problem. Velickovic et al. (2018) presented an architecture called graph attention networks (GATs), which leverages masked self-attentional layers to address the shortcomings of prior methods based on GCNs. The stacked layers of GATs can attend over neighborhood features, assigning different weights to different nodes in a neighborhood. Most recently, Yao et al. (2019) proposed a method called Text GCN for text classification based on Kipf and Welling (2017). The authors built a heterogeneous word-document graph for a whole corpus and turned document classification into a node classification problem.

Most of the existing methods mentioned above only concentrate on the structural information between two entities apart from textual information. In other words, their use of structural information considers the local structure of a KG but not the global structure. Additionally, though some methods consider global structure information, they are not designed specifically for KG embedding, or they do not use GCNs. The main difference between our approach and other GCNs based approaches is that we use the first-order and the second-order proximity to capture a KG’s local structure and global structure with a joint optimization mechanism, so that the textual and structural information of entities can be better learned.

3 Proposed model

Assume we have a graph G = (V , E ), where V represents a set of vertices in the graph, V = {v1,...,vn}. E
represents a set of edges in the graph, E = {e1,...,en}. Each edge is associated with two vertices. For a KG,
each vertex v represents an entity, and each edge e represents the relation between two entities.

As we mentioned earlier, our goal is to have an embedding model that can adequately capture the local
and global structure of a KG. Therefore, we first define the first-order proximity, which can specifically
characterize the local structure of a KG as shown in Definition 1.

Definition 1 First-order proximity in KGs: The first-order proximity describes the pairwise similarity between entities in KGs. For any pair of entities, if there is an edge between $v_i$ and $v_j$, either from $v_i$ to $v_j$ or from $v_j$ to $v_i$, which means there is a relation between $v_i$ and $v_j$, then there exists positive first-order proximity between $v_i$ and $v_j$. Otherwise, the first-order proximity between $v_i$ and $v_j$ is 0.

Fig. 2 Proposed semi-supervised model

According to this definition, we can easily see that the computation of the pairwise similarity between entities is the key to exploiting the first-order proximity. Unlike in other kinds of graphs, we need to specifically design a method to perform this computation because there is textual information attached to each vertex in a KG.

On the other hand, we also define the second-order proximity, which can specifically characterize the global structure of a KG, as shown in Definition 2.

Definition 2 Second-order proximity in KGs: The second-order proximity describes the pairwise similarity between entities’ neighborhood structure. Let $N_i$ and $N_j$ denote the set of neighbor vertices of $v_i$ and $v_j$; then the second-order proximity is determined by the similarity of $N_i$ and $N_j$.

From Definition 2, we know that the second-order proximity between two entities is high if two entities
share many common neighbors with the same relation information. The second-order proximity has
been demonstrated to be a useful metric to define the similarity of a pair of vertices, and can profoundly
enrich the relationship of vertices even if they are not linked by an edge (Liben-Nowell and Kleinberg
2007).

Based on the above discussion, we wrap everything up into a semi-supervised model, as shown in Fig. 2, in order to preserve both the local and global structure. The component for the first-order proximity computation is designed for pairwise similarity calculation between two vertices based on DeepWalk (Perozzi et al. 2014), which can provide supervised information to the second-order proximity computation. The second-order proximity computation is unsupervised and designed for pairwise similarity calculation between two vertices’ neighborhoods using a structure reconstruction process and GCNs. We will introduce more details in the following sections.


3.1 Computation for the first-order proximity

As we described earlier, there is textual information attached to each entity in a KG. Therefore, to
compute the pairwise similarity between entities, we not only need to consider the local structure but
also need to consider the string similarity of textual information between entities.

To properly obtain the information about the local structure of an entity, we need to know which neighbors are important or similar to it. Therefore, we propose a method that selects entities from the neighbors of each entity based on the vectors generated by DeepWalk (Perozzi et al. 2014). DeepWalk has proved successful in social network and graph analysis. We have made a modified version to support both directed and undirected graphs. It learns the latent representations by modeling a stream of short random walks and then encodes them in a continuous vector space with low dimensions.

Let $G = (V, E)$ be a graph and $v \in V$ an entity. $H$ is the set of neighbor entities of $v$, and $h_i \in H$ is a neighbor entity of $v$. We then use the Euclidean metric on the vectors generated by DeepWalk to compute the closeness score $Score_{v,h_i}$ between $v$ and each $h_i$. A higher closeness score means the neighbor entity $h_i$ has higher similarity to $v$. Lastly, we keep those neighbor entities whose closeness score is higher than the average closeness score of the neighbor entities as the attributes $O_{ne}$ of $v$. The neighbor entities selection process is described in Algorithm 1; it helps us identify essential entities and provides supervised information to the computation for the second-order proximity later, because we will also use these neighbor entities as labels. Regarding the textual information, we remove all stop words from the raw texts and adopt the bag-of-words (BOW) model with the classical TF–IDF to select the top K keywords as the attributes $O_t$ for each entity. We then concatenate $O_t$ and $O_{ne}$ together to get the attributes $O_e = O_{ne} \oplus O_t$ for each entity/vertex. We use $O_e$ as the feature vector to exploit the first-order proximity and refine the representations in the latent space to constrain the similarity of a pair of vertices. Besides, $O_{ne}$ can provide supervised information to the second-order proximity computation because we use the $O_{ne}$ of each entity as its labels.


3.2 Computation for the second-order proximity

The second-order proximity refers to how similar the neighborhood structures of a pair of vertices are. Thus, to model the second-order proximity, it is required to model the neighborhood of each vertex. Given a graph $G = (V, E)$, we can obtain its adjacency matrix $M$; for each cell $m_{i,j}$ in $M$, $m_{i,j} > 0$ if and only if there exists a link between $v_i$ and $v_j$. $M$ provides the information about the neighborhood structure of each vertex. In this section, we introduce how we design a GCNs based auto-encoder to preserve the second-order proximity of $G$.

GCNs (Defferrard et al. 2016; Kipf and Welling 2016, 2017) can make use of latent variables and are capable of learning interpretable latent representations for a graph. However, many existing GCNs based models are only applicable to undirected graphs, which is not suitable for directed KGs. Therefore, in our GCNs model, we handle this situation by converting the KG into an undirected bipartite graph with additional nodes that represent the relations in the original graph. In other words, the structure of the original KG is reconstructed. For example, given a triplet (e1, r, e2), we assign separate relation vertices r1 and r2 for this triplet as (e1, r1) and (e2, r2).
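As a concrete illustration of this reconstruction, the following sketch rewrites each triple into two entity-relation edges; the `#head#`/`#tail#` naming of the split relation vertices is our own invention for illustration, not the authors' scheme.

```python
def to_bipartite(triples):
    """Rewrite each triple (e1, r, e2) as two edges via separate relation
    vertices r1 and r2, turning the directed KG into an undirected
    bipartite graph of entity vertices and relation vertices."""
    edges, relation_vertices = [], []
    for idx, (e1, r, e2) in enumerate(triples):
        r1 = f"{r}#head#{idx}"  # hypothetical id for the head-side relation vertex
        r2 = f"{r}#tail#{idx}"  # hypothetical id for the tail-side relation vertex
        relation_vertices += [r1, r2]
        edges += [(e1, r1), (e2, r2)]
    return edges, relation_vertices

edges, rel_vertices = to_bipartite(
    [("Tom Hanks", "Acted in", "Sleepless in Seattle")])
print(edges)
# [('Tom Hanks', 'Acted in#head#0'), ('Sleepless in Seattle', 'Acted in#tail#0')]
```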

Each entity vertex is described by a feature vector as mentioned in Sect. 3.1, and every relation vertex is assigned a unique one-hot representation. Note that when we perform the convolution operation, we always want to obtain the latent representation of each entity precisely. Therefore, we use the selected neighbor entities $O_{ne}$ as labels for each entity to provide a constraint on the output of the GCNs model, which means each entity will be assigned one or more labels when constructing the training dataset. We set the text representation of each relation vertex to be its own text plus the text of the directly connected entity. If we take “Tom Hanks”, “Acted in”, “Sleepless in Seattle” in Fig. 1 as an example, then the text representation of relation vertex r1 is “Tom Hanks Acted in”, and the text representation of relation vertex r2 is “Acted in Sleepless in Seattle”. Note that we do not consider the semantic information, because the order of the text representation of each relation vertex does not matter owing to the characteristics of our approach. For the setting of the one-hot representation, we first put all words from the textual information of each entity and each relation into a table $T$. Every word in $T$ has a number $i \in \{1, \dots, |T|\}$, and the one-hot representation of each word is a vector of length $|T|$. The $i$-th element of this vector is 1, and the others are 0.
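A minimal sketch of the word table $T$ and the one-hot encoding described above might look as follows; the relation-vertex texts are taken from the running example, and everything else (function names, splitting on whitespace) is an assumption of ours.

```python
def build_vocab(texts):
    """Table T: every word across the entity and relation texts gets an index."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """A |T|-length vector whose i-th element is 1 for the i-th word."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

# relation vertex text = relation text plus the text of the directly connected entity
r1_text = "Tom Hanks Acted in"             # head-side relation vertex
r2_text = "Acted in Sleepless in Seattle"  # tail-side relation vertex
vocab = build_vocab([r1_text, r2_text])
print(one_hot("Hanks", vocab))
```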

Based on this newly constructed graph, we use the feature vector that represents each entity vertex and the unique one-hot representation of each relation vertex as the input channels of the GCNs. The relation vertices indicate how many neighbors two entity vertices share with the same relation information. After encoding with l convolutional layers, we obtain a representation learned from the graph that includes the information of both entity vertices and relation vertices.

The pre-processing architecture of our GCNs for a KG is shown in Fig. 3. The vertex with grey color is the entity vertex, and the vertex with red color is the relation vertex. We first read the entity vertex attributes as channels, and then we construct a set of neighborhood graphs to rank each entity vertex, which means an entity vertex is assigned a higher value if its neighborhood graph has more vertices. A neighborhood graph is a subgraph consisting of a vertex and its neighbor vertices.


Fig. 3 The pre-processing architecture of our GCNs

We use the Weisfeiler-Lehman graph kernel algorithm (Vries 2013) to perform the construction because of its advantage in computation time. The vertex with blue color is the vertex with the highest ranking value in a certain range, and it is used as the centroid to construct a neighborhood graph. Regarding the normalization, we use the formula $\hat{M} = \tilde{S}^{-\frac{1}{2}} \tilde{M} \tilde{S}^{-\frac{1}{2}}$, where $\tilde{M} = M + I_N$ is the adjacency matrix of the undirected graph $G$ with added self-connections, $I_N$ is the identity matrix, and $\tilde{S}_{ii} = \sum_j \tilde{M}_{ij}$. We calculate $\tilde{M}$ by adding the identity matrix and the adjacency matrix together, then we do a row-wise sum on $\tilde{M}$ to get $\tilde{S}$, and raise $\tilde{S}$ to the power of $-\frac{1}{2}$ to get $\tilde{S}^{-\frac{1}{2}}$. Finally, we multiply $\tilde{S}^{-\frac{1}{2}}$ and $\tilde{M}$ according to the normalization formula defined above to finish the normalization of the adjacency matrix. After normalization, we can get a list of receptive fields for each neighborhood graph as the input of the convolutional network. The whole GCNs process can be modeled as follows:
modeled as follows:

Given a graph $G = (V, E)$ with $N = |V|$ vertices, we have an adjacency matrix $M$ of $G$ and an $N \times D$ matrix $X$ as input. With stochastic latent variables $z_i$, we can summarize an $N \times F$ output matrix $Z$, where $F$ is the number of output features.

In this process, we define $D$ as the number of attributes per vertex. As the attributes are originally generated based on the selected neighbor vertices and the textual information of each entity vertex (see Sect. 3.1 for details), and each relation vertex only has a unique one-hot representation, the number of attributes differs across vertices. Therefore, we perform a union set operation over all vertices' attributes and set the number of elements in this union set as the value of $D$. When constructing $X$, if a vertex does not have some of the attributes, we fill in zero values for these attributes in order to complete the matrix construction. Each layer of the GCNs can then be written as a non-linear function:

$H^{(l+1)} = f(H^{(l)}, M),$   (1)

where $H^{(0)} = X$ and $H^{(L)} = Z$, and $L$ is the number of layers. We then set the following propagation rule:


Fig. 4 Each node sends its own feature information to its neighbor node after transformation

$f(H^{(l)}, M) = \mathrm{ReLU}(M H^{(l)} W^{(l)}),$   (2)


where $W^{(l)}$ is the weight matrix for the $l$-th network layer, and ReLU is the activation function. Note that the multiplication with $M$ only sums up the attributes of all neighbor vertices, not those of the vertex itself. Therefore, we need to add an identity matrix $I$ to $M$ according to Kipf and Welling (2017). Then Eq. (2) becomes:

$f(H^{(l)}, M) = \mathrm{ReLU}(\hat{D}^{-\frac{1}{2}} \hat{M} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}),$   (3)

where $\hat{M} = M + I$ and $\hat{D}$ is the diagonal vertex degree matrix of $\hat{M}$. If we set $L = 3$, for example, the network has three convolutional layers to reconstruct the structure of $M$ and obtain $Z$. If we want to retain half the number of receptive fields of the previous layer at each current layer, we can easily get $F = \frac{D}{2^L} = \frac{D}{8}$ after the three convolutional layers. Once we have the output of the encoder, we use an inner product decoder to reconstruct the data.
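To make the propagation rule and the decoder concrete, here is a minimal NumPy sketch of Eqs. (1)-(3) together with an inner product decoder. It follows the halving example above (D = 8, L = 3, so F = 1); weight initialization, dropout, the softmax output layer, and training are all omitted, and the numbers are purely illustrative rather than the authors' implementation.

```python
import numpy as np

def normalize_adjacency(M):
    """Symmetric normalization used in Eq. (3): D^{-1/2} (M + I) D^{-1/2}."""
    M_hat = M + np.eye(M.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(M_hat.sum(axis=1)))
    return d_inv_sqrt @ M_hat @ d_inv_sqrt

def gcn_encoder(X, M, weights):
    """Stack of ReLU graph convolutions (Eqs. (1)-(3)); `weights` holds one
    weight matrix per layer."""
    M_norm = normalize_adjacency(M)
    H = X
    for W in weights:
        H = np.maximum(M_norm @ H @ W, 0.0)  # ReLU
    return H                                 # Z, the latent representation

def inner_product_decoder(Z):
    """Reconstruct the adjacency structure from the latent codes."""
    return 1.0 / (1.0 + np.exp(-Z @ Z.T))    # sigmoid(Z Z^T)

rng = np.random.default_rng(0)
M = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)    # toy graph with N = 4 vertices
X = rng.normal(size=(4, 8))                  # D = 8 attributes per vertex
weights = [0.1 * rng.normal(size=(8, 4)),    # halve the features at each layer
           0.1 * rng.normal(size=(4, 2)),
           0.1 * rng.normal(size=(2, 1))]
Z = gcn_encoder(X, M, weights)
print(inner_product_decoder(Z).shape)        # (4, 4) reconstructed structure
```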

To better understand the convolution process, we describe three steps using a simple running example as follows:

Step 1: Each node sends its own feature information to its neighbor nodes after it is transformed, as Fig. 4 shows. This step extracts and transforms the feature information of the nodes.

Step 2: Each node aggregates the feature information of its neighbor nodes, as Fig. 5 shows. This step fuses the local structure information of the nodes.

Fig. 5 Each node aggregates the feature information of the neighbor node

Step 3: After gathering the previous information, the nonlinear transformation is completed by using the ReLU function (except at the last layer) to strengthen our model. At the last layer of convolution, softmax is used as the activation function to classify each node. As shown in Fig. 6, X1 to X4 represent the input data with features. After feature extraction, feature fusion, and nonlinear transformation, we obtain R1 to R4. Then R1 to R4 are fed into the next convolution layer to extract features once again. Finally, the extracted features are sent to the softmax activation function to determine which type each node belongs to.

Fig. 6 Nonlinear transformation and label prediction
3.3 Model optimization

As described earlier, our model needs to preserve both local and global structure of a KG. In other
words, a joint optimization mechanism is needed to optimize the first-order and the second-order
proximity simultaneously.


In our model, we adopt the common graph Laplacian regularization term loss function (Belkin et al. 2006; Kipf and Welling 2016; Weston and Collobert 2008; Zhou et al. 2003; Zhu et al. 2003) to perform joint optimization:

$\mathcal{L}_{overall} = \mathcal{L}_{first} + \lambda \mathcal{L}_{second},$   (4)

where $\mathcal{L}_{first}$ denotes the supervised loss of the first-order proximity, which corresponds to the labeled part of the graph, and $\mathcal{L}_{second}$ denotes the unsupervised loss of the second-order proximity; the smaller $\mathcal{L}_{second}$ is, the better the interpretable latent representations learned from the global structure of the graph by the GCNs. The $\lambda$ is a trade-off factor between $\mathcal{L}_{first}$ and $\mathcal{L}_{second}$.

For the loss function $\mathcal{L}_{first}$, we simply define it according to the idea of Laplacian Eigenmaps (Belkin and Niyogi 2003), which incurs a penalty when similar vertices are mapped far away in the embedding space:

$\mathcal{L}_{first} = \sum_{i,j=1}^{n} s_{i,j}\,\|y_i - y_j\|_2^2, \quad v_i \neq v_j,$   (5)

where $s_{i,j} = 1$ if $v_i$ and $v_j$ are two vertices linked by an edge; otherwise, $s_{i,j} = 0$. The $y_i$ and $y_j$ are two vectors that represent the attributes of $v_i$ and $v_j$.

For $\mathcal{L}_{second}$, we can define it as:

$\mathcal{L}_{second} = \sum_{l=0}^{L-1} \|H^{(l+1)} - H^{(l)}\|_2^2 = \sum_{l=0}^{L-1} \|f(H^{(l)}, M) - f(H^{(l-1)}, M)\|_2^2 = \sum_{l=0}^{L-1} \|\mathrm{ReLU}(M H^{(l)} W^{(l)}) - \mathrm{ReLU}(M H^{(l-1)} W^{(l-1)})\|_2^2,$   (6)

where $H^{(0)}$ is of size $N \times D$ and $H^{(l)}$ is of size $N \times \frac{D}{2^l}$ if we want to keep only half of the features after each layer. The dimensions of $H^{(0)}$ and $H^{(l)}$ are different; therefore, we increase the size of the smaller matrix $H^{(l)}$ to the size of $H^{(0)}$ by filling in zero elements, so that the subtraction between the two matrices can be performed before minimizing the norm. Note that we use the 2-norm because it is commonly used for neural network optimization; it represents the norm of the difference between the steps before and after

in each iteration. The smaller the difference is, the closer it is to the actual value. Our goal is to minimize $\mathcal{L}_{overall}$ as a function of $\theta$, where $\theta$ denotes the overall parameters. According to Eqs. (3), (5) and (6), we know that the key step is to calculate the partial derivative $\frac{\partial \mathcal{L}_{overall}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}_{first}}{\partial W^{(l)}} + \lambda \frac{\partial \mathcal{L}_{second}}{\partial W^{(l)}}$. For $\frac{\partial \mathcal{L}_{first}}{\partial W^{(l)}}$, it can be written as follows:

$\frac{\partial \mathcal{L}_{first}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}_{first}}{\partial Y} \cdot \frac{\partial Y}{\partial W^{(l)}},$   (7)

where $Y = \sigma(Y^{(l-1)}) W^{(l)} + b^{(l)}$, $Y^{(l-1)}$ is the $(l-1)$-th layer hidden representation, $\sigma$ is the sigmoid non-linear activation function, and $b^{(l)}$ is the $l$-th layer bias. For the first term of Eq. (7), we have:

$\frac{\partial \mathcal{L}_{first}}{\partial Y} = (\mathrm{Loss} + \mathrm{Loss}^{T}) \cdot Y,$   (8)

where $\mathrm{Loss}$ is the loss function for the reconstruction error of our model. Similarly, we have $\frac{\partial \mathcal{L}_{second}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}_{second}}{\partial \hat{X}} \cdot \frac{\partial \hat{X}}{\partial W^{(l)}}$, where $X$ is the input data and $\hat{X}$ is the reconstructed data. For the first term $\frac{\partial \mathcal{L}_{second}}{\partial \hat{X}}$, we have:

$\frac{\partial \mathcal{L}_{second}}{\partial \hat{X}} = (X + \hat{X}) \odot B,$   (9)

where $B$ is the matrix form of $\{b_{i,j}\}_{i,j=1}^{n}$. If $s_{i,j} = 0$, then $b_{i,j} = 1$; otherwise $b_{i,j} > \beta$ with $\beta > 1$. In our model, $\beta$ is used as one of the parameters during optimization. Therefore, to find a good region of the parameter space, we can use $\frac{\partial \mathcal{L}_{overall}}{\partial \theta}$ to back-propagate through the network and update the parameters $\theta$ until convergence, because $\frac{\partial \mathcal{L}_{overall}}{\partial \theta}$ can simply be treated as $(\mathrm{Loss} + \mathrm{Loss}^{T}) \cdot Y + \lambda (X + \hat{X}) \odot B$.
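For readers who prefer code, a small NumPy sketch of the joint objective in Eqs. (4)-(6) is given below, under the zero-padding convention described above. It is illustrative only, not the authors' implementation; `Y` holds one attribute vector per vertex and `H_list` the layer outputs of the GCN encoder.

```python
import numpy as np

def l_first(Y, S):
    """Eq. (5): Laplacian Eigenmaps penalty on the first-order proximity.
    S is the 0/1 link indicator matrix s_{i,j}."""
    n = Y.shape[0]
    return sum(S[i, j] * np.sum((Y[i] - Y[j]) ** 2)
               for i in range(n) for j in range(n) if i != j)

def l_second(H_list):
    """Eq. (6): squared 2-norm of the change between consecutive layers;
    smaller matrices are zero-padded to the width of H^(0)."""
    width = H_list[0].shape[1]
    pad = lambda H: np.pad(H, ((0, 0), (0, width - H.shape[1])))
    return sum(np.sum((pad(H_next) - pad(H_prev)) ** 2)
               for H_prev, H_next in zip(H_list[:-1], H_list[1:]))

def l_overall(Y, S, H_list, lam=1.0):
    """Eq. (4): joint objective with trade-off factor lambda."""
    return l_first(Y, S) + lam * l_second(H_list)
```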

For hyperparameter optimization, we set the dropout rate to 0.2 for all layers and tune the L2 regularization factor and the number of hidden units for each layer. After trying many different settings, we finally train the model for a maximum of 100 epochs using the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.01 and early stopping with a window size of 10.
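The training setup described above can be sketched roughly as follows in PyTorch. The tiny two-layer encoder, the identity matrix standing in for the normalized adjacency, and the cross-entropy term standing in for $\mathcal{L}_{overall}$ are simplifications of ours; only the hyperparameters (dropout 0.2, Adam with learning rate 0.01, at most 100 epochs, early stopping with a window of 10) follow the settings reported in this section.

```python
import torch

class TinyGCN(torch.nn.Module):
    """A stand-in two-layer GCN encoder with dropout 0.2."""
    def __init__(self, m_norm, d_in, d_hidden, d_out):
        super().__init__()
        self.m_norm = m_norm                  # pre-normalized adjacency (Eq. 3)
        self.drop = torch.nn.Dropout(0.2)
        self.w1 = torch.nn.Linear(d_in, d_hidden, bias=False)
        self.w2 = torch.nn.Linear(d_hidden, d_out, bias=False)

    def forward(self, x):
        h = torch.relu(self.m_norm @ self.drop(self.w1(x)))
        return self.m_norm @ self.drop(self.w2(h))

n, d = 4, 8
m_norm = torch.eye(n)                         # placeholder normalized adjacency
x = torch.randn(n, d)                         # toy vertex features
labels = torch.randint(0, 3, (n,))            # toy labels derived from O_ne
model = TinyGCN(m_norm, d, 4, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

best, patience = float("inf"), 0
for epoch in range(100):                      # at most 100 epochs
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), labels)
    loss.backward()
    optimizer.step()
    if loss.item() < best - 1e-4:
        best, patience = loss.item(), 0
    else:
        patience += 1
        if patience >= 10:                    # early stopping, window size 10
            break
```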

4 Experiments

In this section, we study the empirical performance of our proposed models on two benchmark tasks:
entity classification and link prediction.

4.1 Datasets and preparation

Like many existing works, e.g., Xu et al. (2017), we use two popular corpora to conduct our experiments, namely, FB15K (Bordes et al. 2013) and WIN18 (Bordes et al. 2014). We first remove all entities that have no textual information, together with their associated triples, from these two corpora. Table 1 lists the statistics of the two corpora after pre-processing.

Table 1 Statistics of datasets

Dataset   #Rel   #Ent     #Train    #Valid   #Test
FB15K     1336   14,885   472,860   50,000   57,800
WIN18     18     40,100   140,975   5000     5000

We select several state-of-the-art methods for comparison: TransE (Bordes et al. 2013), TransD (Ji et al. 2019), DKRL(CNN) (Xie et al. 2016), Jointly(LSTM) and Jointly(A-LSTM) (Xu et al. 2017). Besides, we have selected two popular approaches, HOLE (Nickel et al. 2016) and ConMask (Shi and Weninger 2018), that also use global information, and two GCNs based models, namely, GCN (Kipf and Welling 2017) and Deep GCN (Li et al. 2018). We adjusted these methods to make sure all of them can support KG embedding. All models, including our model, are trained using various parameter settings to get the best performances.

4.2 Entity classification

The task of entity classification is a multi-label classification task aiming to predict entity types. We select the top 50 types for classification from FB15K and WIN18 according to entity type frequency. The top 50 types cover 13,306 entities for FB15K and 38,158 entities for WIN18, respectively. We then use 10-fold cross-validation to perform this evaluation.

As it is a multi-label classification task, we use the Softmax function (Bishop 2006) as the classifier and the common mean average precision (MAP) as the evaluation indicator. From Table 2, we observe that our proposed model outperforms all other methods on both datasets. The model achieves an MAP value approximately 5% higher than the second-best model, which is HOLE, and at least 25% higher than TransE. ConMask is originally designed for KG completion; therefore, it is not surprising that it only outperforms TransE and TransD. The results indicate that the features generated by our model are more capable of capturing entity type information and have better robustness. The reason is that it is natural for GCNs to encode both the structural information and the textual information of KGs to obtain a better understanding of entities. Some models also make use of both kinds of information but do not take structural information into deep consideration, while TransE only focuses on the local structure of the structural information, failing to encode the textual information. The two GCNs based models, GCN and Deep GCN, are not very impressive on this task due to the lack of joint optimization of local and global structural information.

4.3 Link prediction


Link prediction is a typical task to complete a triplet (h,r,t) of a KG with h or t missing, i.e., given (h,r)
predict t. This task emphasizes more on ranking a set of candidate entities from the knowledge graph.
Similar to many of the existing works, e.g., Bordes et al. (2013), Xie et al. (2016) and Xu et al. (2017), we
use two measures as our evaluation metrics. (1) Mean Rank: the averaged rank of correct entities or
relations; (2) Hits@p: the proportion of valid entities or relations ranked in top p predictions. In this
paper, we set p = 10 for entities and p = 1 for relations. A nice embedding model should achieve lower
Mean Rank and a higher Hits@10. The evaluation results are reported in Table 3.
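For clarity, the two metrics can be computed as in the following small sketch; `ranks` is assumed to hold the 1-based rank of the correct entity for every test triple.

```python
def mean_rank_and_hits(ranks, p=10):
    """Return (Mean Rank, Hits@p) from a list of 1-based ranks."""
    mean_rank = sum(ranks) / len(ranks)
    hits_at_p = sum(1 for r in ranks if r <= p) / len(ranks)
    return mean_rank, hits_at_p

print(mean_rank_and_hits([1, 3, 25, 7, 120], p=10))  # (31.2, 0.6)
```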


Table 2 The results of entity classification (MAP)

Methods           FB15K   WIN18
TransE            61.5    70.0
TransD            68.2    75.6
DKRL(CNN)         73.5    80.1
Jointly(LSTM)     75.0    83.0
Jointly(A-LSTM)   76.8    84.5
HOLE              78.0    85.5
ConMask           72.6    79.5
GCN               76.5    84.0
Deep GCN          77.5    85.1
Proposed model    80.4    88.9

Table 3 The results of link prediction

                  FB15K                    WIN18
Methods           Mean rank   Hits@10(%)   Mean rank   Hits@10(%)
TransE            120.0       46.5         250.0       85.1
TransD            90.0        50.3         210.0       91.0
DKRL(CNN)         91.0        65.2         206.0       91.0
Jointly(LSTM)     85.0        68.5         98.0        91.6
Jointly(A-LSTM)   78.0        73.4         125.0       90.0
HOLE              77.0        74.1         95.5        92.7
ConMask           90.5        62.0         205.0       89.0
GCN               82.4        72.3         105.0       91.0
Deep GCN          79.8        73.5         92.0        91.5
Proposed model    73.6        78.0         87.6        93.5

From the results, similarly to the entity classification task, we observe that our proposed model is better than all other methods on both metrics for the task of link prediction. For example, on FB15K, our model achieves at least a 60% higher Hits@10 value than TransE. This experiment indicates that our model achieves substantial improvements on Mean Rank and Hits@10 because our model, especially the design of the second-order proximity computation, is well suited for KG embedding. On WIN18, Jointly(LSTM) performs better than Jointly(A-LSTM) because the number of relations in this dataset is relatively small; therefore, the attention mechanism of Jointly(A-LSTM) does not have a distinct advantage. We also find that both GCNs based models achieve good results in this experiment, which proves the advantage of GCNs on KG embedding to some extent.

4.4 Ablation study

To further study the importance of the first-order proximity and the second-order proximity, we conduct an ablation study for our model. We first remove the component for the first-order proximity computation from the model, which means we evaluate the performance based on the second-order proximity only, namely, SPO. We then use SPO to connect the component for the first-order proximity computation but remove the step of neighbor vertices selection, which means all neighbor vertices will be considered, namely, FPall_SPO. Tables 4 and 5 show the performance of each feature for the tasks of entity classification and link prediction.

Table 4 The performance of each feature for the task of entity classification (MAP)

Methods          FB15K   WIN18
SPO              77.0    84.3
FPall_SPO        78.5    85.1
Proposed model   80.4    88.9

Table 5 The performance of each feature for the task of link prediction

                 FB15K                    WIN18
Methods          Mean rank   Hits@10(%)   Mean rank   Hits@10(%)
SPO              81.0        73.0         101.0       91.0
FPall_SPO        75.8        76.2         94.6        92.5
Proposed model   73.6        78.0         87.6        93.5

According to the results in Tables 4 and 5, we find that the performance of SPO is nearly the same as that of GCN (Kipf and Welling 2017) because the structure of the neural network is similar. Our model optimization makes SPO slightly better than GCN. The performance of FPall_SPO is further improved over SPO on both tasks because of the role of the first-order proximity. However, without the step of neighbor vertices selection, the performance of FPall_SPO is worse than that of the proposed model. This study shows the importance of the first-order proximity and the neighbor vertices selection for performance improvement.

4.5 Performance on less labelled data

Last but not least, sufficient labeled data is often unavailable in real-world applications. One of the popular solutions to this problem is to use semi-supervised learning methods, which is also the approach of our proposed model in this article. As a complementary part of our work, we use different percentages of labelled data to train our model in this experiment, which means we only assign attributes to a certain percentage of vertices to generate supervised information for the tasks of entity classification and link prediction, in order to evaluate the ability of our model to handle such situations. The results are shown in Figs. 7 and 8.
Fig. 7 Performance of entity classification (MAP) with only a certain percentage of data labelled

Fig. 8 Performance of link prediction (Hits@10) with only a certain percentage of data labelled

According to Figs. 7 and 8, we learn that our model performs well on both tasks even with 1% labeled data. Our model can still achieve an MAP value of around 60 for the task of entity classification and a Hits@10 value of at least 50 for the task of link prediction on the FB15K and WIN18 datasets, respectively, which is nearly the same performance as TransE (Bordes et al. 2013). This experiment shows the robustness and practicality of our model in real-world situations.

5 Conclusion and future work

In this paper, we propose a semi-supervised model for knowledge graph embedding. The model utilizes both the textual information and structural information of entities by designing a neighbor vertices selection algorithm for the first-order proximity computation and adopting a graph convolutional networks based auto-encoder for the second-order proximity computation. The model is trained by a joint optimization mechanism so that the local and global structure of a KG can be well preserved.

In the experiments, we evaluate our model using two popular tasks, entity classification and link prediction, on two well-known corpora. Experimental results show that our model achieves better performance than several state-of-the-art methods on both tasks. In addition, we use different percentages of labeled data to train the model in order to evaluate its practicality. The result of this experiment proves that even with 1% labeled data, our model can still achieve an acceptable MAP value for entity classification and Hits@10 value for link prediction.

In the future, we intend to combine recurrent neural networks (Lai et al. 2015) with GCNs to further integrate the representations of the textual and structural information of entities into a unified architecture, because recurrent neural networks are particularly useful for exploring the context of entity textual information, which should help us learn the embedding of a KG more deeply.

Acknowledgements This work is supported by the Guangzhou Key Laboratory of Big Data and Intelligent Education (201905010009) and the National Natural Science Foundation of China (Nos. 61877020, 61772211).
References

Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation.
Neural Comput 15(6):1373–1396

Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning
from labeled and unlabeled examples. J Mach Learn Res 7(1):2399–2434

Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin

Bordes A, Usunier N, Garcia-Duran A, Weston J, Yakhnenko O (2013) Translating embeddings for modeling multi-relational data. In: Proceedings of the 26th international conference on neural information processing systems, pp 2787–2795

Bordes A, Glorot X, Weston J, Bengio Y (2014) A semantic matching energy function for learning with multi-relational data. Mach Learn 94(2):233–259

Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast

localized spectral filtering. In: Proceedings of the 29th international conference on neural information

processing systems, pp 1–9

Huang C, Zhu J, Huang X, Yang M, Fung G, Hu Q (2018) A novel approach for entity resolution in scientific

documents using context graphs. Inf Sci 432(1):431–441

Ji GL, Liu K, He S, Zhao J (2016) Knowledge graph completion with adaptive sparse transfer matrix. In:

Proceedings of the 30th international conference on artificial intelligence, pp 985–991

Ji GL, He SZ, Xu LH, Liu K, Zhao J (2019) Knowledge graph embedding via dynamic mapping matrix. In: Proceedings of the 53rd international conference on association for computational linguistics, pp 687–696

Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, pp 1–15

Kipf TN, Welling M (2016) Variational graph auto-encoders. ArXiv e-prints:1611.07308

Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th international conference on learning representations, pp 1–14

Lai SW, Xu LH, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In:
Proceedings of the 29th international conference on artificial intelligence, pp 2267–2273

Lao N, Subramanya A, Pereira F, Cohen W (2012) Reading the web with learned syntactic-semantic
inference rules. In: Proceedings of the joint conference on empirical methods in natural language

processing and computational natural language learning, pp 1017–1026

Li QM, Han ZC, Wu XM (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In: Proceedings of the 32nd international conference on artificial intelligence, pp 1–9

Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inf Sci Technol 58(7):1019–1031


Lin Y, Liu Z, Sun M, Liu Y, Zhu X (2015) Learning entity and relation embeddings for knowledge graph
completion. In: Proceedings of the 29th international conference on artificial intelligence, pp 2181–
2187

Nickel M, Rosasco L, Poggio T (2016) Holographic embeddings of knowledge graphs. In: Proceedings of
the 30th international conference on artificial intelligence, pp 1955–1961

Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: Proceedings
of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 701–
710

Shi BX, Weninger T (2018) Open-world knowledge graph completion. In: Proceedings of the 32nd international conference on artificial intelligence, pp 1957–1964

Socher R, Chen DQ, Manning C, Ng A (2013) Reasoning with neural tensor networks for knowledge base
completion. In: Proceedings of the 26th international conference on neural information processing
systems, pp 926–934

Szumlanski S, Gomez F (2010) Automatically acquiring a semantic network of related concepts. In: Proceedings of conference on information and knowledge management, pp 19–28

Tang J, Meng Q, Wang M, Zhang M, Yan J, Mei Q (2015) Line: large-scale information network
embedding. In: Proceedings of the 24th international conference on World Wide Web, pp 1067–1077
Toutanova K, Chen D, Pantel P, Choudhury P, Gamon M (2015) Representing text for joint embedding of
text and knowledge bases. In: Proceedings of conference on empirical methods in natural language
processing, pp 21–28

Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph attention networks. In: Proceedings of the 6th international conference on learning representations, pp 1–12

Vries GKD (2013) A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, pp 606–621

Wang Z, Zhang JW, Feng JL, Chen Z (2014) Knowledge graph and text jointly embedding. In: Proceedings
of conference on empirical methods in natural language processing, pp 1591–1601

Wang ZG, Li JZ (2016) Text-enhanced representation learning for knowledge graph. In: Proceedings of
the 25th international joint conference on artificial intelligence, pp 1293–1299

Wang D, Cui P, Zhu W (2016) Structural deep network embedding. In: Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and data mining, pp 1225–1234

Wang Q, Mao ZD, Wang B, Guo L (2017) Knowledge graph embedding: a survey of approaches and
applications. IEEE Trans Knowl Data Eng 29(12):2724–2743

Wang ZC, Lv QS, Lan XH, Zhang Y (2018) Cross-lingual knowledge graph alignment via graph convolutional networks. In: Proceedings of conference on empirical methods in natural language processing, pp 349–357

Weston J, Collobert R (2008) Deep learning via semi-supervised embedding. In: Proceedings of the 25th
international conference on machine learning, pp 1168–1175

Xiao H, Huang ML, Zhu XY (2016) From one point to a manifold: knowledge graph embedding for precise
link prediction. In: Proceedings of the 25th international joint conference on artificial intelligence, pp
1315–1321

Xie RB, Liu ZY, Luan HB, Jia J, Sun MS (2016) Representation learning of knowledge graphs with entity
descriptions. In: Proceedings of the 30th international conference on artificial intelligence, pp 2659–
2665

Xu JC, Qiu XP, Chen K, Huang XJ (2017) Knowledge graph representation with jointly structural and
textual encoding. In: Proceedings of the 26th international joint conference on artificial intelligence, pp
1318–1324

Yao L, Mao CS, Luo Y (2019) Graph convolutional networks for text classification. In: Proceedings of the
33rd international conference on artificial intelligence, pp 1–9
Zhong H, Zhang JW, Wang Z, Wan H, Chen Z (2015) Aligning knowledge and text embeddings by entity
description. In: Proceedings of conference on empirical methods in natural language processing, pp
267–272

Zhou DY, Bousquet O, Lal TN, Weston J, Scholkopf B (2003) Learning with local and global consistency.
In: Proceedings of the 16th international conference on neural information processing systems, pp 321–
328

Zhu X, Ghahramani ZB, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th international conference on machine learning, pp 912–919


Zhu J, Xie Q, Zheng K (2015) An improved early detection method of type-2 diabetes mellitus using
multiple classifier system. Inf Sci 292:1–14

Zhu J, Xie Q, Yu SI, Wong WH (2016) Exploiting link structure for web page genre identification. Data Min
Knowl Discov 30(3):550–575

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.


