AUC in Link Predicction

Received April 13, 2018, accepted May 12, 2018, date of publication May 18, 2018, date of current
version June 19, 2018.

Digital Object Identifier 10.1109/ACCESS.2018.2838259
Link Prediction on Directed Networks

Based on AUC Optimization
BOLUN CHEN 1,2 , YONG HUA1 , YAN YUAN1 , AND YING JIN1
1 Collegeof Computer Engineering, Huaiyin Institute of Technology, Huaian 233003, China
2 Department of Physics, University of Fribourg, CH-1700 Fribourg, Switzerland
Corresponding author: Yong Hua (huayong1994@126.com)

This work was supported in part by the Chinese National Natural Science Foundation under Grant 61602202, in part by the Natural
Science Foundation of Jiangsu Province under Contract BK20160428 and Contract BK20161302, in part by the Foundation of Jiangsu
Provincial Science and Technology Department under Contract BY2016061-25, and in part by the China Scholarship Council.
ABSTRACT Complex networks have become high-dimensional, sparse, and redundant due to the rapid
expansion of the Internet. Effective link prediction techniques are needed to obtain the most relevant and
important information for Internet users. A new link prediction algorithm based on the area under the receiver
operating characteristic curve (AUC) as the evaluation metric is proposed in this paper. In the proposed
method, the AUC is treated as the objective function and the link prediction problem is transformed into an
optimization problem. A group of topological features is defined for each ordered pair of nodes. By using
those features as the attributes of the node pairs, link prediction can be treated as a binary classification, where
the class label of each node pair is determined by the existence of a directed link between the node pair. Then,
the binary classification problem can be solved by the AUC optimization. According to the empirical results,
high-quality predictions can be achieved by our algorithm.
INDEX TERMS Link prediction, AUC metric, complex networks, target function.
I. INTRODUCTION between people can be found through the analysis of social

As Internet technology rapidly develops, the world is becom- relations [4], [5]. Furthermore, to forecast potential coop-
ing increasingly complicated with frequent networking. In the erators and article types, link prediction could be utilized
real world, various information, biological, and social sys- in academic networks [6]. The links between nodes show
tems, from interpersonal relationships to colony structures, interactive relations in biological networks, such disease-
from transportation to the online world, from ecosystems to gene and metabolic networks and protein-protein interaction
the nervous system, can be considered as networks in which networks [7]. The research on link prediction has significant
vertices represent interactions between vertices or links and theoretical importance and an extensive scope of practical
entities denote relations. Undoubtedly, some potential links value [8], [9]. For instance, the research can offer a unified
remain undetected, and redundant links and errors occur in and convenient platform to fairly compare the mechanisms
complex networks due to space and time limitations and of network development and to help to theoretically compre-
the experimental conditions. Moreover, we need to forecast hend the mechanisms of complex network development to
potential and missing links based on known network infor- study complex network evolution models.
mation. This is the objective of the forecast challenge related The rest of this article is organized as follows.
to networks as a result of the changing interactions in complex Section 2 introduces the related work. Section 3 reviews
networks [1], [2]. the problem of link prediction and presents the AUC score
In fact, link prediction methods can be utilized as auxil- for evaluating the results. Section 4 describes the features
iary means to study the structure of social networks. Indeed, of node pairs for link prediction. Section 5 proposes the
it is essential to forecast future potential links among peo- AUC_LP algorithm and describes the details of implementing
ple. The link prediction problem has an extensive scope of the algorithm. The experimental results of the AUC_LP are
practical applications; for instance, users’ potential friends presented and analyzed in Section 6, and the performance is
can be identified by link prediction and they can even be compared with that of similar approaches. Finally, a summary
introduced to others in online networks [3]. Potential links is provided in Section 7.
2169-3536
2018 IEEE. Translations and content mining are permitted for academic research only.
28122 Personal use is also permitted, but republication/redistribution requires IEEE permission. VOLUME 6, 2018
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
B. Chen et al.: Link Prediction on Directed Networks Based on AUC Optimization
II. RELATED WORK In addition, several learning approaches related to predict-

According to a majority of research on link prediction, ing relations have been proposed on the basis of the index
network links are not directed. In real-world applications, provided by relation predicting metrics, external informa-
numerous systems can be represented by directed networks tion and internal attributes. The above approaches based on
consisting of directed hyperlinks. For instance, followers learning consist of matrix factorization, probabilistic graph
form links to opinion leaders on microblogs and food webs models and feature-based classification. For instance, the ver-
are composed directed linkages from predators to preys. The tex collocation profile metric is considered to be a member
asymmetric nature of links is an aspect of the direct net- of specific characteristic that represents a portions of the
work. As direct networks, modeling links introduce compli- topology information in detail [18]. Additionally, the appli-
cations but provide important advantages for analysis. Two cation of attributes of nodes and relations, including users’
states exist between a pair of nodes when a link is symmet- interests, ages, friends and characteristics, has been reported
ric: absent or present. Additionally, four possibilities exist to be a good way to enhance the relation predicting ability.
between two nodes v and u when the links are asymmetric: Li and Chen [19] proposed a learning approach on the basis
the absence of a link between v and u, u and v are connected of graph kernels and applied various characteristics, including
mutually, v links to u, and u links to v. Because v is more the introduction, book title, age, education level and key-
essential to u than u is to v, v has a status or power edge over words, to predict user-item relations in a bipartite network.
u if there is a directed link from u to v. Scellato et al. [20] believed that global features, social fea-
Some early work in the field of directed networks anal- tures and location features are beneficial characteristics to
ysis aimed to uncover the principles of structural design in predict relations in social networks with a supervised learning
ecology, neurobiology, biochemistry, social and engineering frame.
networks. An algorithm was developed by Milo et al. [10] to Enhancing the ability to predict is one of the superiorities
detect ‘‘network motifs’’, i.e., important, recurring patterns of non-topological characteristics. Nevertheless, most of the
of internal connections. They provided insight into 13 types time these characteristics are either unavailable or hard to
of 3-node connected sub-graphs in 19 real directed networks. obtain. Many non-topological characteristics are critically
Link formation of directed networks has received more atten- interrelated to the domain. To identify and exploit these
tion than that of undirected networks in studies on online characteristics, we need to fully learn the domain. Hence,
social networks, especially microblogging networks, where generic characteristics, such as topological features, net-
ties are asymmetric. Golder and Yardi [11] conducted a web- works and nodes, are typically considered in normal relation
based experiment on Twitter and found that mutuality and prediction applications. Additionally, non-topological char-
transitivity are important indices of the desire to build new acteristics must be considered when the goal is a real relation
connections. Information and social networks were combined prediction learning model. Finally, the main objective in rela-
by Yin et al. [12] who also studied the distribution of relation prediction is how to select features in the same way that
tionship types for new links between nodes that are two hops other information is learned.
apart. Garlaschelli and Loffredo [13] focused on link reci- However, fortunately, we can apply many characteris-
procity, and based on the actual level of correlation between tic selection methods in machine learning. For instance,
mutual links, they proposed a novel measure of reciprocity to Scripps et al. [21] proposed learning technology that can
enable network ordering. automatically determine the most predictive attributes and
According to Zhu et al. [14], in the process of information topological features by aligning the adjacency matrix of a net-
spreading, non-reciprocal links are less essential than recip- work. Pujari and Kanawati [22] proposed a novel topological
rocal ones. They also presented some explanations from the method that uses a supervised rank gathering approach. They
perspective of network efficiency and connectivity in direct applied data to determine weights related to every calculated
online social networks. Methodology and formalization were characteristic on the basis of the competence of each attribute
proposed by Romero and Kleinberg [15] to study two-step to predict known relations. Chiang et al. [23] presented a
direct closure and to prove its essential function in link for- weighted machine learning approach for relation prediction
mation on Twitter. Schall [16] studied triad patterns to predict that applies characteristics from long cycles in a network.
links among nodes in directed networks and proposed an The tactics of machine learning can be applied to network
algorithm called triadic closeness. Moreover, a hypothesis for link prediction. Bliss et al. [24] developed a method to predict
the potential theory of link formation in a general directed future links through the use of covariance matrix adapta-
network was proposed by Zhang et al. [17]. tion evolution strategy (CMA-ES) to optimize the weights
Furthermore, link prediction was used to verify the deduc- applied in the linear integration of node similarity indices
tion. However, to the best of our knowledge, most research on and sixteen neighborhoods. The notion of path analysis and
link prediction in directed networks focuses on predicting the probabilistic link prediction was proposed by Sarukkai [25],
directions of existing or potential links; very few quantitative who used Markov chains. Bhawsar et al. [26] used various
approaches output the probability of the occurrence of a link features of social networks and applied genetic algorithms
between each node pair. to predict links; they built chromosomes via feature selec-
VOLUME 6, 2018 28123

tion in the genetic algorithm. To address the issue of link AUC is the most common measurement to evaluate the
prediction, a non-negative matrix factorization model based quality of predictions results. As a crucial measure of per-
on neighborhoods was proposed by Zhao et al. [27]. The formance, AUC has been applied to various tasks, including
model adopts the factors of characteristics from the topolog- class-imbalance learning, cost-sensitive learning, instances
ical structure that combine with the local neighboring struc- ranking, and information retrieval. For link prediction in
ture of the potential network. A rank aggregation approach networks, regardless of the method we use, our final goal is to
to predict links in complex networks was proposed by optimize the AUC. Since AUC is a measurement of the link
Pujari and Kanawati [28]. Complete Bayesian inference was prediction result quality, we can optimize it directly instead of
performed by Ermis and Cemgil [29] through variational calculating alternative similarity indexes or optimizing other
Bayesian (VB) methods on the coupled and single-tensor parameters, which is required in other models describing the
factorization models. The approach can be performed in large topological features of the network. Inspired by this fact,
models. Cai et al. [30] presented a quick comparison on the we present a link prediction method that directly optimizes
basis of a method used to predict the relations connected to the AUC. In the method, AUC is treated as the objective
a specific node. They used a random walk method to build a function, and link prediction is transformed into an optimiza-
road to the relations of a specific node. Lv et al. [31] presented tion problem. A group of topological features is defined for
a common structural similarity index that independent of each ordered pair of nodes. By using those features as the
previous knowledge of the network in terms of perturbation of attributes of the node pairs, link prediction can be treated as
the adjacency matrix. Li et al. [32] added the determination a binary classification problem in which the class label of
of a definition to the benefits of their own final roads and each node pair is determined by whether a directed link exists
the roads related to the final nodes and proposed a new between the node pair. Thus, the link prediction problem
consistency index called Scop. can be reduced to determining which edges belong to the
Recently, many scholars found that a modular network positive class, where the edges exist, and which belong to the
structure is capable of disclosing future links. Some scholars negative class, where the edges are absent. Then, the binary
have used local partitioning of a network to predict future classification problem can be solved by AUC optimization.
links. For instance, Zhang et al. [33] found that information According to the experimental findings on actual networks,
provided online is both misleading and redundant. Due to the accurate results and high speed are achieved by the algorithm
‘‘less can be more’’ characteristics, some algorithms can be in comparison with other methods.
designed to improve prediction performance by eliminating
links from the original networks. Clemente and Grassi [34] III. LINK PREDICTION AND AUC SCORE
focused on the problem of evaluating the local clustering in A network represented by a directed graph G = (V , E) is
complex networks. This measure can be defined in various discussed in this paper. In the graph, E is the set of links
ways for cases of networks with weighted edges. and V is the set of nodes. G does not allow self-connections
However, less attention has been given to directed and and multiple links. When N = |V | is the number of nodes, U
weighted networks. The authors provided a new local clus- is used to define the set that contains all the possible directed
tering coefficient for these types of networks, beginning links of N (N − 1). When aij = 1 and A = [aij ] is the
with those existing in previous studies for the undirected neighboring matrix of network G, there is a linkage between
and weighted case. Moreover, four specific components are nodes j and i; otherwise, aij = 0. The same derivation can
extracted to separately consider different types of triangle be applied to the weighted links, in which the A entries refer
patterns. to constant or ordinal values. The link prediction task can be
Moreover, an empirical application on several real net- viewed as links that will show up in the future or missing links
works was presented, and the performance of coefficients in the series of links U − E that do not exist.
from previous studies was compared with the performance of Our approach is to give a score, S(x, y), to every pair
our coefficients. Agreste et al. [35] assessed the community of ordered nodes (x, y) ∈ U . The current possibility of a
detection algorithms in terms of scalability and accuracy direct link between nodes is indicated by the score. The larger
and proposed the infomap algorithm, which can be regarded S(x, y) is, the higher the probability that there is a link from
as a promising tool for the purposes of web data analysis, node x to node y for a pair of nodes (x, y) in U \E.
to illustrate the trade-off between computational performance To evaluate the link prediction results, links in the network
and accuracy. Santos et al. [36] proposed a consensus graph are divided into two parts: the probe set E P represents the
that clusters algorithms in terms of directed networks and unknown information for forecasting and the training set
achieves superior performance with small-sized clusters and E T represents the known information. E T ∪ E P = E and
a high mixture index. E T ∩ E P = ∅. An ordered list of all non-observed links (i.e.,
Nevertheless, many algorithms based on similarity con- U −E T ) is provided by the link prediction algorithm, or every
sider only the benefits of roads related to two nodes and non-observed link is provided. Therefore, for (x, y) ∈ U −
ignore the effects the two nodes when considering their com- E T , the score Sxy quantifies the current probability.
parability. Hence, the inaccurate outcomes cannot be fully AUC is the most widely adopted evaluation metric to
trusted. quantify the accuracy of the prediction approaches to assess
28124 VOLUME 6, 2018

the results of link prediction. As a significant measure of

performance that is commonly utilized in various tasks,
including class-imbalance learning, cost-sensitive learning,
information retrieval and rank learning, AUC can be utilized
in datasets with regular stands. For example, because AUC
is blind to the class distribution, recall and accuracy are
inadequate. A score is assigned by the similarity-oriented link
prediction approaches to every non-existing and existing edge
within the network, and an ROC (receiver operating charac-
teristic) curve is constructed based on the scores assigned to
the ordered node pairs. Generally, a larger AUC indicates bet-
ter performance. The AUC of a randomly selected predictor FIGURE 1. ROC curve of Example 1.
is 0.5 while the AUC value of a perfect predictor is 1.0.
Assume n = n1 + n2 and that there are n2 non-exiting uating the quality of prediction results. In essence, the goal
links and n1 existing links in the network. Let si be the score link prediction methods is to optimize the AUC. Inspired by
assigned to ei , and let en1 +1 , en2 +1 , . . . , en and e1 , e2 , . . . , en1 this fact, we present a link prediction method that directly
be the sets of non-existing and existing links, respectively. optimizes the AUC. In this method, AUC is used to construct
The ROC for link prediction can be obtained as follows: the objective function, and link prediction is transformed into
(1) First, sort the links in descending order of score and let an optimization problem.
the sorted sequence of the edges be e01 , e02 , . . . , e0n1 . Example 2: Assume the links a, b, c, d, e, f , and g are in
(2) Generate a coordinating system of the ROC curve, a network, where e, f , and g are non-existing links and a,
where the ordinate indicates the current edges and the b, c, and d are existing links. The prediction result gives the
abscissa represents the non-existing edges. The abscissa scores sa , sb , sc , sd , se , sf , and sg to the related links such
consists of n2 steps, and the length of each step that sa > se = sb > sf > sc > sd > sg . This study first
is 1/n2 . The ordinate consists of n1 steps, and the length of ranks the links in descending order of their scores to obtain
each step is 1/n2 . Thus, a unit-side-length square is formed the sequence a, e, b, f , c, d, and g to generate the ROC curve.
by the ROC coordinate system. Then, as shown in Figure 1, the coordinate system of the ROC
(3) The ROC curve begins at the origin of the coordinate curve can be determine. Herein, the ordinate consists of 4
system and moves rightwards toward the square’s upper- steps and the abscissa consists of 3 steps. As the first link
right corner. In each step, the direction of the movement is in the sequence is a, the curve initially moves upwards one
determined by the sorted sequence of e01 , e02 , . . . , e0n1 . The step, starting from the origin of the coordinate system. Then,
curve in the i − th step moves upward one step if e01 is an as the next two links in the sequence are the existing link b
existing edge. e0i+1 is an existing edge, and s0i = s0i+1 if e01 is and the non-existing link e, se = sb , the curve moves one
a non-existing edge. Then, the curve moves one step along step along the diagonal. The next link is f , a non-existing link.
the diagonal; or it moves one step rightwards otherwise. The Therefore, the curve moves one step rightwards. According to
curve forms the ROC curve when it reaches the upper-right Figure 1, the ROC curve is completed by repeating the move-
corner of the square after n steps. ment until the upper-right corner of the square is reached.
(4) The area under the ROC curve is the AUC value of link The shaded portion under the curve is the AUC, which is
prediction results. The process of generating an ROC curve is 7.5/12 = 0.625.
illustrated with the following examples. The ROC curve does not have to be drawn to calculate
Example 1: The forecasting results of the AUC are esti- the AUC, which can be seen as the possibility that a ran-
mated in example 1. Three non-existing links and 4 existing domly selected non-existing link is given a lower score than
links are included in this example. Therefore, the scores of the a randomly selected existing link. A non-existing link and an
pairs of non-existing and existing links should be compared. existing link are randomly selected to compare their similarity
The existing links among the 12 pairs have higher scores than scores. Among n independent comparisons, the existing link
those of the non-existing links in the 7 pairs: (a, e), (a, f ), has the same score n00 times and it has a higher score n0 times;
(a, g), (b, f ), (b, g), (c, g) and (d, g). The score equals the therefore, the AUC is as below:
score of a non-existing link, and e can only be found in the
AUC = (n0 + 0.5n00 )/n. (1)
existing link b. Thus, AUC = (7 + 0.5)/12 = 0.625, n0 = 7,
and n00 = 1. The following examples are utilized to explain the AUC
How much better the algorithm is than pure chance is calculation based on (1).
indicated by the degree to which the value exceeds 0.5. In fact,
the value of AUC should be approximately 0.5 if the scores IV. FEATURES OF NODE PAIRS FOR LINK PREDICTION
are obtained from an identical and independent distribution. Link prediction can be treated as a classification of node pairs
For link prediction in networks, regardless of the method based on their structural properties reflected by the features
implemented, the AUC is the most common measure for eval- of the node pairs. If there is a link between nodes j and i
VOLUME 6, 2018 28125

or aij = 0, then A = [aij ] is the neighboring matrix of the

network G, in which aij = 1. The same derivation can be
applied to the weighted links, in which the entries of A are
constant or ordinal values. Link prediction aims to identify a
similarity matrix S = [sij ], where sij is the probability that
link (i, j) exists. We first construct topological features for
each node pair ek = (i, j) to form a feature vector xk . The
research is intended to identify a function h that maps xk to FIGURE 2. Four types of triangles involving node vi . (a) OI type.
the target sij = h(xk ) so that sij = h(xk ). A systematic way (b) OO type. (c) IO type. (d) II type.
to incorporate the additional data is to add the information as
covariates. Network entities provide additional information 3. In degree of node vj .
The in degree of node vj is denoted as djin = ni=1 aij and
P
in real applications. In terms of simplicity, our discussion is
limited to only the relational statistics; however, the model is the number of edges pointing to node vj .
can be extended to address covariates. 4. PageRank of node vi .
Constructing the mapping function h, like generalizing the The PageRank algorithm is as follows:
inherent topological structures underlying the relationships of X Pt−1 (vj ) 1 − c
entities, is the key to a successful forecast. The relationships Pt (vi ) = c + , (3)
among nodes, for instance, common paths, neighbors and djout n
(vj ,vi )∈E
node properties, such as centrality and degree, depend on the
structure of the whole or an important portion of the node cen- where Pt (vi ) defines the probability of visiting node vi by
trality and network structure. Moreover, international knowl- a random walker in time step t. The damping factor c is
edge is usually required by the shortest distance between commonly set to 0.85. The random walker moves along the
nodes. For large complex networks, international network links of the network with probability c, conforming to the
knowledge is not available, and the whole network is not first term on the right-hand side, while jumping to a randomly
practical. It has become essential to identify the existence of chosen node with probability 1 − c, conforming to the second
linkages between nodes via local information only. Therefore, term.
to identify the closeness of two nodes among other node-pair 5. Leader-Follower Ranking for node vi .
characteristics, this study extends the concepts of clustering, This study aims to build an optimal ordering of nodes so
distribution sequence, neighborhood, and assortativity. The that a link is likely to point to a node with a higher ranking
node-pair characteristics can be compiled as a feature vector and originate from a node with a lower ranking to predict
for node pairs with known connectivity. Then, to predict the the direction of an edge between two nodes in a directed
existence of a connection from a particular feature vector, network. This ranking is called the Leader-Follower rank.
a model is constructed by AUC optimization. An algorithm is applied to the directed graph to calculate
For an ordered node pair (vi , vj ), which represents the the Leader-Follower rank of a node. The algorithm is slightly
potential directed link eij = (vi , vj ), we use the node-pair modified to allow two nodes to share an index in the list of
properties and node properties to construct a feature vector ordered rankings. Moreover, a highly ranked node receives a
with the following features: low score.
1. Out degree of node vi . 6. The clustering coefficient of node vi .
The out degree of node vi , denoted as diout , is simply the The ratio between the number of triangles that could have
number of edges starting from node vi . To calculate this been formed and the number of triangles node vi belongs to
value, we use the elements of the adjacent centmatrix: diout = can be used to define the clustering coefficient of node vi .
P n Since the edges of the triangles have directions, 4 types
j=1 aij .
2. Node betweenness for node vi . of triangles involving node vi are considered, as shown in
Proposed by the famous sociologist Linton Freeman, node Figure 2, where the 4 types of triangles are denoted OI, OO,
betweenness is a measure of the centrality of the node within IO and II.
the network [55]. Defined as B(vi ), the betweenness of node We define the 4 types of clustering coefficients of node vi
vi is the fraction of all the shortest paths between all the pairs as:
of nodes that penetrate node vi : NOI
C OI (vi ) = out (4)
X ni di × diin
uw NOO
B(vi ) = (2)
guw C OO (vi ) = out (5)
u6=vi =w di × diout
NII
Here, guw is the total number of directed shortest paths C II (vi ) = in (6)
from node u to w and niuw is the number of directed di × diin
shortest paths from node u to node w that pass through NIO
C IO (vi ) = in (7)
node vi . di × diout
28126 VOLUME 6, 2018

Here, NOI , NOO , NIO and NII are, respectively, the numbers Equation (12) as shown at the bottom of this page, yields
of triangles involving node vi of type OI , OO, IO and II , a coefficient that is the mean of all the links. All the calcula-
corresponding to Figure 2(a), (b), (c) and (d), respectively. tions of diout + djin can be dropped to approximate the local
7. The clustering coefficient of ordered node pair (vi , vj ). assortativity between the node pair (vi, vj) for M . M refers to
The ratio between the two nodes and the number of trian- the number of links in the network. The degree of correlation
gles can be used to define the clustering coefficient between rijOI of node pair (vi, vj) is defined as follows:
an ordered node pair (vi , vj ). Since the edges of the triangles
have directions, 4 types of triangles, OI, OO, IO and II, are 4diout djin − diout − djin
rijOI = . (13)
considered, as shown in Figure 3. 2(diout )2 + 2(djin )2 − diout − djin
From (13), we can see that rijOI ranges in 0 < rijOI < 1.
Similarly, we can define the degrees of correlation rijOO , rijII
and rijIO of node pair (vi , vj ) as follows:
4diout djout − diout − djout

rijOO = (14)
2(diout )2 + 2(djout )2 − diout − djout
4diin djin − diin − djin
FIGURE 3. Four types of triangles of clustering coefficients for ordered rijII = (15)
node pair (vi , vj ). (a) OI type. (b) OO type. (c) II type. (d) IO type. 2(diin )2 + 2(djin )2 − diin − djin
We define the 4 types of clustering coefficients of ordered 4diin djout − diin − djout
node pair (vi , vj ) as: rijIO = (16)
2(diin )2 + 2(djout )2 − diin − djout
P
OI h6=i,j aih ahj Calculating the local assortativity requires individual hope
Cij = (8)
diout × djin visibility in the network only, and evaluating rij presupposes
the existence of a link.
P
OO h6=i,j aih ajh
Cij = out (9) 9. Common neighbors.
di × djout
P The neighbors between the node pairs should considered
II h6=i,j ahi ahj in the degree of correlation and clustering. More informa-
Cij = (10)
diin × djin tion can be extracted by observing shared nodes in the
P neighborhood of each node while accounting for the non-
h6=i,j ahi ajh
IO
Cij = , (11) immediate neighbors in the network with a visibility scope
diin × djout greater than one hop. For an ordered node pair (vi, vj) in a
corresponding to Figure 3(a), (b), (c) and (d), respectively. directed network, we consider the directed paths from v to
The coefficient of node-pair clustering is the rate of the u and define the out-neighborhood of node v as 0out (v) =
common neighbors of nodes vj and vi in different directions, (w|w ∈ V , (v, w) ∈ E) and the in-neighborhood of node u as
as shown in Equation(8-11). The clustering coefficient can be 0in (u) = (w|w ∈ V , (w, u) ∈ E). The common neighbor of
determined with one hop visibility within the network only. ordered node pair (v, u) is defined as the set 0out (v) ∩ 0in (u).
8. Degree of Correlation. We call the set 0out
1 = 0 v the first-order out-neighbors of
out
The degree of correlation between two nodes is another node v, and define the i-th-order out-neighbors of node v,
property of node pairs. A majority of networks are considered denoted by 0out i (v), as the set of out-neighbors of the nodes
to show a mixture of disassortative or assortative pairs. In in 0out (v) that do not belong to any of the previous neighbor-
i−1
a network, the Pearson correlation coefficient of two nodes hoods, namely:

represents the assortativity of a node pair. Since there are Similarly, we can define 0in i (u) as the i-th-order in-
two types of degree for each node, we define 4 types of neighbors of node u.
degrees of correlation for a directed network, rijOI , rijOO , rijII We define CN 2 (v, u) = 0out (v) ∩ 0in (u) = 0out
1 (v) ∩ 0 1 (u)
in
and rijIO , corresponding to different combinations of the types as the 2n d-order common neighbors of ordered node pair
(v, u), and i
of degrees of the two nodes. For instance, degree correlation h the i-th-order icommon neighbors as CN (u, v) =
j
j+k=i 0out (v) ∩ 0in (u) . For instance, the 3 d-order
S k r
r OI is defined as:
(vi ,vj )∈E di dj /M − [ (vi ,vj )∈E 2 (di

out in 1 out
+ djin )/M ]2
P P
r OI
= P out 2 in P 1 out in
. (12)
i [(di ) + (dj )]/2M − [ i 2 (din + dj )/M ]
n o
0out
i
(v) = u|u ∈ V and ∃w ∈ 0vi−1 , (u, w) ∈ E and u ∈
/ ∪i−1 0 k
k=1 out (v) . (17)
VOLUME 6, 2018 28127

common neighbors of ordered node pair (v, u), CN 3 (v, u), vj from vi , which is:
is the combination of the intersection of 0out 1 (v) and 0 2 (u)
in
and that of 0out (v) and 0in (u). We use CN i (v, u) (i =
2 1 X 1
sRA
ij = (21)

dzin · dzout
2,i. . . , n)
1, as the features of ordered node pair (vi , vj ). Z ∈0(vi )∩0(vj )
CN (v, u) can be calculated via adjacent matrix A. For
instance, the 3r d-order common neighbors of ordered node The RA index assigns greater weight to less-connected
pair (v, u), CN 3 (v, u), can be calculated as follows: neighbors to refine the counting of common neighbors.
X Of the above features, 1 to 8 are based on node properties,
ahw awk ,
3
CN (v, u) = (18)

and the rest are based on ordered-node-pair properties. Except
h,k,w6=v,u,avh =aku =1 the Leader-Follower ranking, every feature has a similar ver-
sion in undirected networks. Therefore, our method can be
10. Jaccard coefficient.
applied to both directed and undirected networks.
The data are utilized to compare the diversity and similarity
of the neighboring sets. For an ordered node pair (vi , vj ),
V. THE LINK PREDICTION ALGORITHM
the Jaccard coefficient is the size of the intersection divided
The main idea of our algorithm is that we use the AUC
by the size of the neighbor sets 0in (vj ) and 0out (vi ), as shown
metric as the target function of optimization to obtain high-
in the formula below, and it is used to evaluate the similarity
quality link prediction results. AUC is a good measure of
of the neighboring sets of the two nodes.
performance. Several algorithms have been devoted to the
0out (vi ) ∩ 0in (vj ) optimization of AUC, mainly by reducing the surrogate con-

Jaccard
sij = (19) vex loss on a set of training data.
0out (vi ) ∪ 0in (vj )

11. Preferential attachment index (PA). A. THE OBJECTIVE FUNCTION

Preferential attachment can be utilized to create scale-free We treat link prediction as a binary classification problem
networks. The probability of a new link pointing to node vj is that is solved by maximizing the AUC. We use the tuple
proportional to djin , and the probability of a new link originat- (xij , yij ) to represent an ordered node pair (vi , vj ), where
ing from node vi is proportional to diout . A similar mechanism xij = (xij1 , xij2 , . . . , xijm ) is a vector consisting of m features
can produce scale-free networks without any development, of the node pair. The feature vectors of all the nodes form an
in which a new link is created and an old link is eliminated in m ∗ n feature matrix X = [xij ]. yij is the label indicating the
each time step. Tvj is proportional to diout × djin , and the new existence of a directed link between ordered node pair (vi , vj ).
link is connected to vi . The related similarity index motivated If there is a directed link between the node pair, then yi = 1;
by this mechanism can be obtained as below: otherwise, yi = −1. We set our prediction model as:
sPA out
ij = di × djin . (20) h(xij ) = sij = wi xijT , (22)
This formula can be used to quantify the importance of where wi = (wi1 , wi2 , . . . , wim ) is the weight vector, and wij
the links subject to different dynamics based on networks, reflects the importance of the j − th feature on the prediction
for instance, transportation, synchronization, and percolation. results. Our goal is to find the optimal weight vector w∗i such
This feature does not require information about the neighbor- that sij and yij are as close as possible. Let W = [wij ] be an
hood of every node; therefore, the computation is simple. n ∗ n weight matrix, and let S = [sij ] be the similarity matrix.
12. Resource allocation index (RA). The goal is then to identify a function h that maps A to the
The dynamics of resource allocation on the complex net- target S such that S = h(X ) = WX T . Namely formula (23)
works motivate this index. For an ordered node pair (vi , vj ) and (24).
that is not directly connected, node vi can send resources to vj Here, we use P = N (N − 1)/2 to denote the number of
with its neighbors as transmitters. The simple case assumes links in the universal set containing all possible links. The
that each transmitter has a resource unit and will distribute number of links in the network is M . I (b) is an indicator
the information to all its neighbors. The RA index between function that returns 1 if the argument is 0 and true, and the
vj and vi can be defined as the quantity of resources sent to set of neighbors of node vk can be denoted as 0(vk ), with
n
1 X X X 1
AUC = [I (ski > skj ) + I (ski = skj )] (23)
M · (P − M ) 2
/
k=1 vi ∈0(vk ) vj ∈0(v k)
n
1 X X X 1
AUC = [I (h(xki ) > h(xkj )) + I (h(xki ) = h(xkj ))]. (24)
M · (P − M ) 2
/
k=1 vi ∈0(vk ) vj ∈0(v k)
28128 VOLUME 6, 2018

0(vk ) = v|(vk , v) ∈ E. Algorithm AUC_LP (Link Prediction Based on AUC)

( Input: A: Adjacency matrix of the network;
1 b = true
I (b) = (25) Output: S: Scoring matrix between nodes;
0 b = flase Begin:
Since the direct optimization of AUC can be cast as inte- 1. Matrix A regularization;
grated optimization, it usually results in an NP-hard problem. 2. Parameter initialization;
A convex optimization problem that minimizes the objective 3. Construct the feature vectors for each node pair, store in
functions, as below, can approximate Formula (26), as shown a matrix X ;
at the bottom of this page. 4. Set the initial values of parameters W ;
Here, w = [wij ] is the weight matrix, k·kF is the Frobenius 5. While not termination condition do;
norm, ι is the loss function and λ is the regular coefficient. 6. For every node vk in V do;
In our algorithm, we use the squared loss function ι(x) = 7. update wk according to formula (31);
(1 − x)2 , and the term ι[h(xki ) − h(xkj )] is as follows: [1 − 8. End;
wk (xki − xkj )T ]2 . 9. End While;
Therefore, the loss function L(w) can be rewritten as For- 10. Output the result S = W · X T ;
mula (27), as shown at the bottom of this page. End
We define (Formula (28)) as, as shown at the bottom of this
page.
TABLE 1. The topological characteristics of the largest components of the
Then: four network samples.
Xn
L(w) = Lk (wk ). (29)
k=1
The aim of our algorithm is to obtain the optimal weight
vector w∗i by minimizing Lk (wk ) for k = 1, 2, . . . , n. We use
gradient descent to minimize Lk (wk ), and we initially obtain
the gradient as Formula (30), as shown at the bottom of this
page.
(0) (USAir), political blogs (PB), a protein-protein interaction
After setting the initial values of wk , we can update the
(k)
value of wk by the following formula: network (PPI) and the Internet (INT). This research tests the
largest connected component in each dataset. The topological
∂Lk characteristics of the largest components of the networks are
wt+1 = wtk − ηt · . (31)
k
∂wk shown in Table 1. NUMC is the number of associated com-
Such iterations can be conducted until convergence or until ponents in the network and the size of the largest one. M and
the number of iterations exceeds a constant integer N . N are the total numbers of links and nodes, respectively. For
instance, 1222/2 in dataset PB indicates that the network has
B. THE AUC _LP ALGORITHM’S FRAMEWORK FOR LINK 2 associated components and the largest one has 1222 nodes.
PREDICTION e is the efficiency of the network, K is the mean degree of
In this section, we present the AUC_LP algorithm. the network, and r and C are the assortative coefficient and
clustering coefficient, respectively.
VI. EXPERIMENTAL RESULTS AND ANALYSIS We begin by studying the reliance of the forecasting per-
This section discusses the four sets of benchmark data repre- formance on parameters λ ∈ (0, 1] and η ∈ (0, 1] in each
senting networks from disparate areas: the US airport network dataset. The results are shown as heatmaps in Figure 4.
n
λ 1 X X X
L(w) = kwk2F + ι[h(xki ) − h(xkj )]. (26)
2 M · (P − M )
/
k=1 vi ∈0(vk ) vj ∈0(vk)
n
λ 1 X X X
L(w) = kwk2F + [1 − wk (xki − xkj )T ]2 . (27)
2 M · (P − M )
/
k=1 vi ∈0(vk ) vj ∈0(vk)
λ 1 X X
Lk (wk ) = kwk k2F + [1 − wk (xki − xkj )]2 . (28)
2 M · (P − M )
/
vi ∈0(uk ) vj ∈0(u k)
∂Lk 1 X X
= λwk − {2(xki − xkj ) − 2wk (xki − xkj )(xki − xkj )T }. (30)
∂wk M · (P − M )
/
vi ∈0(uk ) vj ∈0(u k)
VOLUME 6, 2018 28129

FIGURE 4. The dependence of the AUC on parameters λ and η in different datasets.
TABLE 2. Comparison of the accuracy of algorithms quantified by AUC. TABLE 3. Comparison of the accuracy of algorithms in terms of precision.
In Figure 4, we find that a maximum AUC still exists when The results from the 11 methods show that the AUC_LP
λ and η are not simultaneously small, and the optimal AUC algorithm outperforms the other ten algorithms in terms of
under all possible values of λ and η occurs in the region where precision. Specifically, compared with those of CN, the pre-
they are greater than 0.3. Specifically, the optimal parameters cision in improved by 13.27% for USAir, 18.19% for PB,
are λ=0.6 and η=0.3 in USAir, λ = 0.5 and η = 0.4 in PB, 164.04% for PPI and 1104.80% for INT. When compared
λ = 0.35 and η = 0.75 in PPI, and λ = 0.9 and η = 0.4 in INT. against Salton, Jaccard and LHN_I, the improvement is even
In addition, the highest accuracy is achieved by the algo- greater. AUC_LP’s performance for dissimilar networks is
rithm when using AUC as the metric under the optimal especially strong: it is either perfect or almost perfect. By con-
parameters λ and η. The mean AUC for the various algorithms trast, no other algorithms are successful.
in the 10-fold cross-validation (CV) are shown in Table 2. The In addition, we also compare our method with some
highest AUC score of the 11 algorithms for each dataset is representative methods (mainly based on community detec-
highlighted in bold. tion, clustering coefficient, and time window) to demon-
As shown in Table 2, AUC_LP index has the best overall strate the performance of our method. Figure 5 presents
performance among the local indices. For example, compared the prediction results four networks obtained with seven
with those of CN, the average AUC is improved by 1.12% research approaches (S-Rank, CMA-ES, VB, Node-LP, WT,
for USAir, 4.64% for PB, 7.48% for PPI, and 51.54% for ConClus-LP, and CD-LP). The study is performed with 90%
INT. When compared against Jaccard, the average AUC is of the data collected from the networks used as the training
improved by 5.28% for USAir, 12.09% for PB, 7.01% for set and the rest used for the test set.
PPI, and 49.92% for INT. When the comparison is against PA, In addition, the size of the training dataset was increased
the improvement is even greater: 30.35% for USAir, 28.60% from 50% to 90%. The results of the different methods with
for PB, 8.13% for PPI, and 51.54% for INT. Therefore, different training dataset sizes are shown in Figure 6 and
AUC_LP achieves the best accuracy, as measured by AUC. Figure 7.
Besides AUC, precision is another important metric. The According to the test results, the efficiency of the
mean precision of the algorithms in the 10-fold CV is shown approaches can be improved by selecting appropriate network
in Table 3. The highest precision achieved by the algorithms characteristics. Moreover, the prediction accuracy is unre-
in each dataset is shown in bold. lated to the size of the network. Specifically, the prediction
28130 VOLUME 6, 2018

FIGURE 5. The comparison of the accuracy of algorithms according to AUC and Precision in different
datasets.
FIGURE 6. Comparison of the average AUC of different algorithms with different training dataset sizes.
accuracy can be improved if the study targets a set of net- can provide insight into reactions to disturbances, such as the
works. extinction of a species [38]. Network robustness is helpful
Extensive experiments have been performed to observe for biologists to explore mutations and diseases and also
networks and to verify the validity of the AUC_LP algorithm for the recovery of mutations [39]. The network robustness
compared with other link prediction algorithms. The results principles help to promote the understanding of the risks and
of this study can be used to assess other network-prediction stability of banking systems [40]. Additionally, in engineer-
algorithms. Moreover, this study confirms previous research ing, network robustness is helpful to assess the resilience of
results and simultaneously provides a reference for the infrastructure mechanisms, for example, power grids or the
study of information screening technology and relevant Internet [41].
professionals. Most networks are not random networks. Although the
number of nodes is small, there can be a massive num-
A. ROBUSTNESS ANALYSIS OF THE ALGORITHM ber of connections. The majority of networks are quite
Robustness refers to the capacity to endure perturbations and small, in general, consistent with the law promoted by
failures and serves as the key attribute to many complex Zipf [42]. A complex network with a power law dis-
mechanisms, including complex networks. tribution and degree distribution is called a scale-free
Studies of the robustness of complex mechanisms are of network.
great importance in many fields [37]. Robustness serves as an Scale-free networks are seldom impacted by random fail-
important attribute of ecosystems in the field of ecology and ures. Usually, the small nodes are randomly selected. Even
VOLUME 6, 2018 28131

FIGURE 7. Comparison of the average precision of different algorithms with different training dataset sizes.
FIGURE 8. The relationship between the degrees of nodes and the number of nodes.
when these nodes are randomly removed, the basic connectiv- MA-CR, namely, the memetic algorithm, through a two-level
ity is maintained. A scale-free network is highly vulnerable to learning strategy to promote network community robustness
deliberate attacks compared with random networks because while maintaining the community structure and degree distri-
of the non-uniformity, fault tolerance and survivability. The bution with the goal of decreasing two-level targeted attacks.
nodes of a scale-free network do not have to be removed if Research on real-world networks and scale-free networks
the strongest central nodes with the maximum connectivity in has assessed the stability and effectiveness of the proposed
the system maintain a strong influence on the connectivity of algorithms relative to other high-end algorithms. Baig and
the whole network and satisfy the critical point. Afterward, Akoglu [46] conducted the first large-scale empirical corre-
the network is separated into isolated islands that are inca- lation analysis of attack strategies, i.e., the node importance
pable of contacting each other. metrics they adopt for the graph robustness.
Thus, many studies have proposed robust evaluation This task is approached in three ways and illustrates that
indexes and designs to prevent intentional and robust attacks. some computationally complex strategies can be closely pre-
For instance, De Meo et al. [43] found that graph robust- dicted by simpler strategies. Moreover, a few strategies can
ness can be predicted rapidly via the Randic index. Alenazi be adopted to serve as a close proxy for consensus.
and Sterbenz [44] claimed that adding links to maintain the The original dataset is analyzed first during the robustness
balance of the link-betweenness would lead to the best net- analysis. Additionally, the relationship between the degrees
work resilience against the aforementioned attacks among of the nodes and the number of nodes in the original network
the robustness metrics studied. Liu et al. [45] created a new must be analyzed. Figure 8 presents the experimental results.
28132 VOLUME 6, 2018

FIGURE 9. Analysis of the nodes’ degree in the training set and probe set.
FIGURE 10. The error of W between adjacent time periods.
Figure 8 shows an inverse proportional relationship serves as the ordinate. Thus, the partitioning of the dataset can
between the node degree and number of nodes. Therefore, be seen as a network attack. The ratio of training set is set to
in these networks, when the node degree is small, there is 50%, and correlation of the node degree is obtained randomly.
a greater number of occurrences. By contrast, a large node Figure 9 (a-d) presents the results of these four datasets. The
degree is associated with a smaller number of occurrences. scatter plots in panels (a) to (d) demonstrate the node degree
The distribution of the four network degrees generally in the NP, that is, the training set, and NT, that is, the probe
follows the Zipf law. Therefore, the four network degrees set, when the dataset is randomly classified. Panel (e) and
mentioned above can be regarded as scale-free networks. panel (h) present the relationship between the changes in the
Malliaros et al. [37] considered scale-free networks under ratio of the training set and the nodes’ degrees in the probe
deliberate and random attacks. Furthermore, the robustness set and training set. The red line corresponding with the sets
of the network connectivity with the two class node removal represents random classification.
strategies was compared. The results showed that scale-free Figure 9 shows that NP and NT have an almost linear
networks are robust to random node failures, so the robustness relationship because the links are selected as training set and
of these four networks can be analyzed preliminarily. probe set with equal probability. Additionally, C–Node is
Moreover, because connections between these objects may approximately 1 in the datasets divided randomly as training
exist in both the test set and the training set, we need to set ratio changes.
consider the number of node links in the training set, which By comparing the network topology properties of the orig-
serves as the abscissa, and the node links in the test set, which inal dataset and the training set, we can conclude that the
VOLUME 6, 2018 28133

FIGURE 11. The change in the AUC of the AUC_LP algorithm with NC .
scheme used to divide datasets in a deliberate attack does Therefore, the AUC_LP algorithm not only achieves higher
not affect the original network degree distribution. The basic AUC values on real networks than those of other algorithms
characteristics of the original network can be maintained but is also robust.
while achieving high robustness.
VII. CONCLUSION
Various areas, including anthropology, sociology, computer
B. CONVERGENCE ANALYSIS OF THE ALGORITHM science, and information science, have recently paid great
We also study the convergence conditions of the algorithm. attention to the issue of link prediction due to the current
Formula (28) is convex because it is a quadratic loss function. large number of network statistics in electronic form. A novel
Therefore, a gradient descent strategy (31) can be used to test framework is proposed to assess the performance of a link
the convergence of the extreme points of Formula (28). prediction algorithm in this article. In the algorithm, the score
The rate of convergence depends on the step size µt . If the matrix is effectively decomposed with the weight matrix and
step size is too large, convergence to the optimal solution the adjacency matrix. We use the AUC metric as the target
may be difficult, but a small step size may lead to slow function of optimization and obtain the optimal weight matrix
convergence. To achieve a better convergence rate, we use a via the gradient descent algorithm. According to the empirical
variable step size determined by the Armijo rule [47]. findings, the AUC_LP algorithm yields high-quality forecast-
In the process of the iteration of weight matrix W , we cal- ing results compared with those of other algorithms. Thus,
culate the error of W between
adjacent time 2periods. The error the findings in this article are beneficial because they can
formula is: Error (t) = W (t) − W (t−1) F . Then, we plot help Internet retailers to better utilize the current forecasting
the relation between the error and the iteration number of approaches.
the algorithm for different datasets. The results are shown
in Figure 10. REFERENCES
Figure 10 shows that in the iteration of weight matrix W , [1] B. Barzel and A.-L. Barabási, ‘‘Network link prediction by global silencing
the error trend presents a sharp decrease and then rapidly of indirect correlations,’’ Nature Biotechnol., vol. 31, no. 8, pp. 720–725,
reaches the state of convergence. Therefore, the iteration time 2013.
[2] L. Zhu, D. Guo, J. Yin, G. Ver Steeg, and A. Galstyan, ‘‘Scalable temporal
complexity of this step is O(NC), which is equivalent to O(1). latent space inference for link prediction in dynamic social networks,’’
Finally, we study how the AUC of the AUC_LP algorithm IEEE Trans. Knowl. Data Eng., vol. 28, no. 10, pp. 2765–2777, Oct. 2016.
under optimal parameters λ and η changes with the number [3] A. Papadimitriou, P. Symeonidis, and Y. Manolopoulos, ‘‘Fast and accurate
link prediction in social networking systems,’’ J. Syst. Softw., vol. 95, no. 9,
of iterations NC for different datasets. The results are shown pp. 2119–2132, 2012.
in Figure 11. [4] J. Tang, S. Chang, C. Aggarwal, and H. Liu, ‘‘Negative link prediction in
From Figure 11, we can see that the AUC of the AUC_LP social media,’’ in Proc. 8th ACM Int. Conf. Web Search Data Mining, 2015,
pp. 87–96.
algorithm shows an increasing trend and always maintains a
[5] K. Jahanbakhsh, V. King, and G. C. Shoja, ‘‘Predicting missing contacts
high value as NC increases in the four datasets. For example, in mobile social networks,’’ Pervasive Mobile Comput., vol. 8, no. 5,
the AUC value remains constant at approximately 0.935 when pp. 698–716, 2012.
the number of iterations is greater than 80 for the USAir [6] W. Liang, X. He, D. Tang, and X. Zhang, ‘‘S-Rank: A supervised rank-
ing framework for relationship prediction in heterogeneous information
dataset. For the PB dataset, the AUC is maintained at approxi- networks,’’ in Proc. Int. Conf. Ind., Eng. Other Appl. Appl. Intell. Syst.,
mately 0.940 when the number of iterations is greater than 75. Springer, 2016, pp. 305–319.
28134 VOLUME 6, 2018

[7] H. Hwang, D. Petrey, and B. Honig, ‘‘A hybrid method for protein–protein [33] Q.-M. Zhang, A. Zeng, and M.-S. Shang, ‘‘Extracting the information
interface prediction,’’ Protein Sci., vol. 25, no. 1, pp. 159–165, 2016. backbone in online system,’’ PLoS ONE, vol. 8, no. 5, p. e62624, 2013.
[8] P. Wang, B. Xu, Y. Wu, and X. Zhou, ‘‘Link prediction in social networks: [34] G. P. Clemente and R. Grassi, ‘‘Directed clustering in weighted net-
The state-of-the-art,’’ Sci. China Inf. Sci., vol. 58, no. 1, pp. 1–38, 2015. works: A new perspective,’’ Chaos, Solitons Fractals, vol. 107, pp. 26–38,
[9] M. E. J. Newman, ‘‘Clustering and preferential attachment in growing Feb. 2018.
networks,’’ Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. [35] S. Agreste et al., ‘‘An empirical comparison of algorithms to find commu-
Top., vol. 64, no. 2, p. 025102, 2001. nities in directed graphs and their application in Web data analytics,’’ IEEE
[10] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Trans. Big Data, vol. 3, no. 3, pp. 289–306, Sep. 2017.
‘‘Network motifs: Simple building blocks of complex networks,’’ Science, [36] C. P. Santos, D. M. Carvalho, and M. C. V. Nascimento, ‘‘A consensus
vol. 298, no. 5594, pp. 824–827, 2002. graph clustering algorithm for directed networks,’’ Expert Syst. Appl.,
[11] S. A. Golder and S. Yardi, ‘‘Structural predictors of tie formation in twitter: vol. 54, pp. 121–135, Jul. 2016.
Transitivity and mutuality,’’ in Proc. IEEE 2nd Int. Conf. Social Comput. [37] F. D. Malliaros, V. Megalooikonomou, and C. Faloutsos, ‘‘Estimating
(SocialCom), Aug. 2010, pp. 88–95. robustness in large social graphs,’’ Knowl. Inf. Syst., vol. 45, no. 2,
[12] D. Yin, L. Hong, X. Xiong, and Brian D. Davison, ‘‘Link formation pp. 645–678, 2015.
analysis in microblogs,’’ in Proc. 34th Int. ACM SIGIR Conf. Res. Develop. [38] R. V. Solé and M. Montoya, ‘‘Complexity and fragility in ecological
Inf. Retr., 2011, pp. 1235–1236. networks,’’ Proc. Roy. Soc. London B, Biol. Sci., vol. 268, no. 1480,
[13] D. Garlaschelli and M. I. Loffredo, ‘‘Patterns of link reciprocity in directed pp. 2039–2045, 2001.
networks,’’ Phys. Rev. Lett., vol. 93, no. 26, p. 268701, 2004. [39] A. E. Motter, N. Gulbahce, E. Almaas, and A.-L. Barabási, ‘‘Predicting
[14] Y.-X. Zhu, X.-G. Zhang, G.-Q. Sun, M. Tang, T. Zhou, and Z.-K. Zhang, synthetic rescues in metabolic networks,’’ Mol. Syst. Biol., vol. 4, no. 1,
‘‘Influence of reciprocal links in social networks,’’ PLoS ONE, vol. 9, no. 7, p. 168, 2008.
p. e103007, 2014. [40] A. G. Haldane and R. M. May, ‘‘Systemic risk in banking ecosystems,’’
[15] D. M. Romero and J. M. Kleinberg, ‘‘The directed closure process in hybrid Nature, vol. 469, no. 7330, pp. 351–355, 2011.
social-information networks, with an analysis of link formation on twitter,’’ [41] R. Albert, I. Albert, and G. L. Nakarado, ‘‘Structural vulnerability of the
in Proc. ICWSM, 2010, pp. 138–145. North American power grid,’’ Phys. Rev. E, Stat. Phys. Plasmas Fluids
[16] D. Schall, ‘‘Link prediction in directed social networks,’’ Social Netw. Relat. Interdiscip. Top., vol. 69, no. 2, p. 025103, 2004.
Anal. Mining, vol. 94, no. 1, p. 157, 2014. [42] M. E. J. Newman, ‘‘Power laws, Pareto distributions and Zipf’s law,’’
[17] Q.-M. Zhang, L. Lü, W.-Q. Wang, Y. Xiao, and T. Zhou, ‘‘Potential theory Contemp. Phys., vol. 46, no. 5, pp. 323–351, 2005.
for directed networks,’’ PLoS ONE, vol. 8, no. 2, p. e55437, 2013. [43] P. De Meo et al., ‘‘Estimating graph robustness through the Randic index,’’
[18] R. N. Lichtenwalter and N. V. Chawla, ‘‘Vertex collocation profiles: Sub- IEEE Trans. Cybern., vol. 10, pp. 1–14, 2017.
graph counting for link analysis and prediction,’’ in Proc. 21st Int. Conf. [44] M. J. F. Alenazi and J. P. G. Sterbenz, ‘‘Evaluation and comparison of
World Wide Web, 2012, pp. 1019–1028. several graph robustness metrics to improve network resilience,’’ in Proc.
[19] X. Li and H. Chen, ‘‘Recommendation as link prediction in bipartite 7th Int. Workshop IEEE Reliable Netw. Des. Modeling (RNDM), Oct. 2015,
graphs: A graph kernel-based machine learning approach,’’ Decision Sup- pp. 7–13.
port Syst., vol. 54, no. 2, pp. 880–890, 2013. [45] W. Liu, M. Gong, S. Wang, and L. Ma, ‘‘A two-level learning strategy based
[20] S. Scellato, A. Noulas, and C. Mascolo, ‘‘Exploiting place features in memetic algorithm for enhancing community robustness of networks,’’ Inf.
link prediction on location-based social networks,’’ in Proc. 17th ACM Sci., vol. 422, pp. 290–304, Jan. 2018.
SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 1046–1054. [46] M. B. Baig and L. Akoglu, ‘‘Correlation of node importance measures: An
[21] J. Scripps, P.-N. Tan, F. Chen, and A.-H. Esfahanian, ‘‘A matrix alignment empirical study through graph robustness,’’ in Proc. 24th Int. Conf. World
approach for link prediction,’’ in Proc. 19th Int. Conf. Pattern Recognit. Wide Web, 2015, pp. 275–281.
(ICPR), Dec. 2008, pp. 1–4. [47] L. Armijo, ‘‘Minimization of functions having Lipschitz continuous first
[22] M. Pujari and R. Kanawati, ‘‘Link prediction in complex networks by partial derivatives,’’ Pacific J. Math., vol. 16, pp. 1–3, Nov. 1966.
supervised rank aggregation,’’ in Proc. IEEE 24th Int. Conf. Tools Artif.
Intell. (ICTAI), Nov. 2012, pp. 782–789.
[23] K.-Y. Chiang, N. Natarajan, A. Tewari, and I. S. Dhillon, ‘‘Exploiting
longer cycles for link prediction in signed networks,’’ in Proc. 20th ACM
Int. Conf. Inf. Knowl. Manage., 2011, pp. 1157–1162.
[24] C. A. Bliss, M. R. Frank, C. M. Danforth, and P. S. Dodds, ‘‘An evolution-
BOLUN CHEN received the Ph.D. degree from the
ary algorithm approach to link prediction in dynamic social networks,’’ Nanjing University of Aeronautics and Astronau-
J. Comput. Sci., vol. 5, no. 5, pp. 750–764, 2014. tics, Nanjing, China, in 2016. He was a Visiting
[25] R. R. Sarukkai, ‘‘Link prediction and path analysis using Markov chains,’’ Scholar with the University of Fribourg, Fribourg,
Comput. Netw., vol. 33, nos. 1–6, pp. 377–386, 2000. Switzerland. He is a Professor with the Huaiyin
[26] Y. Bhawsar, G. S. Thakur, and R. S. Thakur, ‘‘Model for link prediction in Institute of Technology, Lianyungang, China. His
social network by genetic algorithm approach,’’ Data Mining Knowl. Eng., research interests include link prediction, recom-
vol. 7, no. 5, pp. 191–196, 2015. mender systems, and data mining. He is currently
[27] Y. Zhao, S. Li, C. Zhao, and W. Jiang, ‘‘Link prediction via a neighborhood- a reviewer of the Journal of Information Science,
based nonnegative matrix factorization model,’’ in Proc. 3rd Int. Conf. World Wide Web Journal, and Reliability Engi-
Commun., Signal Process., Syst., 2015, pp. 603–611. neering & System Safety.
[28] M. Pujari and R. Kanawati, ‘‘Supervised rank aggregation approach for
link prediction in complex networks,’’ in Proc. 21st Int. Conf. World Wide
Web, 2012, pp. 1189–1196.
[29] B. Ermis and A. T. Cemgil. (2014). ‘‘A Bayesian tensor factorization
model via variational inference for link prediction.’’ [Online]. Available:
https://arxiv.org/abs/1409.8276
YONG HUA received the M.S. degree from the
[30] C. Dai, L. Chen, and B. Li, ‘‘Link prediction based on sampling in complex Huaiyin Institute of Technology, Lianyungang,
networks,’’ Appl. Intell., vol. 47, no. 1, pp. 1–12, 2017. China. His research interests include maximiza-
[31] L. Linyuan, P. Liming, Z. Tao, Z. Yi-Cheng, and H. E. Stanley, ‘‘Toward tion of influence of complex network and recom-
link predictability of complex networks,’’ Proc. Nat. Acad. Sci. USA, mender systems.
vol. 112, no. 8, pp. 2325–2330, 2014.
[32] L. Li, L. Qian, J. Cheng, M. Ma, and X. Chen, ‘‘Accurate similarity index
based on the contributions of paths and end nodes for link prediction,’’
J. Inf. Sci., vol. 41, no. 2, pp. 167–177, 2015.
VOLUME 6, 2018 28135

YAN YUAN received the M.S. degree from the YING JIN is currently a Professor with the Huaiyin
Huaiyin Institute of Technology, Lianyungang, Institute of Technology, Lianyungang, China. Her
China. Her research interests include complex net- research interests include deep learning, recom-
work, data analysis, and machine learning. mender systems, and data analysis.
28136 VOLUME 6, 2018

AUC in Link Predicction

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

AUC in Link Predicction

Uploaded by

Copyright:

Available Formats

Received April 13, 2018, accepted May 12, 2018, date of publication May 18, 2018, date of current

version June 19, 2018.

Link Prediction on Directed Networks

Corresponding author: Yong Hua (huayong1994@126.com)

I. INTRODUCTION between people can be found through the analysis of social

II. RELATED WORK In addition, several learning approaches related to predict-

VOLUME 6, 2018 28123

28124 VOLUME 6, 2018

the results of link prediction. As a significant measure of

VOLUME 6, 2018 28125

or aij = 0, then A = [aij ] is the neighboring matrix of the

28126 VOLUME 6, 2018

4diout djout − diout − djout

a network, the Pearson correlation coefficient of two nodes hoods, namely:

(vi ,vj )∈E di dj /M − [ (vi ,vj )∈E 2 (di

VOLUME 6, 2018 28127

11. Preferential attachment index (PA). A. THE OBJECTIVE FUNCTION

28128 VOLUME 6, 2018

0(vk ) = v|(vk , v) ∈ E. Algorithm AUC_LP (Link Prediction Based on AUC)

VOLUME 6, 2018 28129

FIGURE 4. The dependence of the AUC on parameters λ and η in different datasets.

28130 VOLUME 6, 2018

VOLUME 6, 2018 28131

28132 VOLUME 6, 2018

FIGURE 10. The error of W between adjacent time periods.

VOLUME 6, 2018 28133

28134 VOLUME 6, 2018

VOLUME 6, 2018 28135

28136 VOLUME 6, 2018

You might also like