
Charu C. Aggarwal, Peixiang Zhao, and Gewen He

Florida State University
IBM T. J. Watson Research Center

Edge Classification in Networks

ICDE Conference, 2016


Introduction

• We consider in this paper the edge classification problem in networks, which is defined as follows.

• Given a graph-structured network G(N, A), where N is a set of vertices and A ⊆ N × N is a set of edges, a subset Al ⊆ A of the edges is labeled a priori.

• Determine the labels of the edges in Au = A \ Al, which are unknown.
Applications

• The edge classification problem has numerous applications in a wide array of network-centric scenarios:

– A social network may contain relationships of various types, such as friends, families, or even rivals. A subset of the edges may be labeled with the relevant relationships. The determination of unlabeled relationships can be useful for making targeted recommendations.

– An adversarial or terrorist network may contain unknown relationships between its participants, although a subset of the relationships is known already.

– In some cases, the edges may have numerical quantities associated with them, corresponding to strengths of relationships or the specification of ratings, likes, or dislikes.
Contributions

• We formulate the edge classification problem and the edge regression problem in graph-structured networks.

• We propose a structural similarity model for edge classification.

• In order to support edge classification in large-scale networks that are either disk-resident or stream-like, we devise probabilistic, min-hash based edge classification algorithms that are efficient and cost-effective without significantly compromising classification accuracy.

• We carry out extensive experimental studies on different real-world networks, and the experimental results verify the efficiency and accuracy of our edge classification techniques.
Problem Definition

• We provide a formal model for edge classification in networks.

• We assume that we have an undirected network G = (N, A), where N denotes the set of vertices, A denotes the set of edges, and n = |N|, m = |A|.

• A subset of edges in A, denoted by Al, is associated with edge labels.

• The label of the edge (u, v) is denoted by luv.

• The remaining subset of unlabeled edges is denoted by Au = A \ Al.
Formal Definitions

• Given an undirected network G = (N, A) and a set Al ⊆ A of labeled edges, where each edge (u, v) ∈ Al has a label luv, the edge classification problem is to determine the labels for the edges in Au = A \ Al.

• In some scenarios, it is also possible for the edges in A to be labeled with numeric quantities.

• In this case, edge classification becomes the edge regression problem in networks, defined as follows:

• Consider an undirected network G = (N, A) and a set Al ⊆ A of edges, each of which is annotated with a numerical label luv ∈ R. The edge regression problem is to determine the numerical labels for the edges in Au = A \ Al.
Overview

• We will propose a structural similarity model for edge classification.

• This approach shares a number of similarities with nearest-neighbor classification, except that the structural similarity of one edge to another is much harder to define in the context of a network.
Overview of Steps

• Consider an edge (u, v) ∈ Au, which needs to be classified within the network G = (N, A). The steps are as follows (a minimal sketch in code follows this list):

– The top-k most similar vertices to u, denoted by S(u) = {u1 . . . uk}, are first determined based on a structural similarity measure discussed slightly later.

– The top-k most similar vertices to v, denoted by S(v) = {v1 . . . vk}, are determined based on the same structural similarity measure as in the case of vertex u.

– The edges in Al ∩ [S(u) × S(v)] are selected, and the dominant class label of these edges is assigned as the relevant edge label for the edge (u, v).
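
As a rough illustration, the following Python sketch implements the three steps above. It assumes a pairwise vertex similarity function sim(u, v), which is defined on the following slides; the names classify_edge, vertices, and labeled are our own illustrative choices, not from the paper.

```python
from collections import Counter

def classify_edge(u, v, vertices, labeled, sim, k):
    """Predict the label of edge (u, v): find the top-k vertices most
    similar to u and to v, then take the dominant label among the
    labeled edges running between the two neighbor sets.
    `labeled` maps an edge (x, y) to its label; `sim` is a pairwise
    vertex similarity function."""
    S_u = sorted((x for x in vertices if x != u),
                 key=lambda x: sim(u, x), reverse=True)[:k]
    S_v = sorted((x for x in vertices if x != v),
                 key=lambda x: sim(v, x), reverse=True)[:k]
    votes = Counter()
    for ur in S_u:
        for vq in S_v:
            # Check both orientations, since the network is undirected.
            label = labeled.get((ur, vq), labeled.get((vq, ur)))
            if label is not None:
                votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None
```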
Similarity Computation

• How are the most similar vertices to a given vertex u ∈ N determined in a network?

• Here, the key step is to define a pair-wise similarity function for vertices.

• We consider the Jaccard coefficient for vertex-wise similarity quantification.
Similarity Computation

• Let I−(u) ⊆ N be the set of vertices incident on the vertex u through edges with label −1.

• Let I+(u) ⊆ N be the set of vertices incident on u through edges with label 1.

• The Jaccard coefficient of vertices u and v on the positive edges (bearing the edge label 1), J+(u, v), is defined as follows:

J+(u, v) = |I+(u) ∩ I+(v)| / |I+(u) ∪ I+(v)|   (1)
Similarity Computation (Contd.)

• The Jaccard coefficient on the negative edges can be defined in an analogous way, as follows:

J−(u, v) = |I−(u) ∩ I−(v)| / |I−(u) ∪ I−(v)|   (2)

• As a result, the similarity between vertices u and v can be defined by a weighted average of the values of J+(u, v) and J−(u, v).

• The most effective way to achieve this goal is to use the fraction of vertices belonging to the two classes.
Weighted Jaccard

• The fraction of vertices incident on u belonging to the edge label 1 is given by f+(u) = |I+(u)| / (|I+(u)| + |I−(u)|).

• Similarly, the value of f−(u) is defined as 1 − f+(u).

• The average fractions across both vertices are defined by f+(u, v) = (f+(u) + f+(v))/2 and f−(u, v) = (f−(u) + f−(v))/2.

• Therefore, the overall Jaccard similarity J(u, v) of the vertices u and v is defined by the weighted average (a sketch in code follows):

J(u, v) = f+(u, v) · J+(u, v) + f−(u, v) · J−(u, v)   (3)
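
A small Python sketch of Eqs. (1)-(3), under the assumption that the positive and negative neighbor sets are stored in dictionaries I_pos and I_neg (our names, not the paper's):

```python
def jaccard(a, b):
    """Jaccard coefficient of two vertex sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_jaccard(u, v, I_pos, I_neg):
    """Overall similarity J(u, v) of Eq. (3). I_pos[x] and I_neg[x]
    hold the sets of vertices joined to x by edges labeled 1 and -1."""
    def f_pos(x):
        # f+(x): fraction of x's labeled neighbors on positive edges.
        pos, neg = len(I_pos[x]), len(I_neg[x])
        return pos / (pos + neg) if pos + neg else 0.0
    f_p = (f_pos(u) + f_pos(v)) / 2.0  # f+(u, v)
    f_n = 1.0 - f_p                    # f-(u, v), since f-(x) = 1 - f+(x)
    return (f_p * jaccard(I_pos[u], I_pos[v])
            + f_n * jaccard(I_neg[u], I_neg[v]))
```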


Differential Weight

• An important variation of the edge classification process is that not all edges are given equal weight in the process of determining the dominant edge label.

• Consider the labeled edge (ur, vq), where ur ∈ S(u) and vq ∈ S(v). The weight of the label of the edge (ur, vq) can be set to J(u, ur) × J(v, vq) in order to give greater importance to edges that are more similar to the target vertices u and v (see the sketch below).

• It is also interesting to note that the above approach can easily be extended to the k-way case by defining a separate Jaccard coefficient on a class-wise basis, and then aggregating in a similar way, as discussed above.
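
The differential-weight variant changes only the voting step of the earlier classify_edge sketch; here sim stands for the Jaccard similarity J of Eq. (3), and the function name is again our own:

```python
from collections import defaultdict

def classify_edge_weighted(u, v, S_u, S_v, labeled, sim):
    """Variant of the voting step in which each labeled edge (ur, vq)
    votes with weight J(u, ur) * J(v, vq) instead of weight 1."""
    votes = defaultdict(float)
    for ur in S_u:
        for vq in S_v:
            label = labeled.get((ur, vq), labeled.get((vq, ur)))
            if label is not None:
                votes[label] += sim(u, ur) * sim(v, vq)
    return max(votes, key=votes.get) if votes else None
```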
Sparse Labeling

• In real-world cases, the edges are often labeled rather sparsely, which makes it extremely difficult to compute similarities with only the labeled edges.

• In such cases, a separate component of the similarity between vertices u and v, J0(u, v), is computed with only the unlabeled edges, just like the Jaccard coefficients J+(u, v) and J−(u, v) on the positive and negative edges, respectively:

J(u, v) = f+(u, v) · J+(u, v) + f−(u, v) · J−(u, v) + µ · J0(u, v)   (4)

• Here, µ is a discount factor less than 1, which is used to prevent excessive impact of the unlabeled edges on the similarity computation.
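
In code, Eq. (4) is a one-line extension of the earlier weighted_jaccard and jaccard sketches, assuming an additional dictionary I_unl of unlabeled neighbor sets; the default mu = 0.5 is an arbitrary illustrative choice, since the slides only require µ < 1:

```python
def sparse_similarity(u, v, I_pos, I_neg, I_unl, mu=0.5):
    """Eq. (4): add a discounted Jaccard term J0(u, v) computed on the
    unlabeled neighbor sets I_unl to the similarity of Eq. (3)."""
    return (weighted_jaccard(u, v, I_pos, I_neg)
            + mu * jaccard(I_unl[u], I_unl[v]))
```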
Numerical Labels

• The case of numerical class variables in edge regression is similar to that of binary or categorical class variables in edge classification, except that the vertex similarities and the class averaging steps are computed differently in order to account for the numerical class variables.

• Let I+(u) be the set of all vertices incident on a particular vertex u, such that for each v ∈ I+(u), the edge (u, v) has a numerical label whose value is greater than the average of the labeled edges incident on u.

• The remaining vertices incident on u, whose labeled edge values are below the average, are put in another set, I−(u). (A small sketch of this split follows.)

• The similarity values J+(u, v), J−(u, v), and J(u, v) can then be defined in a way analogous to the previous binary edge classification case.
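
A minimal sketch of the split described above, assuming a dictionary adj_labels (our name) that stores the numeric labels of each vertex's labeled incident edges:

```python
def split_by_average(u, adj_labels):
    """Split u's labeled neighbors for edge regression: neighbors whose
    edge label exceeds the average label of u's labeled incident edges
    go to I+(u); the rest go to I-(u). adj_labels[u] maps each labeled
    neighbor v of u to the numeric label of edge (u, v)."""
    labels = adj_labels[u]
    if not labels:
        return set(), set()
    avg = sum(labels.values()) / len(labels)
    I_pos = {v for v, l in labels.items() if l > avg}
    I_neg = set(labels) - I_pos
    return I_pos, I_neg
```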
Classification with Numeric Labels

• In order to determine the numeric label of an edge (u, v), a procedure similar to the previous case is employed. The first step is to determine the closest k vertices S(u) = {u1 . . . uk} to u, and the closest k vertices S(v) = {v1 . . . vk} to v, with the use of the aforementioned Jaccard similarity function.

• Then, the numeric labels of the edges in Al ∩ [S(u) × S(v)] are averaged.

• As in the case of edge classification, different edges can be given different weights in the regression process; a sketch follows.
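
A sketch of the regression step, using the same illustrative names as the earlier classification sketches:

```python
def predict_numeric(u, v, S_u, S_v, labeled, sim):
    """Edge regression: a similarity-weighted average of the numeric
    labels of the edges in Al intersected with S(u) x S(v). With
    sim always returning 1, this reduces to the plain average."""
    num = den = 0.0
    for ur in S_u:
        for vq in S_v:
            label = labeled.get((ur, vq), labeled.get((vq, ur)))
            if label is not None:
                w = sim(u, ur) * sim(v, vq)
                num += w * label
                den += w
    return num / den if den > 0 else None
```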
Challenges

• The main problem with the exact approaches discussed in the previous section is that they work most efficiently when the entire graph is memory-resident.

• However, when the graph is very large and disk-resident, the computation of pair-wise vertex similarity is likely to be computationally expensive, because all the adjacent edges of different vertices need to be accessed, resulting in many random accesses to the hard disk.

• In many real-world dynamic cases, the graph on which edge classification is performed is no longer static but evolving at a fast speed.
Basic Approach

• Use a synopsis structure to summarize the graph

• The synopsis structure can be efficiently updated in real time

• The synopsis structure can be leveraged to efficiently perform classification
Broad Approach

• In order to achieve this goal, we will consider a probabilistic, min-hash based approach, which can be applied effectively to both disk-resident graphs and graph streams.

• In this min-hash approach, the core idea is to associate a succinct data structure, termed the min-hash index, with each vertex, which keeps track of the set of adjacent vertices.

• Specifically, the positive, negative, and unlabeled edges (and their incident adjacent vertices) of a given vertex are handled separately in different min-hash indexes; a sketch of the idea follows.
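
A minimal Python sketch of a min-hash index in this spirit; the class name and parameters are our own illustrative choices, and the paper's exact design and theoretical bounds are in the paper:

```python
import random

class MinHashIndex:
    """A per-vertex min-hash signature over its adjacency set. One such
    index would be kept for each of the positive, negative, and
    unlabeled neighbor sets of the vertices."""

    def __init__(self, num_hashes=50, seed=0, prime=2_147_483_647):
        rng = random.Random(seed)
        self.prime = prime
        # One random linear hash h_i(x) = (a_i * x + b_i) mod prime per slot.
        self.coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
                       for _ in range(num_hashes)]
        self.sig = {}  # vertex -> list of running minima

    def add_neighbor(self, u, v):
        """Streaming update: fold the new neighbor v into u's signature."""
        sig = self.sig.setdefault(u, [self.prime] * len(self.coeffs))
        for i, (a, b) in enumerate(self.coeffs):
            h = (a * hash(v) + b) % self.prime
            if h < sig[i]:
                sig[i] = h

    def estimate_jaccard(self, u, v):
        """The fraction of matching signature slots estimates the
        Jaccard coefficient of the two adjacency sets."""
        su, sv = self.sig.get(u), self.sig.get(v)
        if su is None or sv is None:
            return 0.0
        return sum(a == b for a, b in zip(su, sv)) / len(su)
```

Maintaining such indexes per label class lets the similarities of Eqs. (3) and (4) be estimated without accessing the full adjacency lists, which is what makes the approach suitable for disk-resident graphs and graph streams.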
Min-Hash Technique

• The use of the min-hash index is standard in these settings

• Details of the use of the min-hash index are provided in the paper

• Theoretical bounds are provided in the paper


Experimental Results

• Experimental studies demonstrate the effectiveness and efficiency of our proposed edge classification methods in real-world networks.

• We consider a variety of effectiveness and efficiency metrics on diverse networked data sets to show the broad applicability of our methods.
Data Sets

• Epinions (binary)

• Slashdot (binary)

• Wikipedia (binary)

• Youtube (multiple labels)


Metrics

• Accuracy: we randomly select 1,500 edges from each network dataset and examine the average accuracy of classification for these edges w.r.t. the true labels provided in the network.

• Time: we gauge the average time consumed for edge classification for the 1,500 edges selected randomly from within the networks.

• Space: as opposed to the exact methods, which need to explicitly materialize the whole networks for edge classification, the probabilistic methods build space-efficient min-hash indexes, which are supposed to be succinct in size and memory-resident. We record the total space cost (in megabytes) of the min-hash structures in the probabilistic methods, which need not store the entire graphs for edge classification.
Accuracy
(Figure: classification accuracy (%) of the exact and probabilistic methods, ExtUF/ProbUF, ExtWF/ProbWF, and ExtWP/ProbWP, on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.)


Running Time
(Figure: average runtime (sec.) of the exact and probabilistic methods, ExtUF/ProbUF, ExtWF/ProbWF, and ExtWP/ProbWP, on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.)


Effect of the number of nearest neighbors

(Figure: accuracy (%) as the number of top similar neighbors k varies from 40 to 100, for ExtUF, ExtWF, ExtWP, ProbUF, ProbWF, and ProbWP, on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.)


Effect of hash functions
(Figure: effect of the number of hash functions (20, 30, 40, 50) on (a) accuracy (%), (b) average runtime (sec.), and (c) space cost (MB) across the Epinion, Slashdot, Wikipedia, and Youtube networks.)
Effect of the number of unlabeled edges
(Figure: accuracy (%) of ExtWP and ProbWP as the percentage of unlabeled edges t varies from 10% to 70%, on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.)


Conclusions

• New technique for edge classification in networks

• Uses probabilistic min-hash data structures for efficient classification
