
2016 IEEE International Conference on Web Services

An Efficient and Effective Overlapping Communities Discovery based on Agglomerative Graph

Ying Yin1, Liang Chen2, Yuhai Zhao1,3, He Li1, Bin Zhang1, Yongming Yan1
1 College of Computer Science and Engineering, Northeastern University, Shenyang, China
E-mail: yinying@mail.neu.edu.cn; Corresponding author: zhaoyuhai@mail.neu.edu.cn
2 School of Computer Science & Information Technology, RMIT, Melbourne, Australia; liang.chen@rmit.edu.au
3 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China

Abstract—Community discovery is a popular way to solve the personal service recommendation problem and has recently attracted more and more attention from researchers. In practice, communities often overlap with each other, so more and more research focuses on the problem of overlapping community detection. A common drawback of the existing algorithms for this problem is their low efficiency when dealing with large-scale networks. In this paper, we propose a graph-compression-based overlapping community discovery algorithm, which greatly enhances the power of handling large networks even on a single computer. First, a graph-compression-based social network model, namely the agglomerative graph, is introduced, which is a lossless compression of the original network. Then, inspired by the idea of iteration based on selected seeds, the algorithm expands the selected seeds into communities by iteratively optimizing the proposed community fitness function. Finally, it merges communities of high similarity with each other to get the final results. Since the network is losslessly compressed and massive redundant computations are avoided, the results can be obtained exactly, in an efficient and effective way. Experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of the proposed method in detecting overlapping communities over large-scale networks.

Keywords-overlapping community; social network; service recommendation; agglomerative graph

I. INTRODUCTION

With the development of network technology, especially the emergence of social networking sites like Facebook 1 and Renren 2, more and more individual users join the networks built by these virtual relationships, and social networks have developed rapidly [1]. Web service community discovery divides the social network into several independent sub-communities by mining and analyzing the relationships between users [2]. The goal of Web service community discovery is to find a set of users similar to the target user and thereby realize personal recommendation [3-6]. For example, community discovery in e-commerce could help to establish a personalized service based on the structure of the community. Overlapping communities are a universal phenomenon in social networks, and overlapping community structure has practical significance [7]. On the one hand, the overlapping part is the key "bridge" in the network, through which overlapping communities can make contact. On the other hand, overlapping communities, as a cover of the network, reflect the real network structure more faithfully and have guiding significance in studying the real structure of network topology. Therefore, overlapping community discovery has become a new hotspot in community discovery.

There is much related work on overlapping community discovery. For example, Palla et al. [8] proposed the first overlapping community discovery algorithm, CPM, based on k-clique percolation, in 2005. Shen et al. [9] proposed a community discovery algorithm based on merging highly similar factions; Lee et al. [10] obtained the overlapping community structure through greedy expansion of factions. Evans and Lambiotte [11] came up with a method based on transforming the graph into its line graph: the original graph is transformed into a line graph, the nodes of the line graph are clustered, and the clustering result is mapped back to the original graph. Since nodes belonging to different communities in the line graph correspond to edges of the original graph, these edges may be joined by the same node; such nodes are then regarded as the overlapping nodes. Lim et al. proposed an improved algorithm, LinkSCAN [12]: when the nodes of the line graph are clustered, the similarity between nodes is calculated in the original graph. However, the computational complexity of the current mainstream overlapping community discovery algorithms is too high, and their efficiency is low on large dense networks.

This paper studies an algorithm that increases the scale of network a single machine can process for large-scale complex networks while guaranteeing accuracy.

The main contributions of this paper are as follows: (1) designing a graph-compression-based agglomerative graph; (2) proposing an algorithm called CLEAR, which performs community detection directly on the agglomerative graph; (3) verifying the effectiveness and high efficiency of the proposed algorithm.

1 http://www.facebook.com/
2 http://www.renren.com/

978-1-5090-2675-3/16 $31.00 © 2016 IEEE
DOI 10.1109/ICWS.2016.99
The remainder of the paper is organized as follows: Section II details the overlapping community detection algorithm CLEAR, which is based on the agglomerative graph; Section III gives and analyzes the experimental results; finally, Section IV summarizes the paper and points out directions for further work.

II. OVERLAPPING COMMUNITIES DISCOVERY ALGORITHM

A. The agglomerative graph construction

Because a large-scale dense network contains a huge amount of node and edge information, existing overlapping community detection algorithms suffer from high complexity and low computational efficiency on such networks. To address this problem, the paper first puts forward a graph-compression-based model for overlapping communities, namely the agglomerative graph. Fig. 1 shows an original network and the compressed agglomerative graph. Due to the large number of nodes and edges, it is hard to discover and understand the interaction between patterns in the original network. However, the network becomes smaller in size after it is converted into the agglomerative graph.

Fig. 1 The original network and the compressed agglomerative graph

The model not only compresses large-scale graphs losslessly, but also permits overlapping community detection directly on the compressed graphs, without decompressing them. Agglomerative nodes are obtained by clustering nodes of the original graph: the similarity of the nodes' neighbors determines which nodes are chosen to form the same agglomerative node. In particular, given a set of nodes in the original graph, if these nodes share common neighboring nodes, then they constitute an agglomerative node in the corresponding agglomerative graph. Based on this metric function, we can use hierarchical clustering to produce the final agglomerative nodes.

From the generation process of the agglomerative graph, agglomerative nodes are generated based on the topology of the original graph and retain the node information of the original graph; at the same time, the agglomerative edges represent all the edges of the original graph, so there is no information loss either. The agglomerative graph is therefore a lossless compression of the original graph.

B. Overlapping communities discovery algorithm (CLEAR)

Given an agglomerative graph G'=(V', E'), the overlapping communities discovery algorithm (CLEAR) does not need to decompress it and can directly detect overlapping communities in the agglomerative graph. The algorithm is made up of three main steps: (1) selecting a number of agglomerative nodes as “seeds”; (2) continually optimizing the community fitness function to expand each “seed” into a community; (3) merging communities with high similarity to obtain the final overlapping communities. We describe the three steps in detail below.

C. The selection of “seeds”

The first step is to select suitable nodes as “seeds”. Based on the above idea, in the CLEAR algorithm we choose “great factions” as “seeds”. A “faction” is a complete subgraph. As already mentioned, the “faction” structure represents the most strictly defined community structure. A “great faction” is a faction that is not contained in any other faction in the network (that is, a maximal clique). This choice is made for two reasons. On the one hand, a “great faction” is itself a strictly defined community structure that is tightly linked inside and is suitable to act as the core of a community, so that it gradually absorbs its surrounding neighboring nodes to form the community. On the other hand, because the CLEAR algorithm is applied to the agglomerative graph, and a “great faction” corresponds exactly to the “faction” structure model of the agglomerative graph, we do not need to traverse all the nodes in the network when looking for “seeds”. We only need to find the “faction” structures that meet the “seed” condition in the agglomerative graph, which greatly reduces the search time for “seeds”.

D. “Seed” extension

“Seed” extension involves a very important concept, namely the community fitness function, which drives the expansion of a “seed” into a community. Let C denote the set of all nodes in a subgraph S of G. If deleting any element of C cannot increase the community fitness function F of S, and adding any neighboring node of S to C cannot increase F, then S is a community.

Fig. 2 The extension process of “seed”

As shown in Fig. 2, the process of “seed” extension can be described as: (1) For each neighboring node v of S, calculate the contribution value of v to S, that is, the change in the community fitness of S after v joins. (2) Select the node v_max that contributes most to S. (3) If the
contribution value of v_max to S is positive, add it to S and return to (1); otherwise stop expanding and return S.

The extension process is like raindrops falling, a dynamic scene that spreads slowly from the center to the surrounding areas. Every “seed” is expanded in this way. When the extensions of two “seeds” overlap, we find the overlapping nodes of the two communities. In the extension, the community fitness function and the selection of “seeds” are not fixed: different community fitness functions and different “seed” selection methods can be adopted according to the circumstances.

E. Community merging

In the process of expanding the “seeds” into communities, some communities may overlap to a high degree, and one community may even contain another. Given these two situations, we merge communities whose overlapping degree is high and communities that completely cover other communities. One of the simplest measures of the overlapping degree is the ratio between the number of overlapping nodes of the two communities and the number of all the nodes in the two communities.

In the merging process, given a threshold e, two communities S and S' are merged when δ(S,S') ≥ e, where δ(S,S') is the overlapping degree of S and S'. After all the communities whose overlapping degree reaches the threshold e have been merged, the final communities of the network are obtained. Nodes that are contained in multiple communities are called overlapping nodes, and communities that contain overlapping nodes are called overlapping communities.

F. CLEAR Algorithm

Next, the CLEAR algorithm is described in detail. The inputs of the algorithm are the agglomerative graph G'=(V',E') of the original graph G, a threshold k that specifies the size of a seed, a parameter that controls the scale of a community, and a threshold e for the overlapping degree. The output is the community structure of the whole network.

The CLEAR algorithm finally returns a series of node sets C1, C2, ..., which give the nodes contained in each community. If some nodes are contained in multiple communities, these nodes are overlapping nodes. Communities that contain overlapping nodes are overlapping communities.

III. EXPERIMENTS

This section analyzes and validates the CLEAR algorithm in terms of effectiveness and efficiency. The algorithm is written in C++, and all experiments are run on an HP PC with a 2.33 GHz CPU and 4 GB of memory running Windows 7. The compared algorithms are three classic community detection algorithms: LFM, LinkSCAN and CPM.

A. Datasets

The experimental data include artificial data sets and real data sets. The artificial data are generated with the LFR benchmark network generator provided by [13] by adjusting its parameters. LFR is a recognized benchmark used to test community detection algorithms. We use two LFR benchmark networks, with 1000 and 5000 nodes respectively; the specific parameter settings are shown in Table 1. N is the number of nodes in the network; k is the average degree of the nodes; Cmin is the number of nodes in the smallest community; Cmax is the number of nodes in the largest community; mix is the mixing parameter, the ratio of the number of outside edges to the total number of edges; On is the number of overlapping nodes; Om is the number of communities to which each overlapping node belongs.

Table 1 LFR parameter settings of the benchmark networks
Artificial data   N      k    Cmin   Cmax   mix        On/N   Om
LFR1              1000   20   10     50     0.1-0.85   0.2    2-5
LFR2              5000   20   20     100    0.1-0.85   0.3    2-5

This article uses two real data sets, since results obtained on only one type of test network would not be convincing.

Table 2 Real data sets
Real data sets   #nodes   #links   <C>
Facebook         4039     88234    0.6055
Amazon           334863   925872   0.3967

Table 2 describes the two real data sets: #nodes is the number of nodes in the data set, #links is the number of edges, and <C> is the average clustering coefficient, which represents the degree to which the nodes of the network cluster together. Both real data sets can be downloaded from [14].

This section analyzes the effectiveness of the CLEAR algorithm based on the overlapping modularity Mov [15]. When the real community structure is unknown, Mov is an evaluation standard for measuring the quality of a community partition. The value of Mov is between -1 and 1; the bigger the value, the better the partition.

Fig. 3 Results of accuracy

The Mov values obtained on the two real data sets, “Facebook” and “Amazon”, are compared in Fig. 3. We can see that the modularity obtained by the CLEAR algorithm is the highest. That is, for networks of
different sizes, the CLEAR algorithm has better validity. The CPM algorithm defines community structure too strictly, which leads to low accuracy, while the LinkSCAN algorithm is prone to producing too many unnecessary small communities, which leads to too many overlapping communities.

(a) overlapping quality vs. N (b) overlapping quality vs. k
Fig. 4 Results of overlapping quality

Next, we use the F-score [12] to test the accuracy of the overlapping nodes detected by the algorithms. The F-score requires the real values to be known and measures the similarity between the real values and the experimental values. Fig. 4 shows the F-score values of the four algorithms as the parameter k changes. We can see from Fig. 4 that, regardless of whether the number of nodes is 1000 or 5000, the accuracy of the overlapping nodes detected by the CLEAR algorithm is always better than that of the other three algorithms. Because the CPM algorithm defines community structure too strictly, the accuracy of the detected overlapping nodes is too low when the network density is small. The LinkSCAN algorithm is prone to producing too many unnecessary clusters, which leads to too many overlapping communities and indirectly affects the accuracy of the overlapping nodes. The F-score values of the four algorithms rise as the value of parameter k increases. The network changes from sparse to dense as k varies from 5 to 20. The larger k is (up to 20), the easier it is to recognize the community structure and thus the more accurately the overlapping nodes are detected. We can also see from Fig. 4 that the CLEAR algorithm is applicable to a much wider range of network densities: even when k equals 5, the overlapping nodes detected by the CLEAR algorithm still have good accuracy.

IV. CONCLUSIONS

Overlapping community detection has important guiding significance for service recommendation based on the topology of real networks and is a new hotspot in community discovery. Most existing overlapping community detection algorithms have low computational efficiency and can only be used on small networks. To address this situation, this paper proposes a new approach to detecting overlapping communities: it takes graph compression as the starting point to simplify large-scale networks, by processing the graph models that represent the initial social networks, and then directly executes overlapping community detection on the compressed graphs, with no need to decompress them. Further, we design the agglomerative graph, a representation model of community structure that is based on graph compression. At the same time, we design the CLEAR algorithm, which can detect overlapping communities directly on agglomerative graphs. Finally, we verify that the algorithm is effective and efficient.

ACKNOWLEDGMENT

Project supported by the National Natural Science Foundation of China (No. 61272182, 61100028, 61572117), State Key Program of National Natural Science of China (61332014), Key Laboratory of Computer Network and Information Integration of Southeast University (K93-9-2014-03B) and Fundamental Research Funds for the Central Universities (N150402002, N150404008).

REFERENCES

[1] Aggarwal, Charu C. Social Network Data Analytics[M]. Berlin, Germany: Springer, 2011.
[2] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang. Online search of overlapping communities. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, 277-288.
[3] Buqing Cao, Jianxun Liu, Mingdong Tang, Zibin Zheng, Guangrong Wang: Mashup Service Recommendation Based on User Interest and Social Network. ICWS 2013: 99-106.
[4] Liang Chen, Jian Wu, Hengyi Jian, Hongbo Deng, Zhaohui Wu: Instant Recommendation for Web Services Composition. IEEE T. Services Computing (TSC) 7(4): 586-598 (2014).
[5] Qi Yu, Zibin Zheng, Hongbing Wang: Trace Norm Regularized Matrix Factorization for Service Recommendation. ICWS 2013: 34-41.
[6] Yan Wang, Lei Li, Guanfeng Liu: Social context-aware trust inference for trust enhancement in social network based recommendations on service providers. World Wide Web (WWW) 18(1): 159-184 (2015).
[7] S. Fortunato. Community detection in graphs. Physics Reports, 2010, 486(3): 75-174.
[8] Palla, Gergely, et al. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 2005, 435(7043): 814-818.
[9] H. Shen, X. Cheng, K. Cai and M. B. Hu. Detect overlapping and hierarchical community structure in networks. Physica A: Statistical Mechanics and its Applications, 2009, 388(8): 1706-1712.
[10] C. Lee, F. Reid, A. McDaid and N. Hurley. Detecting highly overlapping community structure by greedy clique expansion. Tech. Rep. arXiv:1002.1827, 2010.
[11] T. S. Evans, R. Lambiotte. Line graphs, link partitions, and overlapping communities[J]. Physical Review E, 2009, 80(1): 016105.
[12] Sungsu Lim, Seungwoo Ryu, Sejeong Kwon, Kyomin Jung, Jae-Gil Lee: LinkSCAN*: Overlapping community detection using the link-space transformation. ICDE 2014: 292-303.
[13] https://sites.google.com/site/santofortunato/inthepress2.
[14] http://snap.stanford.edu/data.
[15] A. Lázár, D. Ábel, and T. Vicsek. Modularity measures of networks with overlapping communities[J]. Europhysics Letters, 2010, 90(1): 18001.
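As a supplementary illustration of the agglomerative graph construction in Section II-A, the sketch below shows one simple way to obtain a lossless compression. It is a simplification under stated assumptions, not the construction used by CLEAR: instead of hierarchical clustering on neighbor similarity, it groups only nodes whose closed neighborhoods (the node together with its neighbors) are identical. The function names `agglomerate`, `decompress` and `example_graph`, and the nine-node example network, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def agglomerate(graph):
    """Group nodes with identical closed neighborhoods into agglomerative nodes.

    `graph` maps each node to the set of its neighbors (undirected, no self-loops).
    The grouping rule is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for node, neighbors in graph.items():
        groups[frozenset(neighbors) | {node}].append(node)
    supernodes = [frozenset(members) for members in groups.values()]
    owner = {node: s for s in supernodes for node in s}
    # One agglomerative edge stands for every original edge between two groups.
    superedges = {
        frozenset((owner[u], owner[v]))
        for u in graph
        for v in graph[u]
        if owner[u] != owner[v]
    }
    return supernodes, superedges


def decompress(supernodes, superedges):
    """Rebuild the original adjacency sets from the compressed representation."""
    graph = {node: set() for s in supernodes for node in s}
    for s in supernodes:  # members of one agglomerative node form a clique
        for u, v in combinations(s, 2):
            graph[u].add(v)
            graph[v].add(u)
    for edge in superedges:  # an agglomerative edge links every cross pair
        a, b = tuple(edge)
        for u in a:
            for v in b:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def example_graph():
    """Two 4-cliques {0,1,2,3} and {5,6,7,8} bridged by node 4 (hypothetical data)."""
    graph = {node: set() for node in range(9)}
    edges = list(combinations([0, 1, 2, 3], 2)) + list(combinations([5, 6, 7, 8], 2))
    edges += [(4, 2), (4, 3), (4, 5), (4, 6)]
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


if __name__ == "__main__":
    g = example_graph()
    nodes, edges = agglomerate(g)
    print(len(g), "nodes ->", len(nodes), "agglomerative nodes")
    print("lossless:", decompress(nodes, edges) == g)
```

On this example the nine nodes collapse into five agglomerative nodes and the sixteen edges into four agglomerative edges, and decompression restores the original adjacency exactly, which is the sense in which such a compression is lossless.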

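The “seed” extension of Section II-D and the community merging of Section II-E can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: it runs on a plain adjacency-set graph rather than on an agglomerative graph, the seeds are supplied by hand rather than found as factions, and the fitness function is the LFM-style ratio k_in / (k_in + k_out)^alpha, picked here only because the paper leaves the fitness function open. The names `fitness`, `expand_seed`, `overlap_degree`, `merge_communities` and `example_graph`, the default alpha = 1.0, and the example network are hypothetical.

```python
from itertools import combinations


def fitness(graph, community, alpha=1.0):
    """Assumed LFM-style fitness: k_in / (k_in + k_out) ** alpha."""
    k_in = k_out = 0
    for u in community:
        for v in graph[u]:
            if v in community:
                k_in += 1  # internal edges are counted once from each endpoint
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def expand_seed(graph, seed, alpha=1.0):
    """Greedily add the neighbor with the largest positive fitness gain."""
    community = set(seed)
    while True:
        frontier = {v for u in community for v in graph[u]} - community
        if not frontier:
            return community
        base = fitness(graph, community, alpha)
        # Step (1): contribution of every neighboring node; step (2): pick v_max.
        gain, v_max = max(
            ((fitness(graph, community | {v}, alpha) - base, v) for v in sorted(frontier)),
            key=lambda pair: pair[0],
        )
        # Step (3): stop as soon as the best contribution is not positive.
        if gain <= 0:
            return community
        community.add(v_max)


def overlap_degree(a, b):
    """Shared nodes divided by all nodes of the two communities."""
    return len(a & b) / len(a | b)


def merge_communities(communities, e):
    """Merge pairs that reach threshold e or where one community covers the other."""
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(merged)), 2):
            a, b = merged[i], merged[j]
            if a <= b or b <= a or overlap_degree(a, b) >= e:
                merged[i] = a | b
                del merged[j]
                changed = True
                break  # restart the scan because the list has changed
    return merged


def example_graph():
    """Two 4-cliques {0,1,2,3} and {5,6,7,8} bridged by node 4 (hypothetical data)."""
    graph = {node: set() for node in range(9)}
    edges = list(combinations([0, 1, 2, 3], 2)) + list(combinations([5, 6, 7, 8], 2))
    edges += [(4, 2), (4, 3), (4, 5), (4, 6)]
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


if __name__ == "__main__":
    g = example_graph()
    found = [expand_seed(g, seed) for seed in ({0, 1, 2, 3}, {5, 6, 7, 8})]
    print(merge_communities(found, e=0.5))
```

With the two cliques as seeds, both expansions absorb the bridging node 4, which therefore becomes the overlapping node. The overlapping degree of the two communities is 1/9, below the example threshold e = 0.5, so they are kept as two overlapping communities rather than merged.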