Rokia Missaoui · Idrissa Sarr
Editors
Social Network
Analysis
Community
Detection and
Evolution
Lecture Notes in Social Networks
Series editors
Reda Alhajj, University of Calgary, Calgary, AB, Canada
Uwe Glässer, Simon Fraser University, Burnaby, BC, Canada
Advisory Board
Charu Aggarwal, IBM T.J. Watson Research Center, Hawthorne, NY, USA
Patricia L. Brantingham, Simon Fraser University, Burnaby, BC, Canada
Thilo Gross, University of Bristol, UK
Jiawei Han, University of Illinois at Urbana-Champaign, IL, USA
Huan Liu, Arizona State University, Tempe, AZ, USA
Raúl Manásevich, University of Chile, Santiago, Chile
Anthony J. Masys, Centre for Security Science, Ottawa, ON, Canada
Carlo Morselli, University of Montreal, QC, Canada
Rafael Wittek, University of Groningen, The Netherlands
Daniel Zeng, The University of Arizona, Tucson, AZ, USA
More information about this series at http://www.springer.com/series/8768
Rokia Missaoui Idrissa Sarr
Editors
Social Network
Analysis Community
Detection and Evolution
Editors
Rokia Missaoui
Département d'Informatique et d'Ingénierie
Université du Québec en Outaouais
Gatineau, QC, Canada

Idrissa Sarr
Département de Mathématiques et Informatique
Université Cheikh Anta Diop
Dakar, Senegal
Foreword

Creatures, including humans, animals, and insects, avoid living in isolation and tend to
form communities or societies. Though Ferdinand Tönnies distinguished between a
community and a society in 1887, we may roughly say that a community is a group of
individuals who have agreed or been asked to be together in order to achieve a certain task,
socialize, etc. Communities range from static and closed to dynamic and open. Some
communities are persistent, while others are volatile or ad hoc. Examples of
communities include families, friends, neighbors, schoolmates, and employees working on a
project. Even birds migrate in communities with specific leadership.
Traditionally, the establishment of communities was location-indexed, i.e., it required
individuals to be present in the same location. However, developments in
communication technology triggered a revolution in the way human communities are
established and dissolved. There is a visible rapid shift from physical to virtual
communities, i.e., from expecting individuals within a community to know and see
each other to accepting the ability of individuals to communicate as sufficient to form
a community. The latter trend allows communities to grow and shrink without real
control. However, not all individuals within a community are equal when it
comes to skills and influence. Thus, analyzing communities to identify and study key
individuals, information propagation, evolution, behavior, structure, etc. is essential
for knowledge discovery leading to informative decision-making. The rapid
development of information technology and computing allows researchers
to build scalable solutions capable of handling big data; such an analysis would
otherwise have been impossible. In fact, when the study of social communities started as a
branch of sociology and anthropology, applications and discoveries remained
limited, mainly because researchers concentrated on small communities. These
restrictions were lifted once the ability to communicate became the only requirement,
which raised the need for the study
and analysis of large communities. In other words, earlier studies concentrated on
physical communities, whereas virtual communities now exist, evolve,
and dominate. Realizing the need to handle evolving communities, researchers
from various fields, including computer science, mathematics, statistics, physics, and
many other domains have joined efforts to develop new and more powerful techniques
capable of accomplishing various types of studies related to communities. A number
of new contributions and discoveries are well described in this volume, titled Social
Network Analysis: Community Detection and Evolution, edited by two leading
researchers, Prof. Rokia Missaoui and Dr. Idrissa Sarr.
This volume is indeed unique in its coverage and in the background of the elite
community of authors who have written its various chapters. Some of the important
topics covered include the study of complex networks, from understanding group
cohesion to group detection to inter-network community evolution, as well as
dealing with information propagation without relying entirely on the link structure
of social networks. The key novelty of the latter approach lies in the ability to mine the
published messages within a microblog platform and to extract the hidden topics to
identify the seed users. The volume also discusses the notion of consensual
communities and shows that they do not exist within a random graph, yet another
piece of evidence in support of the targeted formation of communities. Online communities
and behavior are also discussed, with emphasis on dating sites, to understand how
user attributes can help predict who will date whom, and hence provide a
recommendation system for online dating websites. Further, a group of authors discuss the
modeling and visualization of hierarchical structures in large organizational email
networks. The evolution of groups and communities on Twitter is also tackled by
employing a technique that mixes natural language processing and social network
analysis. Another interesting study covers the influence of social media in the
election process, with a case study on the analysis of tweets related to the Iranian
presidential election. Finally, by combining all these topics related to communities
and evolution, this volume is an attractive source and reference for researchers,
practitioners, and students who want to learn about the latest developments
in the field.
Introduction
Most of the contributions in the present book contain recent studies on community
detection and/or evolution and represent extended versions of a selected collection
of articles presented at the 2013 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining (ASONAM), which took place in Niagara
Falls, Canada, between August 25 and 28, 2013. The topics covered by this book
can be categorized into two groups: community detection and evolution in the first
seven chapters, and two other related topics, namely link prediction and influence/
information propagation or maximization, in the last four chapters.
The discovery of cohesive groups, cliques, and communities inside a network is one
of the most studied topics in social network analysis. It has attracted many
researchers in sociology, biology, computer science, physics, criminology, and so
on. Community detection aims at finding clusters as subgraphs within a given
network. A community is then a cluster where many edges link nodes of the same
group and few edges link nodes of different clusters.
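As a toy illustration of this definition (a sketch with an invented graph and partition, not an example from the book), one can count intra- versus inter-community edges for a given partition:

```python
# Toy illustration: count edges inside vs. between communities for a
# given node partition (hypothetical graph; edges are undirected pairs).
edges = [("a", "b"), ("a", "c"), ("b", "c"),   # cluster 1: a triangle
         ("d", "e"), ("d", "f"), ("e", "f"),   # cluster 2: a triangle
         ("c", "d")]                           # a single bridge edge
community = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}

def edge_counts(edges, community):
    """Return (intra, inter): edges within a community vs. across them."""
    intra = sum(1 for u, v in edges if community[u] == community[v])
    return intra, len(edges) - intra

print(edge_counts(edges, community))  # (6, 1): many intra, few inter
```

A good community assignment maximizes the first count relative to the second, which is what modularity-style objectives formalize.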
A general approach to community detection consists in considering the network
as a static view in which all the nodes and links in the network are kept unchanged
throughout the study. Recent studies focus also on community evolution since most
social networks tend to evolve over time through the addition and deletion of nodes
and links. As a consequence, groups inside a network may expand or shrink and
their members can move from one group to another one over time.
Most of the studies on community evolution use topological properties to
identify the updated parts of the network and characterize the type of changes such
as network shrinking, growing, splitting, and merging. However, recent work has
that the approach tracks the evolution of the network and identifies the perspective
communities, it gives a basic way to identify both active and passive users. The
latter group of users can be seen as churners in customer relationship management
(CRM) applications. Furthermore, mapping perspective communities to an initial
(or important) network adds new links that improve the network accessibility, and
hence the information flow circulation.
Chapter titled Study of Influential Trends, Communities, and Websites on the
Post-election Events of Iranian Presidential Election in Twitter by Seyed Amin
Tabatabaei and Masoud Asadpour analyzes 1,375,510 tweets of Twitter users who
were interested in the Iranian presidential election and its post-election events. The top URLs that
appeared in the tweets indicate that the most influential websites are those related to
social networking and social media websites. Important keywords used in the tweets
during nine days are extracted and the most popular websites among two distinct
groups of users (Persian- and English-speaking users) are found. These groups
represent the core part of the network and help in interacting with the outside world to communicate
the news, events, and messages. Peripheral users are identified, as well as a few
subcommunities within the groups. The specification of subcommunities (i.e., the
supporters of political groups) is done based on the keywords extracted from the
tweets using a customized version of TF-IDF. Another result shows a strong link
between the posted tweets and the political events that occurred the same day.
Chapter titled Entanglement in Multiplex Networks: Understanding Group
Cohesion in Homophily Networks by Benjamin Renoust, Guy Melançon, and
Marie-Luce Viaud deals with group cohesiveness in complex networks, mainly in
bipartite graphs. The authors use the homophily concept to assess similarity
between actors and the homogeneity of the groups they form. The key idea is that attributes
are exploited while investigating how they interact. In other words, authors focus on
measuring the cohesion of a group through the interactions that take place between
attributes of actors. Hence, actor behavior is used to measure the intensity of
interactions and group cohesiveness. Therefore, it can be stated that interactions
between actors are a key element to identify group structure and cohesiveness.
Instead of projecting a bipartite network onto a single-type network with entities of
the same type, which can lead to a loss of information or hide subtle characteristics of
the original data, the authors propose to directly study the multiplex networks. By
doing so, they demonstrate the feasibility of detecting community structure within
complex networks without the need to compute one-mode projections.
Chapter titled An Elite Grouping of Individuals for Expressing a Core Identity
Based on the Temporal Dynamicity or the Semantic Richness by Billel
Hamadache, Hassina Seridi-Bouchelaghem, and Nadir Farah is related to group detection
and especially to core identification in social networks. The core of a network can
be seen as a central part having a high influence on the communication flows that
involve the other nodes. Basically, the work can be seen as another contribution to
existing studies in group detection by adding the semantic and temporal
dimensions. In fact, the temporal dynamic behavior or semantic concepts of social entities are
an additional input to exploit in order to significantly characterize and strengthen a
group structure and highlight its cohesiveness. The key idea of this work is that
actors of a social network are likely to change their interactions over time by adding
or removing relations with others. This has an impact on their social position in the
network and/or their possible affiliation to one or more social groups. The temporal
change is in fact induced by many factors influencing actor behavior. Therefore,
using a semantic dimension such as the connection causality, the positive opinion of
socializing, and relationship kinds may help gauge the shape of groups and their
cohesiveness.
Chapter by Romain Campigotto and Jean-Loup Guillaume on The Power of
Consensus: Random Graphs Have No Communities defines the notion of
consensual communities and shows that they do not exist within a random graph. The
principle exploited by the authors is that the outcome of multiple runs of a
nondeterministic community detection algorithm is certainly more significant than the
outcome of a single run. The authors define a consensual community as a set of nodes
that are frequently classified in the same community through multiple
computations. In other words, a consensual community is a repeatable outcome (set of
communities) obtained from a set of community detection algorithm computations.
The main reason for using consensual communities rather than classical
communities comes from the fact that most techniques used to compute communities can
usually provide more than one solution. This may depend on the initial
configurations or the order in which nodes are considered. Moreover, consensual
communities can provide a deeper insight into the structure of the network, since
they summarize many partitions and encode more information on the structure, such
as figuring out the overlapping communities. However, when considering random
graphs, the authors show that it is quite impossible to find consensual communities.
The reason is that all pairs of nodes have the same probability of being connected in
random graphs. Furthermore, the authors demonstrate, through various community
detection algorithms, the existence of a threshold beyond which a trivial consensual
community containing all the nodes is found and below which each node forms a
consensual community.
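The co-classification idea behind consensual communities can be sketched as follows (invented run outputs; a real pipeline would use the actual partitions produced by a nondeterministic algorithm such as Louvain):

```python
from itertools import combinations

# Hypothetical outputs of three runs of a nondeterministic community
# detection algorithm: each run maps node -> community label.
runs = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 1, "b": 1, "c": 0, "d": 0},
]

def co_classification(runs):
    """Fraction of runs in which each node pair shares a community."""
    freq = {}
    for u, v in combinations(sorted(runs[0]), 2):
        same = sum(1 for labels in runs if labels[u] == labels[v])
        freq[(u, v)] = same / len(runs)
    return freq

freq = co_classification(runs)
# Pairs whose frequency exceeds a chosen threshold form the consensual
# communities; on a random graph, no pair stands out this way.
print(freq[("a", "b")])  # 1.0: a and b are consensually together
```

Thresholding this pairwise frequency matrix yields the consensual communities; the chapter's result says that on random graphs the frequencies carry no usable signal.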
The remainder of the book covers a few use cases of community structures that
address other issues in social network analysis, namely link prediction and influ-
ence/information propagation and maximization.
Link Prediction
This important topic in social network analysis aims at predicting if two given
nodes have a relationship or will form one in the near future. It is exploited in many
social media applications such as the ones that need an embedded recommender
system to suggest new and relevant ties to the users. Like in community detection,
similarity and proximity principles are widely used for link prediction. Moreover,
information about network communities can improve the accuracy of similarity-
based link prediction methods.
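A minimal similarity-based predictor of this kind scores each unconnected pair by its number of common neighbors (a sketch on an invented graph; community membership could be added as a further signal):

```python
from itertools import combinations

# Hypothetical undirected graph as an adjacency dict.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def predict_links(adj):
    """Score every unconnected pair by its number of common neighbors,
    most promising candidates first."""
    scores = []
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores.append(((u, v), len(adj[u] & adj[v])))
    return sorted(scores, key=lambda item: -item[1])

print(predict_links(adj)[0])  # (('a', 'd'), 2): the most likely new tie
```

A recommender would surface the top-scoring pairs as suggested ties; Jaccard or Adamic-Adar scores are common drop-in replacements for the raw count.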
the members of the communities, and urging them to take the prescribed action.
The results show that the government agencies had limited participation on Twitter
during the 2011 Japan Tsunami compared to an extensive participation during 2012
Hurricane Sandy. The behavior of Twitter users during both events was consistent with
the issuance of actionable information (i.e., warnings). The findings suggest higher
cohesion among the virtual community members during the 2011 Japan Tsunami than
during the 2012 Hurricane Sandy event. However, during both events members displayed
an agreement on required protective action (i.e., if some members were propagating
messages to take action, the other members were taking action). Additionally, higher
differentiation of leadership roles was demonstrated during 2012 Hurricane Sandy,
with a stronger presence of official sources in leadership roles.
1 Introduction
applying the methodology. The results are then described in detail in the following
section. The paper concludes with a discussion of contributions and suggestions for
future research.
2 Related Work
Social media has been used by the public as well as governmental and non-
governmental organizations during emergencies. Some examples of the use include
rapid information dissemination of one's well-being, as was demonstrated by the
researchers in [15]. In Haiti, the U.S. government was able to utilize social media, such
4 Y. Tyshchuk et al.
as wikis and workspace-sharing media, as a knowledge-based system [40]. The
researchers in [35] were able to develop a unique annotation, which facilitated the
emergence of the digital volunteers. Social media provides a natural environment for
facilitating decentralized coordination for onsite field response teams [34]. During
the 2011 Japan Tsunami, people utilized Twitter for information milling, warning
propagation, providing information about recovery efforts, and emotional support [36].
The traditional approach for first story detection uses a term vector to represent each
document (e.g., a news article) [1, 2]. Each new document is then compared with
the previous ones, and if its similarity with the closest document is below a threshold,
it is declared to be a new story. However, this approach is not feasible for large data
sets (e.g., tweets) because of its high computational cost. A computationally better
approach for first story detection task utilizes locality-sensitive hashing (LSH) with
a variance reduction strategy [28]. This method can achieve similar performance
while gaining more than an order of magnitude speedup compared with the system
previously described in [2]. Experiments using this method were conducted on large
streaming Twitter data sets and achieved reasonable results. In this paper, the above-
described approach is used for first story detection in tweets. Given a large amount
of tweets sorted in the timeline, we apply LSH to group similar tweets together and
identify all the tweets that discuss a new bit of information. In addition, we also link
later tweets to the previous ones if they discuss the same bit of information,
in order to generate information clusters.
3 Methodology
3.1 Overview
An overview of the approach taken in this paper is illustrated in Fig. 1. First, data
was collected via the streaming Twitter API during the time of an emergency. Then the
data was processed using a Support Vector Machine (SVM) based on-topic/off-topic
binary classifier to extract tweets related to the emergency. Note that the on/off-
topic classification was conducted on the 2011 Japan Tsunami event only. The 2012
Hurricane Sandy data set was collected using hashtags #Sandy and #Hurricane;
therefore, all tweets were on-topic. Next, a selected set of search terms was used to
annotate the tweets with actionable events: propagate the warning, seek
information or confirmation, and take prescribed action. To overcome the unstructured
format of the tweets' text, an appropriate set of NLP techniques was used. The
annotation was further enriched through the assignment of attributes to each tweet: polarity
and modality. This was accomplished via SVM-based event attribute classification.
Subsequently, the first story analysis was conducted using the Locality-Sensitive Hashing
algorithm to detect the information clusters as well as the tweets that first introduced
the information on Twitter.
The timelines were either constructed utilizing data collected from on-site
interviews and publicly available information on the Internet or based on the 24 h
time slices. The timelines were used to construct communication networks for each
time slice. A random walk algorithm was employed to discover communities in
Twitter communication networks by time slice. SNA was used to identify the leaders
The Emergence of Communities and Their Leaders . . . 7
of these communities. The knowledge obtained from NLP about the tweet content
(actions, attributes, first story identification, and story ranking) enabled us to make
inferences about the behaviors of community members and the roles of their leaders.
3.2.1 Terminology
According to the hashtag definition from Twitter, the hashtag symbol, #, together
with a relevant keyword or a phrase in a tweet is used to categorize tweets and allow
them to be displayed more easily in Twitter Search. Also, popular hashtagged words
are often characterized as trending topics.
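For illustration, a simplified hashtag extractor might look as follows (a sketch; Twitter's actual tokenization rules for hashtags are more elaborate than this pattern):

```python
import re

# Simplified hashtag extraction: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def hashtags(tweet):
    """Return lower-cased hashtags found in a tweet's text."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]

print(hashtags("Landfall tonight. Stay safe! #Sandy #Hurricane"))
# ['sandy', 'hurricane']
```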
After filtering out off-topic tweets, we developed a bootstrapping framework to predict
actionable events. To expand the keyword seeds, we followed the cross-lingual
event trigger clustering approach described in [24] to discover words with similar
meanings. The algorithm exploited the idea that if two words w1 and w2 on the
source side of bi-lingual parallel corpora were aligned to the same word on the target
side with high confidence, they should have similar meanings. For each English
keyword seed, the search was to find other English words that shared the same frequently
aligned Chinese terms and vice versa. The word alignment information between
each bi-lingual sentence pair was obtained by running Giza++ [27]. To eliminate
the noise introduced by automatic alignment, we filtered out stop words and those
English-Chinese word alignment pairs with frequency (in parallel corpora) less than
a threshold. Finally, we used each expanded keyword set as keywords to retrieve
actionable events.
gap between news and tweets: (1) lexical features including unique words, lower-
case words, lemmatized words, and part-of-speech tags; (2) N-gram features, where
an n-gram ng (n = 1, 2, 3) was selected as an indicative context feature if it matched
one of the following two conditions: (i) ng appeared only in one class, and with a
frequency higher than a threshold; and (ii) the probability of ng occurring in one
class was higher than a threshold, where both thresholds were optimized on a small
development set including 30 events; and (3) dictionary features, such as expression,
consideration, subjective, intention, condition, and negation, were used.
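The two n-gram selection conditions can be sketched as follows (toy documents and illustrative thresholds; the chapter tuned its thresholds on a 30-event development set):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def indicative_ngrams(docs_by_class, n, min_freq=2, min_prob=0.9):
    """Select n-grams that (i) occur in only one class with frequency
    >= min_freq, or (ii) occur in one class with probability >= min_prob.
    The thresholds here are illustrative placeholders."""
    counts = {c: Counter(g for doc in docs for g in ngrams(doc, n))
              for c, docs in docs_by_class.items()}
    total = Counter()
    for c in counts:
        total.update(counts[c])
    selected = set()
    for g, tot in total.items():
        for c in counts:
            f = counts[c][g]
            exclusive = f == tot and f >= min_freq   # condition (i)
            probable = f / tot >= min_prob           # condition (ii)
            if exclusive or probable:
                selected.add(g)
    return selected

# Toy on-topic vs. off-topic token lists
docs = {"on": [["take", "shelter", "now"], ["take", "shelter", "soon"]],
        "off": [["nice", "weather", "today"]]}
print(("take", "shelter") in indicative_ngrams(docs, 2))  # True
```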
The Locality-Sensitive Hashing (LSH) method was used to mitigate the curse of
dimensionality and was applied to the FSD problem [28]. LSH was first proposed by
Indyk and Motwani [17]. The underlying foundation was that if two documents
are close together, then after a projection operation these two documents would
remain close together. In other words, similar documents have a higher probability
of being mapped into the same bucket; thus the collision probability will be higher for
documents that are close to each other. Given an LSH setting of k bits and L hash tables,
two documents x and y collide if and only if

sgn(u_ij · x) = sgn(u_ij · y) for all i ∈ {1, ..., k}, for some table j ∈ {1, ..., L},

where u_ij are randomly generated vectors with components drawn from
a Gaussian distribution, e.g., N(0, 1).
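The collision rule can be sketched with random-hyperplane hashing (a minimal illustration; the dimension and the values of k and L are invented, and real systems tune them):

```python
import random

# Random-hyperplane LSH sketch: k sign bits per table, L hash tables;
# two documents collide iff all k bits agree in at least one table.
random.seed(0)
DIM, K, L = 8, 4, 6  # illustrative parameters
# u[j][i]: random vector for bit i of table j, components ~ N(0, 1)
u = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]
     for _ in range(L)]

def signature(x, table):
    """The k-bit signature of vector x in one hash table."""
    return tuple(int(sum(a * b for a, b in zip(vec, x)) >= 0)
                 for vec in table)

def collide(x, y):
    return any(signature(x, t) == signature(y, t) for t in u)

x = [1.0] * DIM
z = [-v for v in x]  # opposite direction: every sign bit flips
print(collide(x, x), collide(x, z))  # True False
```

Identical vectors always share every signature, while exactly opposite vectors flip all k bits in every table and can never collide, which is the geometric intuition behind the collision probability.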
Algorithm 1 shows the pseudocode of the LSH approach for First Story Detection and
event clustering. All the tweets are sorted in chronological order. A novelty score is then
assigned to document d by Score(d); given a threshold t ∈ [0, 1], if Score(d) ≥ t,
then d is a first story; otherwise, d is clustered with its most similar document that
chronologically appears before it. To calculate the distance between two documents, we
adapt the standard cosine similarity between two vectors:
distance(d, d′) = cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σ_{i=1}^{n} A_i B_i / (√(Σ_{i=1}^{n} (A_i)²) √(Σ_{i=1}^{n} (B_i)²))   (3)
The advantage of LSH is that it only needs to find the nearest neighbor from the
set of documents that were mapped to the same bucket instead of all the previous
tweets. Compared with the brute-force search, the computation cost of the score function
dropped from O(|D_t|) (where |D_t| is the number of tweets that have a time stamp before the
current tweet) to O(1).
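To make Score(d) concrete, here is a brute-force version of the detection loop, without the hashing speedup (invented tweets; novelty is one minus the maximum cosine similarity to any earlier document):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters (Eq. 3)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_stories(docs, t=0.5):
    """Score(d) = 1 - max similarity to any earlier document; d is a
    first story iff Score(d) >= t. Brute force: O(|D_t|) per document."""
    seen, flags = [], []
    for doc in docs:
        vec = Counter(doc.lower().split())
        score = 1.0 - max((cosine(vec, s) for s in seen), default=0.0)
        flags.append(score >= t)
        seen.append(vec)
    return flags

tweets = ["tsunami warning issued for the coast",
          "tsunami warning issued for the northern coast",
          "power outage reported downtown"]
print(first_stories(tweets))  # [True, False, True]
```

The LSH version replaces the inner scan over all earlier documents with a lookup of only the documents that fell into the same buckets.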
The communication network of the Twitter data was constructed using the communication
directional identifiers: @ for directed and mention tweets and RT for re-tweets.
Two relationships were incorporated into the communication network: the directed/
mention relationship and the re-tweet relationship. For the directed/mention relationship, an edge
existed if one user tweeted at and/or mentioned another user. The user doing the
tweeting was at the head of the edge, and the user who was mentioned or to whom the tweet
was directed was at the tail of the relationship. For the re-tweet relationship, the
edge existed if a user re-tweeted another user's tweet. The user who was doing
the re-tweeting was at the tail of the edge, and the user sending the original message was
at the head of the relationship. The network was constructed for each of the time
slices of the event timeline previously discussed. This allowed for investigation of
the evolution and the dynamics of the network. The research evaluated actionable
behaviors on Twitter, therefore, only actionable tweets were utilized to construct the
network. The constructed network is referred to as Twitter communication network
in the following sections.
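A minimal sketch of this edge construction (invented tweets; the orientation follows a simplified reading of the head/tail convention described above):

```python
import re

MENTION = re.compile(r"@(\w+)")

def edges_from_tweets(tweets):
    """tweets: list of (author, text) pairs. Builds directed edges:
    re-tweets point from the original author to the re-tweeter, while
    directed/mention tweets point from the tweeter to the mention."""
    edges = []
    for author, text in tweets:
        is_retweet = text.startswith("RT ")
        for mentioned in MENTION.findall(text):
            edges.append((mentioned, author) if is_retweet
                         else (author, mentioned))
    return edges

tweets = [("alice", "RT @bob: tsunami warning issued"),
          ("carol", "@alice are you safe?")]
print(edges_from_tweets(tweets))
# [('bob', 'alice'), ('carol', 'alice')]
```

Running this per time slice yields one edge list per slice, from which the evolving networks are built.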
Currently, most algorithms cannot handle the directedness of the edges
when detecting communities [20]. To overcome this issue, networks are
often converted into undirected graphs for the purposes of community detection [11].
When Twitter users communicate among each other and direct their messages to
other users the evidence of communication (tweets) is displayed in the profiles of
both users. This allowed us to justify the modification of the network
from a directed to an undirected graph for community detection purposes. The
community-finding approach utilized in the research was a random walk community detection
algorithm. The foundation of the approach lies in the assumption that there are
only a few edges that leave communities. Therefore, the algorithm performs a number
of random walks on the network and then uses those walks to merge the separate
communities in a bottom-up manner [29]. This particular algorithm is most
appropriate for finding communities in large sparse networks, which commonly occur in
Twitter data.
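Walktrap-style methods rate two nodes as similar when short random walks started from them end up in similar places. A minimal sketch of that walk-distribution similarity (exact t-step distributions on an invented graph; not the full merging algorithm):

```python
def walk_distribution(adj, start, t=3):
    """Exact t-step random-walk distribution starting from `start`."""
    dist = {start: 1.0}
    for _ in range(t):
        nxt = {}
        for node, p in dist.items():
            for nb in adj[node]:
                nxt[nb] = nxt.get(nb, 0.0) + p / len(adj[node])
        dist = nxt
    return dist

def walk_similarity(adj, u, v, t=3):
    """1 minus half the L1 distance between t-step walk distributions;
    Walktrap-style methods merge nodes whose distributions are close."""
    du, dv = walk_distribution(adj, u, t), walk_distribution(adj, v, t)
    keys = set(du) | set(dv)
    return 1.0 - 0.5 * sum(abs(du.get(k, 0.0) - dv.get(k, 0.0))
                           for k in keys)

# Two triangles joined by one bridge edge (a classic community test case)
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
       4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
# A walker started at node 1 looks much more like one started at node 2
# (same triangle) than like one started at node 5 (other triangle)
print(walk_similarity(adj, 1, 2) > walk_similarity(adj, 1, 5))  # True
```

The full algorithm greedily merges the pairs (and then communities) with the closest walk distributions, which is why few inter-community edges are the key assumption.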
The social science literature informs the research on the properties of cohesive
groups. It suggests that the people in the same community tend to have similar
and redundant information. Moreover, there is an ease of information transfer in
cohesive groups [7, 30]. In this research, this concept was evaluated in the context
of the Twitter communication network during emergencies. To ascertain whether this
theory of group behavior applies to communications and behaviors on Twitter, the
correlation between community members' behaviors derived from the
Twitter users' behavioral attributes was evaluated. The size of the communities found
in the data enabled us to determine how many people obtained similar information
and shared similar intents. The ten largest communities for each time slice were
evaluated by examining the similarity (correlation) of behavior among the community
members to discover the prevalent behavior.
Once the communities were identified, the task was to find the community leaders.
Each community was taken separately, and its leaders were identified as the
most central/prestigious actors. The centrality/prestige measures utilized
in this research were outDegree, inDegree, betweenness, and eigenvalue centrality
(power). An outDegree centrality measure is simply the number of messages sent by a
Twitter user to other users in the network. An outDegree measure is associated with
faster information diffusion as it reaches more people. In [36], the researchers showed
that people with high outDegree engage in information propagation. An inDegree
measure represents a number of incoming messages sent to a Twitter user by other
users. Another measure of betweenness represents a level of control one user has
over the communication between other users. Users with high betweenness values
serve as information gatekeepers [36]. The betweenness of a node is the number of
shortest paths between any two nodes in the network that have to pass through
this node [37]. A power measure represents the node's connectedness to other central
nodes [6].
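The degree-based measures are straightforward to compute from the directed edge list (a sketch with invented edges; betweenness and power require full graph algorithms, such as Brandes' algorithm and eigenvector centrality, and are omitted here):

```python
from collections import Counter

# Hypothetical directed edge list (source, target) built from actionable
# tweets, standing in for a real time-slice network.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"), ("c", "a")]

out_degree = Counter(src for src, _ in edges)  # messages sent
in_degree = Counter(dst for _, dst in edges)   # messages received

# 'a' sends the most messages (diffuser candidate) and also receives the
# most (information broker candidate, if paired with a high power score)
print(out_degree["a"], in_degree["a"])  # 3 2
```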
Each centrality measure is associated with a different kind of behavior, and users
who score high on each of those measures can represent different types of leadership.
Therefore, three types of leaders are defined: the diffuser, the gatekeeper, and the
information broker. The diffuser is a leader who diffuses the information
through the network. This type of leader is associated with an outDegree measure as
it measures the number of tweets (edges) a node sends out. Another type of leader is a
gatekeeper. A gatekeeper is a node that controls an information flow in the network.
Measures associated with the role of a gatekeeper are betweenness [12, 13] and
power [9]. There are two types of gatekeepers that emerge when betweenness and
power measures are combined: the critical gatekeeper and the unique access gatekeeper [9].
A critical gatekeeper is associated with high betweenness and low power values
whereas a unique access gatekeeper is tied to low betweenness and high power
values [9]. We defined the final type of leader as the information broker, who has
access to valuable information and brokers it to other nodes in the network upon
request. An information broker is associated with high inDegree and high power
measures. A high power measure suggests access to other central actors and the
information they are able to provide. A high inDegree measure suggests a high frequency of
inquiry from other users in the community. The frequency of inquiry for information
can be inferred from the action attribute: seek and obtain confirmation.
Once the community leaders were identified their behavior was evaluated based
on the type of actionable tweets they sent out. That behavior was then compared
to the overall behavior of the community members. For example, when a leader of
the community sent out a warning to evacuate, accompanied by the action
attribute propagate the warning and polarity true, the expected result was for
the community to follow the lead and send out tweets with the action attributes
propagate the warning and/or take a prescribed action and polarity true.
4 Data Description
or received during the time of the events. In addition to the tweet messages, it also
included user names, time stamps, and directed communication identifiers such as
@ for directed messages and RT for re-tweets. The data was stored locally and can be
accessed upon request.
5 Results
For the 2011 Japan Tsunami data set, we were able to annotate 800 hashtags in a very short
time period (1.5 h) and gathered a large number of human annotated tweets (311,735).
As a result, 37 hashtags were annotated as on-topic and the rest were annotated as
off-topic and thus 26,554 on-topic tweets and 285,181 off-topic tweets were gathered
respectively. To balance the training and testing data, we randomly sampled the same
amount of off-topic tweets as on-topic tweets to conduct the experiments. 42,486
tweets were randomly selected for training, and the remaining 10,622 tweets were
used for the blind test. The accuracy of on-topic classification for the 2011 Japan Tsunami
was 81.93 %. The accuracy results for both datasets, 2011 Japan Tsunami and 2012
Hurricane Sandy, for polarity and modality were 96.8 % and 78.4 %, respectively.
The actionable tweets were aggregated per time period to evaluate the results and
compare analyzed data and Twitter user behavior with the timeline of the events.
Table 3 presents the results for the 2011 Japan Tsunami. There is a spike in the volume
of tweets during time slice 4. This is natural, as that's when most of the tsunami
warnings were issued and evacuations were ordered along the affected coastline.
Moreover, it is evident that the receive the warning tweets are prevalent in the earlier
time slices and then gradually drop off as the event concludes. This is a natural
progression and corresponds to the event timeline. The take prescribed action tweets
peak in time slices five, six, and seven after the evacuation orders have been issued.
Finally, the confirmation tweets increase in the later time slices after the warnings
and evacuation orders were issued. Additionally, during the later time slices people
were confirming the well-being of their friends and relatives affected by the event.
Similar results can be seen for 2012 Hurricane Sandy in Table 4. The volume of
receive the warning tweets rises leading up to and peaks on the day of the landfall in
southern New Jersey (October 28). The volume of seek and obtain confirmation and
take the prescribed action tweets rises leading up to and peaks on the day prior to
the landfall. The warnings issued by the government emergency organizations for the
northeastern states required the impacted population to take action on October 29th. The
peaks occurring on Twitter on October 29th for seek and obtain confirmation and
take the prescribed action show that users on Twitter followed the patterns of the
evolution of the event. The analysis shows that the evolution of behaviors extracted
from the NLP action assignments to the tweets correspond to a warning response
process cycle and the overall evolution of both events.
First, the community results for the 2011 Japan Tsunami are evaluated. Table 5 shows
the results produced by the random walk algorithm. Note that time slice (TS)
one was omitted from the results because no communities were discovered during that
time slice. The range in the table represents the size range of the communities;
e.g., for time slice 2 the size of the smallest community was 2 and the size of
the largest community was 11. A higher percentage of communities of size larger
than four ("Percentage of >4 com.") occurs during time slices two, three, and four.
This result is expected, as the users are exchanging recently issued warning
information and confirming prescribed actions.
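The per-time-slice statistics reported in Table 5 (community size range and the share of communities larger than four) can be computed from any community partition. A minimal sketch, assuming communities are given as lists of member ids; in practice the partition would come from a random-walk method such as igraph's `community_walktrap` (Pons and Latapy [29]):

```python
def community_stats(communities):
    """Size range and percentage of communities with more than four members,
    matching the columns reported per time slice in Table 5."""
    sizes = [len(c) for c in communities]
    if not sizes:
        return None  # e.g. time slice one, where no communities were found
    return {
        "range": (min(sizes), max(sizes)),
        "pct_gt4": 100.0 * sum(s > 4 for s in sizes) / len(sizes),
    }

# Hypothetical partition for one time slice (member ids are invented):
# three communities of sizes 2, 3, and 11, so one of three exceeds size four
ts2 = [["a", "b"], ["c", "d", "e"], list("fghijklmnop")]
print(community_stats(ts2))
```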
When the communities and their members were examined more closely, a significant
correlation was found in the behaviors of community members. Over all time slices,
every community had 80 % or more of its members exhibiting exactly the same
behavior, i.e., the same actionable event, modality, and polarity. For those
communities where there was a difference among the members' behaviors, the difference
was in the actionable events, not in modality or polarity. The members usually split into
two groups within the community based on the actionable event: a "warning" group,
those who received and propagated the warning, and a "take action" group, those who
expressed intent to take the prescribed action.
The Emergence of Communities and Their Leaders . . . 17
This finding suggests that people of a community tend to exhibit similar
behaviors. It is important for all members of the
community to share similar polarity in their behavior. For example, if the leader
sends out a message urging people to evacuate (action "propagate the warning" and
polarity positive), the expected result for the rest of the community is to respond
with either the action "propagate the warning" or "take prescribed action" with the
same polarity. When polarity was evaluated among the members of the communities,
only 5 % or fewer of all communities exhibited a difference in polarity among their
members. Additionally, tweets with the "confirmation" actionable event rarely occurred
in the large communities and were more typical of communities of size <4. Moreover,
when the communities were traced from time slice to time slice, little overlap was
discovered between their members. This suggests that the communities formed
on Twitter serve a purpose in each time slice, such as propagating the warning, obtaining
information or confirmation, or exhibiting an intent to take the prescribed action. Once
the action is completed, there is no longer a need to participate on Twitter.
The 2012 Hurricane Sandy event spanned nine days, from its formation on
October 22nd to its completion on October 31st. This timespan allows for higher
participation in information exchange on Twitter. Table 6 shows the results produced
by the random walk algorithm for each of the seven days of collected data (October
25–October 31).
Unlike the 2011 Japan Tsunami, the anticipated impact of 2012 Hurricane Sandy
varied and spanned the entire east coast of the United States. The vast area
First, the leaders of the communities discovered in the 2011 Japan Tsunami were
evaluated. Specifically, only the communities of size larger than four were examined.
It was discovered that the roles of diffuser and gatekeeper were assumed by the
same nodes. Additionally, it was confirmed that the action of "seek information or
confirmation" is a characteristic of communities of size smaller than four; therefore,
the information broker role was taken by a selected set of users in those communities.
As shown in Tables 7 and 8, the ten largest communities for time slice four, when the
critical warning information was issued, were selected for analysis, and the diffuser and
gatekeeper roles were combined and defined as community leaders.
The community leaders were members of the traditional media and were primarily
focused on diffusing the information (action attribute of "propagate the warning"),
while the other community members followed the leaders by either taking
the prescribed action or propagating the warning. When the leaders were issuing
information to evacuate (actionable event "propagate the warning" and polarity
true), the rest of the community followed one of two actions, "propagate the warning"
or "take the prescribed action", with the same polarity. Alongside the lack of overlap
between the communities across the timeline, a significant finding
was the presence of the leaders in all time slices. While the members of communities
participated in the communication only during a particular time slice, the leaders
continued their participation throughout the event. This evidence suggests that Twitter
users were gravitating towards the leaders who were sources of information and at
the same time in control of the information, i.e. diffusers and gatekeepers.
Next, the leaders of the communities were evaluated for the 2012 Hurricane Sandy
event. Only the communities of size larger than four were examined. Two days were
selected to demonstrate the results: October 28th, the day prior to the landfall
in southern New Jersey, and October 29th, the day of the landfall. The finding of
a single leader serving as both diffuser and gatekeeper is consistent across the 2011 Japan
Tsunami and 2012 Hurricane Sandy events. In contrast to the 2011 Japan Tsunami,
the broker-type leader, i.e., a leader with a high InDegree value and a
high share of "confirmation" actionable tweets, was now present in the communities of size
larger than four. This type of leader provided confirmations to other members of
communities on Twitter. The lists of leaders that emerged on the day prior to the
landfall in southern New Jersey and during the landfall, for the top ten communities,
can be seen in Tables 9 and 10.
20 Y. Tyshchuk et al.
As previously discussed, the behaviors of the members of the communities varied
due to the variability of the warnings; however, the peaks and valleys in the distributions
of the aggregated actions of the community members followed the peaks and valleys
of the distribution of the leaders' actions. This finding is demonstrated in Tables 11
and 12 for the day prior to the landfall in southern New Jersey and in Tables 13 and 14
for the day of the landfall.
The tables show "Rec" for "receive the warning", "Seek" for "seek confirmation or
information", "Act" for "take the prescribed action", and (+) for positive polarity and
(−) for negative polarity. These results suggest that community members followed
the actions of their respective leaders. The first story analysis was used to
evaluate the role of the leaders in the communities and to assess the uniqueness of
the information they shared throughout the event. The number of first stories
was aggregated per leader to identify the percentage of unique information
shared by each leader. The result of the analysis suggests that in the days leading
up to the landfall in southern New Jersey, the leaders of the communities were sharing
unique information with their respective communities. During the landfall and the
day after the landfall, the information being shared by the leaders was no longer
unique and consisted of previously transmitted information. Moreover, the most
unique information was being shared by official sources such as MikeBloomberg
and NYCMayorsOffice. This finding suggests that Twitter users who were part of the
communities led by the official sources obtained firsthand information more quickly than
the rest of the users on Twitter.
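The first story analysis above flags tweets that are not near-duplicates of anything seen before. The original streaming approach (Petrovic et al. [28]) uses locality-sensitive hashing; a simplified exact sketch, using Jaccard similarity over token sets and an invented leader timeline, is:

```python
def first_stories(tweets, threshold=0.5):
    """Flag tweets whose maximum Jaccard similarity to every earlier tweet
    falls below the threshold, i.e. tweets carrying novel information."""
    seen, firsts = [], []
    for t in tweets:
        toks = set(t.lower().split())
        sim = max(
            (len(toks & s) / len(toks | s) for s in seen if toks | s),
            default=0.0,
        )
        if sim < threshold:
            firsts.append(t)
        seen.append(toks)
    return firsts

# Hypothetical leader timeline: the near-duplicate second message is not novel
msgs = [
    "evacuate the coast now",
    "evacuate the coast immediately",
    "shelters open at city hall",
]
print(first_stories(msgs))  # ['evacuate the coast now', 'shelters open at city hall']
```

Aggregating the first-story count per leader and dividing by that leader's total tweets gives the percentage of unique information described in the text.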
Two different events were evaluated. The events differ in impact area, time span, and
magnitude of impact. The 2011 Japan Tsunami event spanned just one day, with
very limited time to respond, whereas 2012 Hurricane Sandy spanned nine
days, with much more time to prepare and respond. During the 2011 Japan Tsunami, the
governmental emergency management organizations made limited use of Twitter.
However, the traditional media outlets utilized Twitter extensively to disseminate
warnings. In contrast, during 2012 Hurricane Sandy, local as well as state and federal
governmental emergency management organizations made extensive use of social
media, providing the vast majority of the unique information to Twitter users.
To overcome a lack of knowledge of who the individuals or organizations are that
disseminate warning information, provide confirmations of an event and associated
actions, and urge others to take action, a methodology that combines natural language
processing (NLP) and social network analysis (SNA) was successfully applied to two data
sets collected from Twitter during the 2011 Japan Tsunami and 2012 Hurricane Sandy. The
methodology employed was as follows: (1) assign actionable events to each on-topic
tweet using NLP; (2) construct a communication network of tweets associated with
actionable events; (3) use the network to discover communities with SNA; (4) extract
the leaders of the communities and identify their roles with SNA; and (5) evaluate
the behavior of the community members and their leaders using NLP.
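Steps (2) and (4) of the methodology can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: the tweet log and the handle `@nhc_alerts` are invented, and ranking by weighted in-degree is only a rough proxy for the diffuser/gatekeeper analysis described above.

```python
from collections import defaultdict

def build_network(tweets):
    """Step (2): directed sender -> mentioned-user edges, weighted by count.
    Mentions and retweets both surface as @-prefixed tokens."""
    edges = defaultdict(int)
    for sender, text in tweets:
        for tok in text.split():
            if tok.startswith("@") and len(tok) > 1:
                edges[(sender, tok.strip("@:.,"))] += 1
    return edges

def rank_leaders(edges):
    """Step (4), simplified: rank users by weighted in-degree."""
    indeg = defaultdict(int)
    for (_src, dst), w in edges.items():
        indeg[dst] += w
    return sorted(indeg, key=indeg.get, reverse=True)

# Hypothetical labeled tweets: (sender, text)
tweets = [
    ("alice", "RT @nhc_alerts tsunami warning issued"),
    ("bob", "@nhc_alerts is this confirmed?"),
    ("carol", "RT @nhc_alerts evacuate now"),
]
print(rank_leaders(build_network(tweets))[0])  # nhc_alerts
```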
The analysis demonstrated that the behavior of the Twitter users was
consistent with the issuance of actionable information based on warnings. It was
also discovered that members of the same community demonstrate similar behaviors
when faced with very limited time to respond and diverse behaviors when given
longer time to respond. Additionally, the diversity of the levels of impact and
prescribed actions also facilitated diverse behaviors among the members of the same
communities during 2012 Hurricane Sandy. During the 2011 Japan Tsunami, the leaders
of the communities were typically the traditional media, who propagated the
warnings and urged the other community members to take the prescribed action.
During 2012 Hurricane Sandy, however, the leaders of the communities ranged from
celebrities to specialized organizations (e.g., various weather reporting agencies) and
local, state, and federal emergency management organizations. Moreover, it was
discovered that the leaders maintained their role throughout the entire event, while the
rest of the community members were present only during a selected time period. The
communities formed around the information sources, i.e., the leaders. The leaders
of the communities during 2012 Hurricane Sandy were able to introduce unique
information into the communities; moreover, it was the local official organizations
that introduced the majority of the unique information. The uniqueness of the
information shared by the leaders peaked prior to the hurricane landfall in southern New
Jersey and declined during and the day after the event.
The key contributions of the research consist of insights into human behavior
on Twitter during two major extreme events. The paper showed how extreme events
with different characteristics can prompt different human behavior on Twitter. The
research explored collective human behavior and demonstrated that events that allow
more time to respond and impact larger territories can result in weaker cohesion in
virtual communities on Twitter. The research also showed stronger adoption of
Twitter by official emergency response organizations during 2012 Hurricane Sandy, a
year and a half after the 2011 Japan Tsunami. The official sources are not only adopting
the new technology offered by Twitter, but are also becoming leading information sources
on Twitter, as evident from the leadership and first story detection analyses for 2012
Hurricane Sandy. In future research, the authors will attempt to include additional event
attributes, i.e., location, to better understand the impact of emergencies on
communities. In addition, this will allow us to study the co-evolution of the behavior of the
community and its leaders and the structure of the network throughout an emergency.
It will also provide the means to investigate the flow of actionable information and
its distortion over time.
Acknowledgments This material is based upon work sponsored by the Army Research Lab under
Cooperative Agreement No. W911NF-09-2-0053 (NS-CTA), U.S. NSF under grant
number CMMI 1162409, U.S. NSF CAREER Award under Grant IIS-0953149, U.S. DARPA
Award No. FA8750-13-2-0041 in the Deep Exploration and Filtering of Text (DEFT) Program,
an IBM Faculty award, and an RPI faculty start-up grant. The views and conclusions contained in this
document are those of the authors and should not be interpreted as representing the official policies,
either expressed or implied, of the Army Research Laboratory, DARPA, the National Science
Foundation or the U.S. Government.
References
1. Allan J, Lavrenko V, Jin H (2000) First story detection in TDT is hard. In: CIKM, pp 374–381
2. Allan J, Lavrenko V, Malin D, Swan R (2000) Detections, bounds, and timelines: UMass and
TDT-3. In: Proceedings of topic detection and tracking workshop, pp 167–174
3. Benson E, Haghighi A, Barzilay R (2011) Event discovery in social media feeds. In: ACL,
pp 389–398
4. Billion-dollar weather/climate disasters. In: National climatic data center and national oceanic
and atmospheric administration, 12 January 2014
5. Blair C (2011) Update: Hawaii Tsunami damage in tens of millions of dollars. In: Honolulu
civil beat. 14 March 2011
6. Bonacich P (1987) Power and centrality: a family of measures. Am J Sociol 92:1170–1182
7. Burt R, Lin N, Cook K (2011) Structural holes versus network closure as social capital. In:
Social capital: theory and research. Aldine Transaction
8. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2(3):27
9. Conway D (2009) Social network analysis in R
10. Ewing L (2011) The Tohoku tsunami of March 11, 2011: a preliminary report on effects to the
California coast and planning implications. In: California coastal commission report. Natural
Resources Agency, San Francisco
11. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3):75–174
12. Freeman LC (1979) Centrality in social networks: conceptual clarification. Soc Netw
1(3):215–239
13. Freeman LC (1980) The gatekeeper, pair-dependency and structural centrality. Qual Quant
14(4):585–592
14. Huberman BA, Romero DM, Wu F (2009) Social networks that matter: Twitter under the
microscope. First Monday 14(1):8
15. Hughes A, Palen L (2009) Twitter adoption and use in mass convergence and emergency events.
In: Proceedings of the 6th international conference on information systems for crisis response
and management (ISCRAM), Gothenburg, Sweden
16. Hurricane sandy: timeline. In: Federal emergency management agency. 12 January 2014
17. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of
dimensionality. In: STOC, pp 604–613
18. Ji H, Grishman R (2008) Refining event extraction through cross-document inference. In: ACL,
pp 254–262
19. Kleinberg JM (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov
7(4):373–397
20. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant
communities in networks. PLoS ONE 6(4):e18961
21. LDC: ACE (Automatic Content Extraction) English annotation guidelines for events (2005). http://
projects.ldc.upenn.edu/ace/docs/english-events-guidelines_v5.4.3.pdf
22. Li H, Ji H, Deng H, Han J (2011) Exploiting background information networks to enhance
bilingual event extraction through topic modeling. In: Proceedings of international conference
on advances in information mining and management
23. Li Q, Ji H, Huang L (2013) Joint event extraction via structured prediction with global features.
In: Proceedings of the 51st annual meeting of the Association for Computational Linguistics,
Sofia, Bulgaria, pp 73–82
24. Li H, Li X, Ji H, Marton Y (2010) Domain-independent novel event discovery and
semi-automatic event annotation. In: PACLIC, pp 233–242
25. Lindell M, Perry R (2012) The protective action decision model: theoretical modifications and
additional evidence. Risk Anal 32(4):616–632
26. Mileti D, Sorensen J (1990) Communication of emergency public warnings: a social science
perspective and state-of-the-art assessment. Report prepared for the Federal Emergency
Management Agency, Oak Ridge National Laboratory, Oak Ridge
27. Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput
Linguist 29(1):19–51
28. Petrovic S, Osborne M, Lavrenko V (2010) Streaming first story detection with application to
Twitter. In: HLT-NAACL, pp 181–189
29. Pons P, Latapy M (2006) Computing communities in large networks using random walks.
J Graph Algorithms Appl 10(2):191–218
30. Reagans R, McEvily B (2003) Network structure and knowledge transfer: the effects of
cohesion and range. Adm Sci Q 48(2):240–267
31. Ritter A, Mausam, Etzioni O, Clark S (2012) Open domain event extraction from Twitter. In:
KDD, pp 1104–1112
32. Romero DM, Kleinberg JM (2010) The directed closure process in hybrid social-information
networks, with an analysis of link formation on Twitter. In: ICWSM
33. Romero DM, Meeder B, Kleinberg JM (2011) Differences in the mechanics of information
diffusion across topics: idioms, political hashtags, and complex contagion on Twitter. In: WWW,
pp 695–704
34. Sarcevic A, Palen L, White J, Starbird K, Bagdouri M, Anderson KM (2012) Beacons of hope
in decentralized coordination: learning from on-the-ground medical Twitterers during the 2010
Haiti earthquake. In: CSCW, pp 47–56
35. Starbird K, Palen L (2011) Voluntweeters: self-organizing by digital volunteers in times of
crisis. In: CHI, pp 1071–1080
36. Tyshchuk Y, Wallace WA (2012) Actionable information during extreme events, case study:
warnings and the 2011 Tohoku earthquake. In: SocialCom/PASSAT, pp 338–347
37. Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge
University Press, Cambridge
38. Weng J, Lee BS (2011) Event detection in Twitter. In: ICWSM
39. Yang Y, Pierce T, Carbonell JG (1998) A study of retrospective and on-line event detection.
In: SIGIR, pp 28–36
40. Yates D, Paquette S (2011) Emergency knowledge management and social media technologies:
a case study of the 2010 Haitian earthquake. Int J Inf Manag 31(1):6–13
Hierarchical and Matrix Structures
in a Large Organizational Email Network:
Visualization and Modeling Approaches
Abstract This paper presents findings from a study of the email network of a large
scientific research organization, focusing on methods for visualizing and modeling
organizational hierarchies within large, complex network datasets. In the first part
of the paper, we find that visualization and interpretation of complex organizational
network data is facilitated by integration of network data with information on for-
mal organizational divisions and levels. By aggregating and visualizing email traffic
between organizational units at various levels, we derive several insights into how
large subdivisions of the organization interact with each other and with outside orga-
nizations. Our analysis shows that line and program management interactions in
this organization systematically deviate from the idealized pattern of interaction pre-
scribed by matrix management. In the second part of the paper, we propose a power
law model for predicting degree distribution of organizational email traffic based on
hierarchical relationships between managers and employees. This model considers
the influence of global email announcements sent from managers to all employees
under their supervision, and the role support staff play in generating email traffic,
acting as agents for managers. We also analyze patterns in email traffic volume over
the course of a work week.
This chapter was created within the capacity of US governmental employment. US copyright
protection does not apply.
1 Introduction
In this paper, we present results of our analyses of large organizational email datasets
derived from the email traffic records of Los Alamos National Laboratory (LANL).1
Analyzing such large email datasets from complex organizations poses a number
of challenges. First, considerable work is required to parse large quantities of raw
data from network logs and convert it into a format suitable for network analysis and
visualization. Second, a great deal of care is required to analyze and visualize
network data in a way that makes sense of complex formal organizational structures: in
our case, 456 organizational units that are connected through diverse organizational
hierarchies and management chains. Finally, it can be difficult to sort out the effects
of email traffic generated by mass announcements and communications along man-
agement chains from the more chaotic, less hierarchical traffic generated by everyday
interactions among colleagues.
This paper addresses these complexities in two ways. First, we demonstrate
methods for understanding large-scale structural relationships between organiza-
tional units by using carefully thought-out visualization strategies and basic graph
statistics. Second, we propose a power law model for predicting the degree distri-
bution of email traffic for nodes of large degree that engage in mass emails along
hierarchical lines of communication. This likely characterizes a significant portion of
email traffic from managers (and their agents) to employees under their supervision.
This model goes beyond existing models of node connectivity in organizations by
considering the influence of specific email usage practices of managers.
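The shape of such a heavy-tailed degree distribution is conventionally summarized by its power-law exponent. As an illustrative sketch (the paper's own model is richer, accounting for mass announcements and support staff acting as agents), the standard maximum-likelihood estimator for a continuous power law is:

```python
import math
import random

def powerlaw_alpha(degrees, xmin=1.0):
    """Maximum-likelihood exponent for p(k) ~ k**(-alpha), k >= xmin
    (continuous approximation; the discrete case needs a small correction)."""
    xs = [k for k in degrees if k >= xmin]
    return 1.0 + len(xs) / sum(math.log(k / xmin) for k in xs)

# Synthetic degree sample drawn from a power law with alpha = 2.5,
# via inverse-CDF sampling: x = xmin * (1 - u) ** (-1 / (alpha - 1))
rng = random.Random(0)
sample = [(1 - rng.random()) ** (-1 / 1.5) for _ in range(20000)]
print(round(powerlaw_alpha(sample), 2))  # approximately 2.5
```

Fitting this estimator only to the large-degree tail (managers and their agents) is one way to test the hierarchical-traffic model proposed here against observed email degree data.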
Our motivation for this analysis is primarily sociological, with a focus on
understanding structural relationships among formal organizational divisions and
along defined management chains within a particular organization. Email network
analysis enables us to draw conclusions about the respective roles of different ele-
ments in the organizational hierarchy, beyond what is specified in organizational
charts and management plans. This offers insight into the functioning of the organi-
zation, and could have practical implications for management and communications.
Further, it provides a case study that can be compared to other organizational studies,
and demonstrates a general set of methods that can be employed to gain organiza-
tional insight from email data.
The study of social networks in organizations has a long history, going back at least
as far as the Hawthorne studies of the 1920s, in which anthropological observations
of worker interactions at Western Electric's Hawthorne Works were represented as
networks [11, 20]. The convention of representing social connections as graphs, with
outside coding schemes to bring in new information and ideas, while internal coding
schemes facilitate close working relationships between colleagues. In the laboratory
they studied, Allen and Cohen found that the key mechanism for managing this ten-
sion was to place a limited number of individuals in informal gatekeeper roles. These
gatekeepers had more ties to technical disciplinary communities and colleagues out-
side the laboratory, and more familiarity with the research literature. Being in this
gatekeeper position relative to the outside world also made them preferred sources
of information and advice within the organization. Tortoriello et al., in a more recent
study [33], note that the tight relationships and shared knowledge that individual
organizational units need to function effectively inhibit their ability to interact effectively
with other organizational units. Having a limited number of people in gatekeeper roles
is a mechanism that enables groups to maintain a cohesive identity while preserving
access to important knowledge and information from elsewhere in the organization.
The rise of electronic mail as a central communication mechanism in organizations,
along with extensive archiving of email communications, has created a body of data
that can be used to analyze organizational interactions at very large scales. Auto-
matically collected email data has significant advantages for capturing interactions
among organizational units: although email does not capture all relevant interactions,
it provides comprehensive coverage across the entire organization without the over-
head involved in large-scale survey-based studies. Studies have shown that email
communication patterns generally reflect the underlying social network structure of
an organization [34].
The Enron corpus, released by regulators as part of an investigation into the
company's bankruptcy, is one of the few publicly available email data sets of significant
scope. As such, it has played a key role in the development
of email analysis techniques [5, 8]. However, the Enron corpus is quite small (half a
million messages between 158 individuals) compared to the total email volume of a
large organization. Unfortunately, larger email corpora (like the one analyzed here)
are often not considered publicly releasable, and are accessible only to researchers
internal to the organization in question. For example, [19] describes a very large
network of email communications among Microsoft employees. A key feature
of many of these email studies, which we build upon here, is that they track both
individual-level communications and communications across formal divisions of the
organization. Aggregating relationships based on formal organizational structures
offers an important level of insight, which can be particularly useful for managers
and analysts interested in interactions among business units, capabilities, or functions
rather than individuals.
Hierarchical and Matrix Structures in a Large Organizational Email Network . . . 31
Fig. 1 a Schematic representation of a typical organizational chart for a fully matrixed
organization. Each employee reports to one line and one program manager, and line and
program managers independently report to upper management. b The idealized communication
pattern that results from a. Dotted line indicates less frequent communication. c The actual
communication pattern at LANL, revealed through analysis of email data. (UM = upper
management, PM = program/project management, LM = line management, E = employee.)
Fig. 2 Email traffic between organizational units at LANL, using a force-vector layout. Node
size represents betweenness centrality. Edge color is a mix of the colors of the connected nodes.
Although individual edges are difficult to discern at this scale, the overall color field reflects the
type of units that are most connected in a given region
Fig. 3 Email traffic between organization types at LANL. Node diameter represents total degree
(i.e. total number of incoming and outgoing emails) of the node; edge width represents email volume
in the direction indicated
between these entities along any other path. The operations side of the organization
does not display this pattern, indicating that relationships between groups, programs,
and management are more fluid there. The strength of the ties between technical pro-
gram organizations and both technical groups and technical management, in the
absence of a strong direct tie between technical groups and technical management,
suggests that technical program organizations serve as brokers between these
elements of the organization. This contrasts with the role program organizations play in
a true matrix organization, where they represent an independent chain of command
from line management. The structure of this relationship at LANL is depicted in
Fig. 1c.
Figure 3 also indicates that operations organizations have lower overall volumes
of incoming and outgoing email than technical organizations, even though there are
similar numbers of employees in each category [18]. There could be a number of rea-
sons for this. Operational knowledge may be less complex and more readily codified
than technical knowledge, reducing the need for strong interactional ties. Alterna-
tively, the nature of operational work, which can take place in the field and involve
significant manual labor and use of machinery, may inhibit email communication.
Some workers may not have constant access to email during working hours, and
communication needs may be more localized and readily satisfied by direct personal
interaction. Additional research would be required to fully explore these possibilities.
Another way of understanding the roles different types of organizational units play
is in terms of their relationships with outside entities. Figure 4 plots the number of
emails each type of organization sends and receives to/from commercial versus non-
commercial domains. This indicates that all types of operational units communicate
significantly more with commercial entities, which is probably driven by relation-
ships with suppliers and contractors. Technical groups, technical management, and
administration communicate about equally with commercial and non-commercial
domains. The outlier here is technical programs, which communicate more with
external addresses than any other type of organizational unit, and are much more
highly connected to non-commercial domains.
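The commercial versus non-commercial split underlying Fig. 4 can be approximated by classifying the top-level domain of each external correspondent. A rough sketch, with an invented message log and a deliberately crude TLD rule (the study's actual classification may differ):

```python
from collections import Counter

COMMERCIAL_TLDS = {"com", "biz"}  # crude heuristic, an assumption of this sketch

def domain_class(address):
    """Classify an email address as commercial or non-commercial by its TLD."""
    tld = address.rsplit(".", 1)[-1].lower()
    return "commercial" if tld in COMMERCIAL_TLDS else "non-commercial"

def tally(messages):
    """Count external emails per (organizational unit type, domain class)."""
    counts = Counter()
    for unit_type, address in messages:
        counts[(unit_type, domain_class(address))] += 1
    return counts

# Hypothetical message log: (sending unit type, external correspondent)
log = [
    ("operations", "orders@supplier.com"),
    ("technical program", "pi@university.edu"),
    ("technical program", "grants@agency.gov"),
]
print(tally(log)[("technical program", "non-commercial")])  # 2
```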
These findings suggest that program organizations at LANL occupy the gatekeeper
position described in [1, 33]: they serve as brokers between organizational levels,
as well as a key link between the laboratory and the outside world, particularly
non-commercial entities like academic institutions and other government agencies.
Their position between upper management and technical work organizations may
reflect their role in translating between management coding schemes and those of
technical domain experts, while their position between LANL and external entities
suggests a broader role in translating between internal and external coding schemes.
There are a number of possible applications of this kind of analysis. Studies
have shown that individuals, including managers, are not always accurate in their
perceptions of the structure of informal networks in their organizations, beyond the
individuals with whom they regularly interact [23]. Quantitative network analysis
and visualization can therefore provide significant, data-driven insights that are not
ordinarily available to managers and other employees in organizations. The findings
presented here show that program organizations at LANL have shifted from their
original role as one axis of a management matrix scheme to a role as organizational
gatekeepers. In an organization undergoing this kind of shift, some managers or
workers may not be completely aware of the nature of the change. In that case,
this kind of analysis can provide insights into how to effectively interact with and
make use of program organizations. For example, the manager of an administration
unit could hypothetically fill a structural hole by developing direct contacts with
key program units, in order to gain more insight into the organization's external
relationships. Alternatively, in some organizations, a shift in the nature of program
management might pose problems: for example, if management expects program
managers to play an active role in matrix management, their role as gatekeepers
might conflict with organizational needs. In such a case, analysis and visualization
of network relationships between organizational levels could provide a basis for
accurate organizational assessment and realignment.
Fig. 5 Email network for a two-week period in the smaller group. The size of a node is proportional
to the logarithm of its betweenness centrality. Nodes with different colors correspond to different
communities identified by applying the Girvan-Newman algorithm to the group's email network
[12, 15]. Link widths are proportional to the logarithm of the number of emails exchanged along
these links. The network was visualized by assigning repulsion forces among nodes and spring
constants proportional to the link weights, and then finding an equilibrium state
centrality (Fig. 6). These include administrative assistants, seminar organizers, and
several project leaders. This indicates a flatter, less centralized organizational struc-
ture. In order to explore group structure, we applied the Girvan-Newman community
detection algorithm to each graph [12]. For the first group, this algorithm identified
four communities, the significance of which is not clear to us; for the second group, it
revealed two main communities that correspond to two previous groups that merged
to form the current group. These interpretations could be expanded by use of alterna-
tive centrality measures and comparison of various community detection methods.
Several network types, including biological metabolic networks [31], the World Wide
Web, and actor networks [30], are conjectured to have power-law distributions of
node connectivity. In the case of metabolic networks, the interpretation of scale-free
behavior is complicated by the lack of complete knowledge and the relatively small sizes
(about 10^3 nodes) of such networks, while the mechanisms of self-similarity in many large
social networks are still the subject of debate. However, organizational hierarchy has
been shown to generate degree distributions for contacts between individuals that
follow power laws [2].
Managers prefer to use email to communicate with subordinates in many different
communication contexts [25]. We propose that, in addition to the general effects of
organizational hierarchy, particular email communication practices of managers may
provide an underlying mechanism that generates power law distributions in node con-
nectivity of organizational email networks. To explore this possibility, we develop a
scale-free behavioral model that considers the effects of mass email announcements
sent by managers to subordinates. In this model, the self-similarity of the connec-
tivity distribution of the email network is a consequence of the static self-similarity
of the management structure, rather than resulting from a dynamic process, such
as preferential attachment [26] or optimization strategies [27]. More specifically,
self-similarity is due to the ability of a manager to continuously and directly com-
municate only with a relatively small number of people, while communications with
other employees have to be conveyed in the form of broad announcements.
Suppose that the top manager in an organization sends emails to all employees
from time to time. This manager must correspond to the node in the email network
that has the highest connectivity N. Suppose that the top manager also talks directly
(in person) to l managers who are only one step lower in the director's hierarchy
(let's call them 1st-level managers). Each of those 1st-level managers presumably
controls their own subdivision of the organization. Assuming roughly equal spans of
managerial control, we can expect that, typically, one 1st-level manager sends emails
38 B.H. Sims et al.
to N/l people. In reality, each manager also has a support team, such as assistants,
administrators, technicians, etc., who may also send announcements to the whole
subdivision.
Let us introduce a coefficient a that indicates how many support-team employees are
involved in sending division-wide email announcements on the same scale as their
manager. We can then conclude that, at the 1st level from the top, there are al
persons who send emails to N/l employees at a lower level.
Each 1st-level manager controls l 2nd-level ones, and we can iterate our argument,
leading to the conclusion that there should be (al)^2 managers at the 2nd level who
should be connected to N/l^2 people in their corresponding subdivisions. Continuing
this argument to the lower levels of the hierarchy, we find that, at a given level
x, there should be (al)^x managers (or their proxies) who write email announcements
to N/l^x people in their subdivisions.
Consider a plot that shows the number of nodes n versus the weight of those
nodes, i.e., their out-degree w. From the previous arguments, the weight
w = N/l^x should correspond to n = (al)^x nodes. Eliminating the variable x, we
find

    log(n) = (log(N) - log(w)) log(al)/log(l) ,    (1)
data quality, the simplicity of the model, and the logarithmic dependence of the power
law on some of these parameters [6]. We found that our data for w > 40 could be
well fitted by log(n) ≈ 14.0 - 2.47 log(w) (Fig. 8). If, e.g., we assume l = 4, then
a ≈ 7, i.e., each manager typically has the support of a - 1 = 6 people, who help
her post various announcements to her domain of control. The power law should
terminate at the level of hierarchy x given by (al)^x = N/l^x, which corresponds to
x ≈ 3, i.e., the email network data suggest that there are typically x = 3 managers
of different ranks between the working employee and the top manager of the orga-
nization. The typical number of email domains to which the lowest-rank manager
sends announcements is w_min ≈ N/l^x ≈ 48. This should also be the node degree
at which the power law (1) is no longer justified. Indeed, we find a breakdown
of the power law (1) at w < 40. This estimate also predicts that a typical
working employee receives emails from (x + 1)a = 28 managers or their support
teams.
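The model's prediction can be checked numerically. The sketch below (our own illustrative construction, with N chosen arbitrarily; only l and a follow the text) builds the model's (n, w) points level by level and verifies that the log-log slope equals log(al)/log(l):

```python
import math

# Hierarchical-announcement model: at level x below the top there are
# (a*l)**x announcement senders, each emailing N / l**x subordinates.
N = 10_000  # assumed organization size (illustrative)
l = 4       # span of direct managerial control, as assumed in the text
a = 7       # support-team coefficient estimated in the text

points = [((a * l) ** x, N / l ** x) for x in range(1, 4)]  # (n, w) per level

# The slope of log(n) versus log(w) should be -log(a*l)/log(l) for any pair.
(n1, w1), (n2, w2) = points[0], points[1]
slope = (math.log(n2) - math.log(n1)) / (math.log(w2) - math.log(w1))
print(round(-slope, 2))  # close to the fitted exponent 2.47
```

With l = 4 and a = 7 the model exponent is log(28)/log(4) ≈ 2.40, consistent with the fitted value 2.47.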
Fig. 9 The frequency of non-manager nodes receiving emails from a given number of different
managers during the considered time interval. Managers are defined as nodes sending emails to
more than 45 different addresses
Fig. 10 The number of emails sent per minute (top) and the number of addresses sending email per
minute (bottom) over a one-week time interval
Figure 10 shows total email traffic and the number of addresses sending email over one
week at one-minute resolution. Working days have a bi-modal distribution, with the
heaviest activity at the beginning and end of the day. The lower level of activity
on Friday is related to an alternative work schedule that most LANL employees
follow. This schedule enables employees to take every other Friday off in exchange
for working longer hours Monday through Thursday. As a consequence, only slightly more
than 50 % of the workforce is at work on a given Friday. This is directly reflected in
the amount of email traffic on Fridays.
5 Conclusion
References
1. Allen TJ, Cohen SI (1969) Information flow in research and development laboratories. Adm Sci Q 14(1):12–19
2. Barabási A-L, Ravasz E, Vicsek T (2001) Deterministic scale-free networks. Phys A 299:559–564
3. Bugos GE (1993) Programming the American aerospace industry, 1954–1964: the business structures of technical transactions. Bus Econ Hist 22:210–222
4. Burt RS (1992) Structural holes: the social structure of competition. Harvard University Press, Cambridge
5. Chapanond A, Krishnamoorthy MS, Yener B (2005) Graph theoretic and spectral analysis of Enron email data. Comput Math Organ Theory 11:265–281
6. Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51:661–703
7. Collins HM (1985) Changing order: replication and induction in scientific practice. Sage, London
8. Diesner J, Frantz TL, Carley KM (2005) Communication networks from the Enron email corpus "It's always about the people. Enron is no different". Comput Math Organ Theory 11:201–228
9. Doreian P, Batagelj V, Ferligoj A (2005) Generalized blockmodeling. Cambridge University Press, Cambridge
10. Freeman LC (2009) Methods of social network visualization. In: Meyers RA (ed) Encyclopedia of complexity and systems science. Springer, Berlin, pp 2981–2998
11. Gillespie R (1991) Manufacturing knowledge: a history of the Hawthorne experiments. Cambridge University Press, Cambridge
12. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99:7821–7826
13. Granovetter MS (1973) The strength of weak ties. Am J Sociol 78(6):1360–1380
14. Hansen MT (1999) The search-transfer problem: the role of weak ties in sharing knowledge across organization subunits. Adm Sci Q 44(1):82–111
15. Hansen DL, Shneiderman B, Smith MA (2011) Analyzing social media networks with NodeXL: insights from a connected world. Elsevier, Burlington
16. http://gephi.github.io/
17. http://www.cytoscape.org/
18. http://www.lanl.gov/about/facts-figures/talent.php
19. Karagiannis T, Vojnovic M (2008) Email information flow in large-scale enterprises. http://research.microsoft.com/pubs/70586/tr-2008-76.pdf
20. Kilduff M, Tsai W (2003) Social networks and organizations. Sage, London
21. Kossinets G, Watts DJ (2006) Empirical analysis of an evolving social network. Science 311:88–90
22. Krackhardt D (1992) The strength of strong ties: the importance of Philos in organizations. In: Nohria N, Eccles RG (eds) Networks and organizations: structure, form, and action. Harvard Business School Press, Boston
23. Krackhardt D, Hanson JR (1993) Informal networks: the company behind the chart. Harvard Bus Rev 71(4):104–111
24. Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data 1(2)
25. Markus ML (1994) Electronic mail as the medium of managerial choice. Organ Sci 5:502–527
26. Mitzenmacher M (2004) A brief history of generative models for power-law and lognormal distributions. Internet Math 1:226–251
27. Papadopoulos F, Kitsak M, Serrano MA, Boguñá M, Krioukov D (2012) Popularity versus similarity in growing networks. Nature 489:537–540
28. Phelps C, Heidl R, Wadhwa A (2012) Knowledge, networks, and knowledge networks: a review and research agenda. J Manag 38:1115–1166
29. Polanyi M (1966) The tacit dimension. Doubleday, Garden City
30. Ravasz E, Barabási A-L (2003) Hierarchical organization in complex networks. Phys Rev E 67:026112
31. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási A-L (2002) Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555
32. Sims BH, Sinitsyn N, Eidenbenz SJ (2013) Visualization and modeling of structural features of a large organizational email network. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining. ACM, New York, pp 787–791
33. Tortoriello M, Reagans R, McEvily B (2012) Bridging the knowledge gap: the influence of strong ties, network cohesion, and network range on the transfer of knowledge between organizational units. Organ Sci 4:1024–1039
34. Wuchty S, Uzzi B (2011) Human communication dynamics in digital footsteps: a study of the agreement between self-reported ties and email networks. PLoS ONE 6(11):e26972
35. Zeini S, Göhnert T, Hoppe U, Krempel L (2012) The impact of measurement time on subgroup detection in online communities. In: Proceedings of the 2012 IEEE/ACM international conference on advances in social networks analysis and mining. IEEE, Los Alamitos, pp 389–394
Overlaying Social Networks of Different
Perspectives for Inter-network Community
Evolution
1 Introduction
1.1 Motivations
Recent studies propose to analyze dynamic networks [1, 8, 17] to detect community
evolution. Most of these studies use topological properties to identify the updated
parts of the network and characterize the type of changes such as network shrinking,
growing, splitting, and merging [3].
There are many studies on community detection [6, 21]. A well-known
approach is described in [7]; it is based on the intuition that groups within a
network may be detected through natural divisions among the vertices, without
requiring the number of groups to be set in advance or restrictions to be placed
on their size.
Many other approaches have been developed for tracking the evolution of social
communities over time [1, 16, 23, 27]. To that end, they use several static views of
the network at different time slots. For each view, one may use an existing community
detection algorithm [6] to depict the community topology. Between two time points,
changes such as network growth or partition may occur. Most of the new community
detection approaches are built on an underlying event framework that defines specific
community behaviors, such as birth, growth, and merging, during network
evolution [1].
More recent approaches study different issues for heterogeneous information
networks [25] which contain more than one type of links or nodes. Each type of link
indicates a specific relationship between actors. A simple example is a network that
describes two types of nodes, Researcher and Publication, and two categories of links,
collaboration between researchers and authorship between researchers and publica-
tions. Indeed, the authors in [25] report different studies on mining and analyzing
such networks and tackle many challenging issues such as dynamic network/group
detection, behavior analysis of an actor over time based on the network content or
the actions of other actors [11], relationship prediction, node ranking combined with
clustering (or classification), and similarity search (e.g., look for researchers who
have similar profiles).
In [11], the authors rely on social bookmarking to analyze communities over time.
The approach assumes that aggregating the non-coordinated tagging actions of a
large and non-homogeneous group of actors can be exploited for enhanced knowledge
discovery and sharing. Therefore, based on the tags and the actors who choose them,
they provide a framework for community-based organization of web resources.
To summarize, tracking community evolution makes it possible to foresee the overall
trend of a group and to anticipate some of the positive or negative effects it leads to. For
example, detecting the growth of a botnet at an early stage may help foresee criminal
or suspicious attacks. The approach proposed in the present work is closely related to
recent approaches that oversee evolving networks, since it relies entirely on
actor behavior with respect to the activities that occur in a single network or even in
many networks. Moreover, contrary to most of these studies, we assess the relevance of
social activities using possibility theory, which helps find communities more
accurately.
1.2 Contributions
In this paper, we do not focus directly on detecting community evolution, as
is often the case in the literature; instead, we aim to track temporary communities,
which are built from temporary ties created between a set of actors during a
time slot. Basically, we assume that actors may have temporary links (e.g., during a
set of activities) that might disappear afterwards. Such links are mined in order to
extract dominant features of the network, such as temporary communities that we call
perspective communities. Moreover, we use temporary links to identify active and/or
passive actors.
Our approach relies on the methods described in [18, 24], where the authors
identify a social network from collected temporary data. Moreover, most of the
solutions proposed to detect communities use statistical inference methods
based on probability theory, which achieve relatively good performance.
In most cases, modeling processes are built to get results with high proba-
bilities (90 %). In this work, we try to go beyond such techniques, and our main
contributions can be summarized as follows:
• A method to track changes within a social network by identifying temporary links
established between actors during activities in a given set of time slots. The tempo-
rary links are obtained using probability and are mined afterwards in order to extract
dominant features of the network, such as perspective communities.
• A relationship prediction method based on possibility distributions to overlay a
set of networks in order to unveil hidden communities. Our approach is based
48 I. Sarr et al.
on a very simple principle relating probability and possibility, which may be stated
informally as: what is probable should be possible. Using possibility
rather than probability theory has the advantage of handling the incompleteness
and uncertainty of the data from which prediction is conducted.
Consequently, the approach detects more precise temporary
links, as well as perspective communities that highlight the dynamic changes in
one or many networks over time.
The rest of this paper is structured as follows: Sect. 2 gives basic concepts and
definitions about social networks. Sections 3–5 present a mechanism to detect
network evolution over time, mainly how we identify active nodes and virtual
communities. Section 6 covers the validation of the approach, while Sect. 7 summarizes
our contribution and presents future work.
2.1 Activity
An activity is a social or professional event or task conducted by users. It could
be a meeting, conference, festival, concert, post, image publication, tweet/re-tweet,
etc. Inside a community or a whole network, activities are numbered and tags are
associated with them. For example, a tag in the Twitter micro-blogging platform may
be sport, high technology, culture, movie, etc.
Furthermore, actors may or may not be involved in a given activity. Formally, the
behavior of an actor k with respect to an activity a_i is represented as:

    b_k(a_i) = 1 if actor k attends activity a_i, and 0 otherwise.
To track activities over time, we consider that they happen in a given time window
ω_j = [T_j, T_j + Δ]. For each window, we capture a snapshot of the activities, which
may be of different types. To illustrate our approach, we consider a collaboration network
of researchers. Basically, the network is drawn based on co-authorship patterns, and
we track the co-participation of actors in activities such as meetings, conferences, or
social events. Moreover, we assume that ten activities happen within a single time
window. Table 1 depicts the matrix that shows the participation of researchers in a
set of activities. One may see that Researcher 1 takes part in activities a_2, a_3, a_5, a_8,
and a_10, since b_1(a_i) is equal to 1 for these activities.
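Table 1 itself is not reproduced here, but its structure is easy to encode. The fragment below uses a hypothetical participation row for Researcher 1 that matches the text (b_1(a_i) = 1 for a_2, a_3, a_5, a_8, a_10):

```python
# Hypothetical row of the participation matrix for Researcher 1.
b1 = {f"a{i}": 0 for i in range(1, 11)}      # ten activities a1..a10
for a in ("a2", "a3", "a5", "a8", "a10"):
    b1[a] = 1                                 # activities attended

attended = [a for a, bit in b1.items() if bit == 1]
print(attended)  # -> ['a2', 'a3', 'a5', 'a8', 'a10']
```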
Actors participating in activities may have joint interactions. For example, actors may
be linked because they interact or collaborate during a meeting or a conference. Such
interactions are considered temporary, since they are established during a time period
and may be broken later on. With this in mind, we define a perspective community as
a set of participating actors together with the temporary ties they share through joint
activities performed during a given time period.
The goal of this section is to describe how we track the behavior of network nodes over
time in order to identify active and passive actors. The advantage of such identification
is also discussed.
3.2 Algorithm
Given a network, Algorithm 1 computes the set of active actors for a set of
time windows. The input covers the nodes of the network, the set Ω, as well as the
matrices describing the participation of actors in activities in the windows within
Ω. For each actor and each time window, the algorithm computes the laziness ratio
(Lines 3–9). After processing all time windows in Ω, it checks whether the number
of times a given actor is active reaches at least the threshold R (Line 11). Function
add(AN, k) adds the node k to the set of active actors AN.
The complexity of the algorithm is proportional to the cardinality of the set C of
nodes in the network.
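As a rough sketch of this procedure (the exact laziness-ratio definition lies outside this excerpt, so a per-window activity rate stands in for it; all names and values are ours):

```python
def active_actors(participation, activity_floor, R):
    """Return the set AN of actors active in at least R time windows.
    `participation[k]` lists, per window, the fraction of that window's
    activities actor k attended (a stand-in for the laziness-ratio test)."""
    AN = set()
    for k, rates in participation.items():
        times_active = sum(1 for r in rates if r >= activity_floor)
        if times_active >= R:
            AN.add(k)
    return AN

rates = {"k1": [0.8, 0.9, 0.1], "k2": [0.0, 0.2, 0.1]}
print(active_actors(rates, 0.5, 2))  # -> {'k1'}
```

The cost is linear in the number of actors times the number of windows, matching the complexity remark above.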
3.3 Applications
The identification of active and passive nodes in a network has positive effects in
many real-life applications. In the following, we provide two possible uses of
our approach:
• Churn detection. After a subsequent set of time windows, our method identifies
inactive nodes that may be considered churners [10]. Churn detection is fruitful
for most service-based companies, such as telecommunication, banking, and social
network services, whose profitability may decrease with the loss of customers.
It is also useful to predict employee attrition based on the decrease of
employees' participation in social or professional activities within an organiza-
tion. Therefore, predicting or detecting customer or employee attrition at an early
stage gives companies more flexibility to apply appropriate incentives to keep
customers or employees in their business.
• Targeted marketing/advertising. Detecting active actors can be applied to identify
and/or rank actors who may react positively to a social or professional invitation
or to a product/service advertisement. In such a framework, only active actors will be
targeted, since they exhibit substantial participation in past activities and could
be future attendees or customers of the promoted event, product, or service.
Most of the online systems created in recent years, like Facebook and MySpace, offer a
rich set of activities and facilities for extensive interactions [4]. These systems record
both activities and interactions, thereby enabling the construction of a social network
after a unique sequence of activities. However, our goal is not to find perspective com-
munities after each activity, but after a set of activities that happen within a collection
of time windows. The main reason is that two nodes may interact during activities in
a selected window and never again in subsequent windows. Hence, using only
data from one window of activities is not enough to estimate the intensity of the link
between two nodes. Therefore, we consider a universe Ω = {ω_1, ..., ω_j, ..., ω_q}
of time windows. For each couple of actors (k, l), we consider the parameter vector
p_{k,l} = (p^1_{k,l}, p^2_{k,l}, ..., p^q_{k,l}) that characterizes the probability distribution of a ran-
dom variable X (the relation or link between actors) on the set Ω. The parameter
p^j_{k,l} is the probability that actor k is linked to actor l during the n activities found in
window ω_j:

    p^j_{k,l} = Σ_{i=1}^{n} M^j_{k,l}(a_i) / min( Σ_{i=1}^{n} b_k(a_i), Σ_{i=1}^{n} b_l(a_i) )    (2)
where M^j_{k,l}(a_i) is the Meeting function, which indicates whether both k and l attended
activity a_i in window ω_j and interacted. It corresponds formally to:

    M^j_{k,l}(a_i) = 1 if b_k(a_i) = b_l(a_i) = 1, and 0 otherwise.
In other words, p^j_{k,l} is the overlap coefficient, while Σ_{i=1}^{n} M^j_{k,l}(a_i) corresponds to
the matching coefficient. We use the overlap coefficient because it has been shown in [20]
to be better adapted to social network analysis than the matching and Jaccard
coefficients.
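The coefficient for one window can be computed directly from two participation vectors; a minimal sketch (interaction is assumed whenever both attend, which simplifies the Meeting function):

```python
def overlap(bk, bl):
    """p_{k,l} for one window: joint attendances (the Meeting function
    summed over activities) divided by the smaller attendance count."""
    meetings = sum(1 for x, y in zip(bk, bl) if x == 1 and y == 1)
    return meetings / min(sum(bk), sum(bl))

bk = [1, 0, 1, 1, 0]  # hypothetical participation vectors for one window
bl = [1, 1, 1, 0, 0]
print(overlap(bk, bl))  # joint = 2, min(3, 3) = 3 -> 2/3
```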
The intensity of the relation between two nodes can be set as their total
co-occurrences over the whole set of windows. This value is represented by the
parameter vector p_{k,l}. A heuristic method may consist in applying a threshold vector
Θ_c = (θ_1, θ_2, ..., θ_q) to p_{k,l} to decide whether a link can be added between k and l after
observing the activities in the set of windows. In fact, a link is added between k and l if
p^i_{k,l} ≥ θ_i for every time window ω_i.
Finally, the perspective communities based on a set of activities are identified as
follows:
• run Algorithm 1 to compute the set AN of all active actors;
• for each couple of actors k and l in AN, add a link between k and l whenever the
computed value p_{k,l} is at least equal to the user-defined threshold Θ_c for all the
time windows.
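The per-pair test in the second step reduces to a window-by-window comparison; a sketch with hypothetical values:

```python
def link_exists(p_vec, thresholds):
    # A link is added only if the overlap coefficient reaches the
    # threshold in every time window (the threshold vector of the text).
    return all(p >= t for p, t in zip(p_vec, thresholds))

p_kl = [0.5, 0.7, 0.9]                      # overlap coefficients per window
print(link_exists(p_kl, [0.4, 0.4, 0.4]))   # -> True
print(link_exists(p_kl, [0.6, 0.6, 0.6]))   # -> False (0.5 < 0.6)
```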
4.2 Example
To illustrate our approach, we consider the collaboration network of researchers
described in Sect. 2. Basically, we assume that the network is drawn based on
co-authorship patterns. The initial network is depicted in Fig. 1.
For the sake of clarity, we assume that Ω contains only the time window ω_j.
As a consequence, the threshold vector Θ_c is reduced to a single value. With
this insight, we set two distinct values of the threshold Θ_c, 40 and 60 %, and we
draw the resulting networks. Figures 2 and 3 depict the perspective networks when
Θ_c = (40 %) and Θ_c = (60 %), respectively. With the value 40 %, the perspective
community is dense, since new links are added even when two actors share a low
number of activities. That is why there are more links in Fig. 2 than in Fig. 1,
which represents the initial network. Furthermore, with a low value of Θ_c, real
closeness between two actors is not guaranteed. However, when Θ_c = (60 %), links are
added only between actors who participate together in at least 60 % of the activities.
This leads to more cohesive groups that share a common behavior. Figure 3 highlights
two distinct groups formed based on the intensity of the temporary links established
between actors.
Moreover, one may observe in Table 1 that the nodes in the group {4, 5, 6, 8} shown in
Fig. 3 have a smaller participation rate than the group {1, 2, 3, 7, 9}. If such
a behavior is observed (or reinforced) over subsequent time windows (or over a
long period of time), an attrition of the corresponding group may be expected. We
recall that perspective communities depict only temporary interactions (e.g., who
co-participates with whom) and are different from the more stable communities in the
initial network (e.g., the co-authorship network). However, when mapped over the initial
network, perspective communities give additional insight into new cohesive groups
that arise from activity participation.
4.3 Discussion
The approach presented above relies heavily on thresholds. However, it
is not easy to find appropriate threshold values for each case. Generally,
a heuristic method is used to compute such values. Even though efficient methods
can be devised to set threshold values and identify perspective communities based
entirely on probabilities, we believe that it might be more useful and effective to
reinforce such techniques with appropriate considerations. Therefore, we propose to
combine possibility and probability theory to improve the accuracy of the perspec-
tive communities built from the activity data. The main reason is that possibility
theory can be viewed as an upper bound on probability theory.
The modeling and management of uncertainty is one of the main issues in the
design of complex decision systems. Due to the diversity of information
sources, uncertainty can take one of the following forms: randomness, incomplete-
ness, and inconsistency. In our framework, different kinds of uncertainty can be
found with respect to: (i) the quality of the selected activities, (ii) the selection
of the appropriate number of time windows, and (iii) the choice of the underlying
distribution of identified random variables, such as links between nodes.
It is important to note that both possibility and probability theories can be used to
represent uncertainty [5]. However, they do not capture the same aspects of uncer-
tainty. In fact, the basic feature of probabilistic representations of uncertainty is
additivity. Uniform probability distributions may be used to model randomness on
finite sets, but they are ill adapted to expressing total ignorance in belief modeling. As
a consequence, probability theory offers a quantitative model for randomness and
inconsistency, while possibility theory offers a qualitative model of incompleteness.
In Sect. 4, where we propose a method for capturing community evolution using
probability theory, several important questions arise: (i) What is the most
appropriate number of time windows to consider during the process? (ii) How can
we properly quantify the possibility of links between actors for each time
window? And finally, (iii) how can we assign a relevance degree to each time window
when the importance of the underlying events is taken into account?
In the following, we address these questions and rely on possibility theory to
identify perspective communities more accurately. For a thorough view
of possibility theory, we refer the reader to [5, 29, 30]. Initiated by Zadeh [29],
possibility theory is based on a principle that involves the supremum operation.
The supremum (sup) is the least upper bound of a subset S of a totally or partially
ordered set T. According to Dubois et al. [5], a possibility measure Π on a set X is
characterized by a possibility distribution π : X → [0, 1] and is defined by
Π(A) = sup_{x∈A} π(x) for every subset A of X.
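This excerpt does not spell out how possibility degrees are inferred from the probabilities of Sect. 4; one standard choice is the Dubois-Prade probability-to-possibility transformation, sketched here as an assumption rather than the chapter's exact mapping:

```python
def to_possibility(probs):
    """Standard probability-to-possibility transformation:
    pi_i = sum of all p_j such that p_j <= p_i. It respects the
    principle 'what is probable should be possible' (pi_i >= p_i)."""
    return [sum(q for q in probs if q <= p) for p in probs]

print(to_possibility([0.5, 0.25, 0.25]))  # -> [1.0, 0.5, 0.5]
```

Note that every possibility degree dominates the corresponding probability, which is exactly the consistency principle invoked in the contributions.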
To predict a link, we need to find a threshold and apply it to the vector of possibility
distributions for each pair of actors, as done in Sect. 4 with probability distributions.
Thereafter, we consider that two actors may be linked if they both interact during
several time windows. The number of time windows within which two actors interact
is named the major participation (MP). Hence, if the major participation of two actors
exceeds a threshold, then a link is added between them.
To reach our goal, we define a new measure, π_norm, that helps decide whether a link
can be added between two actors. π_norm corresponds to the minimum of the possibility
distribution vector. Basically, the choice of the minimum value is straightforward and is
based on the worst case, i.e., the time window with the lowest number of interactions.
In other words, the worst case is the window with the least important activities and
the lowest degree of possibility. However, even though this worst window is the least
possible one, some actors might retain some interest in participating in its underlying
activities. This window can thus be used to establish the worst-case scenario, in which
only very few actors are linked to others.
The co-participation of two actors in the majority of time windows should at least
be greater than this minimum for a link to be drawn between them.
Thus, we can define the major participation of two actors as the fraction of
activity windows whose possibility degree exceeds this minimum:

    MP(k, l) = |{ i : π_i > π_norm }| / q .    (5)

In this formula, we recall that π_i is the component of the possibility distribution
vector containing the intensity of the relation for a given pair (k, l) of actors in
window ω_i. We consider that a link exists between two actors if and only if their
major participation exceeds a given user-defined threshold β that reflects the term
"majority of time windows".
Finally, actor k is linked to actor l if the following inequality holds:

    MP(k, l) ≥ β .    (6)
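The decision rule described here amounts to a simple counting test; a sketch with hypothetical possibility degrees (β stands for the user-defined majority threshold):

```python
def major_participation(pi_vec):
    """MP: fraction of time windows whose possibility degree strictly
    exceeds the floor pi_norm, taken as the minimum of the vector."""
    pi_norm = min(pi_vec)
    return sum(1 for p in pi_vec if p > pi_norm) / len(pi_vec)

pi = [0.2, 0.8, 0.9, 0.5]   # hypothetical possibility degrees per window
mp = major_participation(pi)
print(mp)                    # 3 of 4 windows exceed the minimum -> 0.75
beta = 0.6                   # hypothetical majority threshold
print(mp >= beta)            # -> True: a link is added between k and l
```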
5.3 Algorithm
Algorithm 2 builds a set of perspective communities in two steps. The first step
computes the probability distributions, while the second one identifies links between
actors after inferring possibility distributions.
The first step of the algorithm (Lines 4–8) finds the probability distributions for each
couple of actors. To this end, we compute, for each time window, the co-participation
rate of the couple of actors k and l (Line 7). Afterwards, we add the result to a multi-
dimensional array using Function addProb(p^j_{k,l}, p_{k,l}) (Line 8). Once this task
is completed, we compute for each vector of probabilities the related possibility
distribution (Line 11). If the possibility distribution for k and l exceeds a thresh-
old for a given time window, then we increment the number of times these actors
co-participate in activities (Line 16). Finally, the last part of the algorithm checks
whether the number of times two nodes co-participate at the same time is greater than a
threshold. If so, a link is added between them (Lines 20–21).
6 Model Evaluation
In this section we aim to validate our approach, mainly the possibility-theory solution,
in order to assess the accuracy of perspective community identification.
58 I. Sarr et al.
In order to successfully apply the proposed procedure, one can begin with an initial
network of linked actors. To that end, we run a simple algorithm based on prob-
abilities, which relies on the fact that two actors are linked if their total number of
co-occurrences exceeds a predefined threshold. For each pair of actors, we calcu-
late the probability of their co-participation within each window. If this probability
exceeds a given threshold, then the actors are linked within that time window. More-
over, we set a tie between two actors if they are linked in at least half of the time
windows, i.e., R = 0.5.
The initial networks are shown in Figs. 4 and 5. We detect 17 links for each of
scenarios C and E, and 34 links for scenario D.
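The construction of this initial network can be sketched as follows (a minimal illustration with hypothetical input probabilities, not the authors' implementation):

```python
# Sketch of the initial-network construction: two actors are tied in a
# window when their co-participation probability exceeds a threshold,
# and linked overall when tied in at least a fraction R of the windows.
def initial_links(co_prob, threshold, R=0.5):
    """co_prob maps (actor_k, actor_l) -> list of per-window
    co-participation probabilities; returns the set of linked pairs."""
    links = set()
    for pair, probs in co_prob.items():
        tied_windows = sum(1 for p in probs if p > threshold)
        if tied_windows >= R * len(probs):
            links.add(pair)
    return links

# Hypothetical co-participation probabilities over four time windows.
co_prob = {
    (2, 3): [0.10, 0.02, 0.12, 0.09],
    (2, 5): [0.01, 0.02, 0.01, 0.03],
    (7, 9): [0.08, 0.09, 0.01, 0.11],
}
print(sorted(initial_links(co_prob, threshold=0.05)))  # [(2, 3), (7, 9)]
```

Pairs (2, 3) and (7, 9) exceed the window threshold in three of four windows, so they are linked; pair (2, 5) never does.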
6.3 Validation
In the following we validate our approach on the initial networks shown in Figs. 4
and 5 and discuss the output of our procedure for perspective community detection.
As an illustration, we consider the initial network of scenario C (see Fig. 4) and
run the entire procedure on the universe Ω = {ω_1, ω_2, ..., ω_10} of ten time
windows, with ten actors who participate in the various activities within
each TW. The number of activities for the TWs is given by the following vector:
Fig. 6 Number of appearances of the nine active actors in the first time window of scenario C
(per-actor bar charts for Actors 2–10; x-axis: number of activities, y-axis: number of votes)
(1254, 1277, 1363, 1460, 1460, 1490, 1497, 1497, 1497, 1497). The actors are named
respectively {funny, GifSound, pics, gifs, atheism, gaming, WTF, aww,
reddit.com, 6}, but for simplicity we use the numbers 1, 2, ..., 10 to
identify them respectively. The execution of Algorithm 1 to discover active actors
gives the vector AN(Ω) = {2, 3, 4, 5, 6, 7, 8, 9, 10}. This result is obtained when
we manually set the threshold to the value R = 2/3. Only Actor 1 is considered
non-active since he does not have a sufficient number of participations within the
considered TWs. In Fig. 6, we show the number of actors' activities inside the first
TW. Similar results are obtained for Scenarios E and F. For each time window, Table 3
shows the values of p_i^- and p_i^+ (see Appendix B for more details) and indicates that
the value of the threshold π_norm is equal to 0.2775 (i.e., the lowest value of π_i^S).
To identify new links between nodes, we set the threshold β to 70 %. In Table 4,
after running Algorithm 2, we show the possibility distribution measures used to
decide whether a link may hold between a pair of actors.
For the sake of clarity, we report values only for five pairs of actors. We observe the
possibility values between Actors 2 and 3 and notice that seven (7) of the ten (10)
values of the vector are greater than or equal to the threshold π_norm = 0.2775.
Thus, MP(2, 3) = 7/10, i.e., MP(2, 3) ≥ β, and consequently a link between
Actors 2 and 3 is added. There is also a link between 2 and 4 and between 7 and 9
because MP(2, 4) = 7/10 ≥ β. Conversely, one can see that only five values of the
possibility vector for Actors 2 and 5 are greater than π_norm, i.e., MP(2, 5) = 5/10,
Table 3 Interval-valued probabilities, possibility distributions, and length of each time window for scenario C and α = 5 %

Time window i        1      2      3      4      5      6      7      8      9      10
p_i^-                0.0832 0.0848 0.0907 0.0973 0.0973 0.0993 0.0998 0.0998 0.0998 0.0998
p_i^+                0.0925 0.0941 0.1003 0.1072 0.1072 0.1094 0.1099 0.1099 0.1099 0.1099
π_i^S                0.2775 0.2808 0.7884 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
Length of window i   1254   1277   1363   1460   1460   1490   1497   1497   1497   1497
Table 4 Probability vectors of links between nodes and corresponding possibility distributions
(α = 0.05, i.e., confidence bounds set at 95 %)
Time window t 1 2 3 4 5 6 7 8 9 10
Possibility and distribution vector for nodes 2 and 3
Probabilities 0.10 0.09 0.09 0.09 0.10 0.08 0.09 0.07 0.14 0.15
Possibilities 0.71 0.42 0.51 0.24 0.61 0.15 0.32 0.07 0.85 1.00
Possibility and distribution vector for nodes 2 and 4
Probabilities 0.10 0.10 0.09 0.09 0.10 0.10 0.10 0.10 0.10 0.11
Possibilities 0.49 0.79 0.09 0.19 0.59 0.29 0.89 0.69 0.39 1.00
Possibility and distribution vector for nodes 7 and 9
Probabilities 0.08 0.09 0.10 0.09 0.11 0.08 0.06 0.06 0.14 0.18
Possibilities 0.28 0.47 0.57 0.38 0.68 0.20 0.12 0.06 0.82 1.00
Possibility and distribution vector for nodes 2 and 5
Probabilities 0.06 0.07 0.05 0.03 0.07 0.04 0.03 0.06 0.33 0.25
Possibilities 0.28 0.42 0.16 0.06 0.34 0.11 0.03 0.22 1.00 0.67
Possibility and distribution vector for nodes 7 and 8
Probabilities 0.07 0.07 0.10 0.06 0.05 0.06 0.04 0.08 0.25 0.23
Possibilities 0.28 0.35 0.52 0.15 0.09 0.21 0.04 0.43 1.00 0.75
Case of scenario C
i.e., MP(2, 5) < β and, thus, there is no link between these two nodes. There is no
link between Actors 7 and 8 either, because MP(7, 8) = 6/10 is less than the value of
the threshold β.
After computing MP for each pair of nodes, we get the perspective communities
shown in Fig. 7, where dashed lines represent newly added links. After running our
procedure, links are added to the initial networks shown in Figs. 4 and 5. Such new
links help identify perspective communities. In the top left part of Fig. 7, built for
scenario C, we observe that Nodes 2, 3, 5 and 7 form a community even though the
other actors are also active. Other detected communities are {2, 3, 4, 7}, {2, 3, 4, 8}
and {2, 3, 7, 9}. The same reasoning applies to the top right graph of Scenario
E. In the third graph at the bottom, one can see that Actors 3 and 9 have the largest
number of links with other nodes. These cases are interesting in the sense
that one can focus on perspective communities and leading nodes to take appropriate
real-life decisions about their underlying activities and evolution.
Fig. 7 Perspective community evolution. The top left graph represents scenario C while the top
right one is for scenario E and the bottom graph is for scenario D
hence lead to more reliable results. In Fig. 8 one can see the effect of the confidence-
level variation on the number of detected links. This result is not surprising, but it was
not obvious that tuning the parameter α would make the algorithm reach a stationary
regime in which the number of detected links does not increase beyond a certain value.
Another result is related to the variation of the link detection threshold, i.e., the
variable β. We set this threshold to 70 %, but Fig. 9 shows how the number of detected
links depends on the value of β: as β increases, the number of links decreases.
Since the possibility measure lies between 0 and 1, increasing the detection threshold
naturally reduces the number of detected links.
In this paper, we present the premises of a new approach based on user activities over
time to detect community evolution within a social network. We first capture snapshots
of the network at different time periods and then analyze the underlying social
network in order to identify active actors and perspective communities. Nodes that
have a high rate of participation are called active, and they form, together with their
interactions, the perspective communities.
Fig. 8 Tradeoff between the value of α and the number of detected links for scenarios C, D, and E
when the threshold β is set to 70 % (x-axis: confidence bound 1 − α; y-axis: number of detected links)
Fig. 9 Tradeoff between the detection threshold β and the number of links detected for the set of 10
TWs for scenario C (the plot also includes curves for scenarios D, E, and F)
Our approach can be useful to reveal central actors. It can also highlight how
perspective communities defined over time may increase information flow. In fact,
besides tracking the evolution of the network, our approach gives a basic way to
detect churners. Churn detection at an early stage is very fruitful since it gives
companies more flexibility to apply appropriate incentives to keep their customers.
Furthermore, mapping perspective communities onto an (initial or important) network
adds new links that improve the network accessibility and hence the information flow.
These benefits, combined with the low complexity of our algorithms, let us argue that
our approach is promising.
We plan to carry out a set of new experiments to assess the performance of the
proposed approach and its accuracy regarding churn detection and social influence
identification. Presently, we assume that all activities have the same importance;
however, ongoing work aims to differentiate activities within a window. Further-
more, we plan to provide a way to estimate a reasonable size for the time windows,
and to study the correlation between users' interactions across several time windows.
Finally, we are collecting data from various networks in order to find perspective
communities that emerge from the superposition of several networks.
Acknowledgments The third author acknowledges the financial support of the Natural Sciences
and Engineering Research Council of Canada (NSERC).
The most specific possibility distribution compatible with the probability distribution
p = (p_1, p_2, ..., p_q) can then be obtained by taking the maximum over all possible
permutations:

π_i = max_{u=1,…,L}  Σ_{ j : σ_u^{-1}(j) ≤ σ_u^{-1}(i) } p_j   (10)

The permutation σ is a bijection, and the reverse transformation σ^{-1} gives the rank
of each p_i in the list of the probabilities sorted in ascending order. The number
L of permutations depends on the duplicated p_i in p. It is equal to 1 if there is no
duplicate p_i, and in this case the order ≼_P is a strict linear order on Ω.
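When there is no prior order constraint, Eq. (10) reduces to summing, for each class, all probabilities not larger than p_i (ties included). A small illustrative sketch (not the authors' code):

```python
def prob_to_poss(p):
    """Optimal probability-to-possibility transform of Eq. (10):
    pi_i is the sum of all probabilities p_j <= p_i (ties included),
    which equals the maximum over the compatible permutations."""
    return [sum(pj for pj in p if pj <= pi) for pi in p]

p = [0.5, 0.3, 0.2]
print(prob_to_poss(p))  # [1.0, 0.5, 0.2]
```

Note that the most probable class always receives possibility 1, so the resulting distribution is normalized.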
P(p ∈ C_n) ≥ 1 − α   (11)

We can use the Goodman [9] formulation in a series of derivations to solve the
problem of building the simultaneous confidence intervals.
A = χ²(1 − α/K, 1) + N,   (12)

B_i = χ²(1 − α/K, 1) + 2n_i,   (13)

C_i = n_i² / N,   (14)

Finally, for each class of activities A_i, i = 1, …, K, the bounds of the confidence
intervals are defined as follows:

[p_i^-, p_i^+] = [ (B_i − (B_i² − 4AC_i)^{1/2}) / (2A), (B_i + (B_i² − 4AC_i)^{1/2}) / (2A) ].   (16)
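Equations (12)–(16) can be computed directly. The sketch below is an illustrative implementation (not the authors' code) using only the standard library, exploiting the fact that the χ² quantile with one degree of freedom is the square of a normal quantile:

```python
from math import sqrt
from statistics import NormalDist

def goodman_intervals(counts, alpha=0.05):
    """Simultaneous confidence bounds [p_i^-, p_i^+] for multinomial
    proportions, following Goodman's construction (Eqs. 12-16).
    counts: list of class counts n_i for K classes, N observations."""
    K, N = len(counts), sum(counts)
    # chi2(q, 1) equals z**2, where z is the normal quantile at (1 + q)/2.
    q = 1 - alpha / K
    b = NormalDist().inv_cdf((1 + q) / 2) ** 2
    A = b + N                                    # Eq. (12)
    bounds = []
    for n_i in counts:
        B = b + 2 * n_i                          # Eq. (13)
        C = n_i ** 2 / N                         # Eq. (14)
        d = sqrt(B * B - 4 * A * C)
        bounds.append(((B - d) / (2 * A), (B + d) / (2 * A)))  # Eq. (16)
    return bounds

# Hypothetical counts for K = 3 activity classes.
for lo, hi in goodman_intervals([30, 30, 30]):
    print(round(lo, 3), round(hi, 3))
```

Each interval brackets the empirical proportion n_i/N, and the intervals widen as α decreases or K grows.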
This partial order may be represented by the set of its compatible linear extensions
Λ(P) = {l_u, u = 1, …, L}, or equivalently by the set of the corresponding permutations
{σ_u, u = 1, …, L}. Then, for each possible permutation σ_u associated with each linear
order in Λ(P), and each class A_i, we can solve the following linear program:

π_i^u = max_{p_1, …, p_K}  Σ_{ j : σ_u^{-1}(j) ≤ σ_u^{-1}(i) } p_j   (18)

Finally, we can take the distribution of the class A_i dominating all the
distributions π^u:

π_i = max_{u=1,…,L} π_i^u,  i ∈ {1, …, K}   (20)
Complexity
References
Abstract The Iranian presidential election and its post-election events were the most
engaging topic of the year 2009 among Twitter users. In this paper, we study the social
network among the users that were engaged in that topic during an 18-month period
of observation. We analyze the content of tweets that were published in English or
Persian by Iranian people or others around the world and extract the most trending
topics on critical days. We also study the sub-communities.
1 Introduction
The Twitter website, launched in 2006, offers a social networking and micro-blogging
service that lets users send and receive short messages called tweets.
Tweets are text-based messages of up to 140 characters, which are visible on the
website or can be accessed through third-party applications. The rate of publication
on Twitter is more than one million messages per hour. At first, the idea was to
indicate one's personal status for friends. But, these days, Twitter is used for various
forms of posts, from political news to information such as short phrases, URLs, and
direct messages to other users.
Especially before and during elections, the political atmosphere is clearly visible in
the tweets posted by many users. In addition, political meetings are arranged and
announced to supporters on Twitter.
The 10th Iranian presidential election was one of the most important political
events in Iran after the revolution in 1979. This election was held on 12 June 2009, with
incumbent Mahmoud Ahmadinejad running against three challengers:

M. Mousavi: An Iranian reformist politician, artist and architect who served as the
last Prime Minister of Iran, from 1981 to 1989.

M. Karoubi: An influential Iranian reformist politician and democracy activist. He
was chairman of the parliament from 1989 to 1992 and from 2000 to 2004.

M. Rezaei: An Iranian politician, economist and former military commander.
Rezaei was the chief commander of the Iranian Revolutionary Guard Corps for 16 years
(1981–1997).
According to the official result, Ahmadinejad won the election with more than
two-thirds of the votes. However, Mousavi and the other candidates did not accept the
results; they asked their supporters to hold peaceful demonstrations, and demonstrations
took place in the large cities of Iran. The situation on 13 June was described as the
biggest unrest since the 1979 revolution. Mousavi urged calm and asked that his
supporters refrain from acts of violence. However, the struggle between the security
forces and protesters turned violent after some days of unrest. The government
tried to push back the demonstrations, and some opposition politicians were arrested.
The protesters used social networking and social media websites such as Facebook,
YouTube, and Twitter to organize their meetings and rallies. To control the situation,
some Internet services went down and the Short Message Service (SMS) was blocked
by the authorities.
Meanwhile, Twitter postponed a scheduled upgrade for some hours in order to let people
cover news on the Iranian election.1 Facebook launched its support for the Persian lan-
guage ahead of schedule.2 Google released its Persian translator ahead of
schedule.3 The Iranian election was deemed the most engaging topic of the year: the
terms #iranelection, Iran, and Tehran were among the top trending topics of 2009
on Twitter.4
Here, we analyze the tweets that were published about the Iranian election from
3 months before the election to 15 months after it. We study the social network among
the users and analyze the content of tweets.

The rest of this paper is organized as follows: in the next section, previous works
are reviewed. In Sect. 3, we explain our data collection method, look at the
dynamics of user registration on Twitter, and find the critical days in the post-election
events according to the number of tweets per day. In Sect. 4, we analyze the trending
keywords, and in the next section we study the most influential websites that were
cited in tweets. In Sect. 6, we take a look at the social network among users and their
communities. Finally, the conclusion and future works follow.
blog.php?post=97122772130.
3 Google translates Persian. The official Google Blog. [online] http://googleblog.blogspot.com/
2009/06/google-translates-persian.html.
4 Top Twitter Trends of 2009. The Official Twitter Blog. [online] http://blog.twitter.com/2009/12/
top-twitter-trends-of-2009.html.
Study of Influential Trends, Communities, and Websites on the Post-election Events . . . 73
2 Previous Works
Various studies report the important role of Social Networking websites on the
political events in different countries [2, 4, 10, 11].
Reference [1] measures the degree of interaction between 40 liberal and
conservative blogs over a period of two months, and their effects on the 2004 U.S.
election.

Reference [7], with the help of some Persian natives, categorizes the Persian
weblogs, finds the main poles, and studies the relationships among different poles. Ref-
erence [12] introduces a new dataset on Persian blogs and analyzes the network.

Different works have been done especially on Twitter. Reference [13] analyzes
more than 100,000 tweets mentioning parties or politicians prior to the German
federal election of 2009.

After the Iranian presidential election in 2009, more researchers were attracted to
the Iranian events and the Persian social network [9, 14]. In our previous work [8], we
studied the role of Twitter on that election and the events after it. In this paper, we
focus more on the content of tweets.
3 Data Collection
Our dataset consists of 1,375,510 tweets from 6,721 users, which contain the iranelection
tag. They were published in a period from 3 months before the election up to
15 months after it (18 months in total). The following information about users is
accessible on Twitter: id, name, number of followers, number of friends and account
creation date. Also, the following information about tweets is accessible: id, owner
user, body text, creation date.

Figure 1 shows the histogram of the number of users' tweets. More than two
thousand (2,128) users have just one tweet with the iranelection tag, and 603 users
have two. There is also a user with 6,826 tweets. For clarity, the horizontal
axis shows only the users who have written fewer than 500 tweets with #iranelection.

All tweets in our dataset are either in Persian or English. Based on that, we
categorize all users into two groups: (1) Persian natives (P-Users): users who have
published at least one tweet in Persian (4,634 users). A P-User may have written
tweets in both languages or all in Persian. (2) Foreign users (EN-Users): users
who do not have any Persian tweet and publish all their tweets in English (6,722
users).
Fig. 2 Number of users who signed up to Twitter in each month. Most users signed up in March
(the beginning of the new year in the Persian calendar), April, May and June (the month of the election)

The figure clearly shows that most users signed up in March (the beginning of
the new year in the Persian calendar), April, May and June (the month of the election), 2009.
Considering that the Iranian election was held on June 12, 2009, some interested users signed
up for Twitter in the early months in order to diversify their sources of information. On
May 23, 2009, the Iranian government started filtering Twitter. That might be why the
number of new users in this month is a bit smaller than in April, as newcomers did not
know how to use anti-filtering software. The number of new users reached its peak in
June and then declined until the next March, when it reached a negligible number.
Note that the figure does not mean the Iranian users were no longer interested in
Twitter. Since we have focused only on the tweets about the presidential election and
post-election events, the figure means only that the #iranelection issue was no longer
interesting to the Iranian community, not the whole of Twitter.

Fig. 3 Number of users who signed up in June 2009. Starting from the day of the election, users
joined Twitter at an accelerating rate
In order to find the reason behind joining Twitter from Iran, we took a closer
look at the daily rate of sign-ups. Figure 3 shows that, starting from the day of the elec-
tion, users joined Twitter at an accelerating rate until four days after the election.
The acceleration might be because text-messaging services were down on mobile
networks during the day of the election. Therefore, people started using Twitter, along
with other social networking sites like Facebook, to send news about the election to the
outside world. Micro-blogging services provided a fast way for protesters to share
their observations and information and possibly to organize the next protests.

The largest peak of the diagram corresponds to the mass rally of protesters on June
15. After this day, the rate of new users suddenly dropped until the 19th. On Friday, June
19, 2009, which was a weekend day in Iran, Ayatollah Khamenei (the supreme leader)
made a hardline speech at Friday prayers. On Saturday (the first day of the week
in the Persian calendar), June 20, the number of new users increased a bit. On this day,
the opposition (green) movement continued its protests in response to the invitation
of the two defeated candidates, Mousavi and Karroubi. The other crucial event of this
day was a meeting of Iran's powerful Guardian Council, which had invited the three
defeated candidates to express their complaints. Then, the number of new users
declined more and more.
76 S.A. Tabatabaei and M. Asadpour
Fig. 4 Number of tweets posted on each day. Peaks of the graph correspond to critical events
4 Trend Analysis
In this section we try to find out what has happened on the most important days. To
specify whether something has happened on a specific day or not, we look at the rate
of tweet publication and find out the most prolific days. In order to find out what has
happened on these days we extract trending keywords of that day.
To find out important days, we measure the activity of users in this network on
different days. Figure 4 shows the total number of published tweets per day. Peaks
of the graph correspond to critical events. Among them, the marked ones will be
explained below and their trending keywords are extracted.
The first week after the Iranian presidential election was the most prolific period
for protesters. (1) The tweet publication rate started to increase on June 12, the day of
the election, and reached its maximum on June 20 and 21. One day after the speech
of the supreme leader at Friday prayers, June 19, Mousavi insisted on the annulment
of the election, and a rally took place in Tehran. Neda Aqa Sultan was killed, and news
and video of her death spread over the media. (2) A rally in memory of the stu-
dent protests of July 9, 1999 took place. (3) Friday prayers were held by Hashemi
Rafsanjani; supporters of both the reformist and conservative parties took part in this
event. (4) The 4th peak corresponds to the Qods day rally on September 18. Although it
is an annual rally in support of the Palestinian people, protesters came to the streets and
voiced their objections to the government crackdown. (5) The Students day rally was
held on November 4. (6) One month later, on Scholars day, university students held
a protest against the government's policies. (7) On Ashura, which is the most impor-
tant religious event in Shia Islam, there was a rally in support of the leaders of the green
movement, which finally led to violence. (8) On February 11th, there was a mass
rally in support of the government, in which protesters failed to show their disagreement
with a crackdown. (9) The last peak corresponds to the election anniversary.
In this section we focus on the trend of the day. To do this, we analyze the tweets which
were published on each day and extract their keywords. Our purpose is to show the
relationship between the keywords of tweets and the events that happened on that specific
day. To do this, we modified the TF-IDF method [6] to suit our purpose. In this
section, first, the TF-IDF method is explained briefly; then the changes we have made
are explained. Table 1 shows the results of this analysis.

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a
real-valued measure which is used for keyword extraction. Its value reflects how
important a word or term (t) is to a document (d) in a corpus5 (D).

The value of TF-IDF(D, d, t) increases proportionally with the number of times
term (t) appears in the document (d), but is offset by the frequency of documents of the
corpus (D) that contain the word (t):

TF-IDF(D, d, t) = TF(d, t) × IDF(D, d, t)   (1)
where TF(d, t), the term frequency, is defined as the number of times a given term (t)
appears in document (d), and IDF(D, d, t) is defined as:

IDF(D, d, t) = log( |D| / |{d ∈ D : t ∈ d}| )   (2)

where |D| is the total number of documents in the corpus, and |{d ∈ D : t ∈ d}|
is the number of documents that contain term t.
The value of TF-IDF is low for words with a low term frequency, and also for
words with a high document frequency (i.e., stop words like a, the and of). On
the contrary, TF-IDF is maximized by a high term frequency (in the given document)
and a low document frequency of the term in the whole collection of documents. So,
we can say the words with a high TF-IDF value are those that appear many
times in one document but rather few times in other documents, i.e., the keywords of
that document.
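Following Eqs. (1) and (2), a minimal TF-IDF sketch (illustrative only, with made-up documents, not the authors' implementation) could look like this:

```python
import math

def tf(doc, t):
    """Term frequency: number of times t appears in doc (a token list)."""
    return doc.count(t)

def idf(corpus, t):
    """Inverse document frequency of Eq. (2)."""
    df = sum(1 for doc in corpus if t in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(corpus, doc, t):
    """TF-IDF of Eq. (1)."""
    return tf(doc, t) * idf(corpus, t)

corpus = [
    ["rally", "in", "tehran", "in", "june"],
    ["election", "results", "in", "june"],
    ["rally", "rally", "near", "the", "square"],
]
# "in" does not occur in the last document, so its score there is zero;
# "rally" is frequent in that document but appears in only 2 of 3 documents.
print(tf_idf(corpus, corpus[2], "in"))     # 0.0
print(tf_idf(corpus, corpus[2], "rally"))  # 2 * log(3/2)
```

Terms concentrated in one document thus score highest, matching the keyword-extraction intuition above.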
In order to identify the trends of the tweets published on a given day, we use
a customized version of the TF-IDF method:
Table 1 (continued)

Peak 8 — Keywords: Sadeqiyeh, Aryashahr, Squares, Enqelab; Eshraqi, Granddaughter
  Description: Clashes between people and security forces happened around Sadeqiyeh Sq. in the
  Aryashahr district and near Enqelab Sq.; Zahra Eshraqi, granddaughter of Ayatollah Khomeini,
  was arrested and released shortly after

Peak 9 — Keywords: Square, Vanak, Valiasr, Sidewalk
  Description: People marched silently on the sidewalks of Vanak Sq., Valiasr Sq. and other
  places to show their opposition
1. We do not have the concept of a document here. However, we append all tweets
published on the same day and consider them as a single document.

2. Trending keywords usually continue to appear in the tweets of the succeeding
days for a long time (until the public's interest in the trend vanishes). For
example, Neda is a term that is used on almost all days after Neda's death. Since
it appears in many documents, if we used the usual TF-IDF method its TF-IDF
value would not be high, and it would not be considered a keyword, whereas,
at least for the first day it was used, it should be considered a trending
keyword. To overcome this problem, we specify an overlapping time window of
30 days (the focused day and the 29 days before it) over which the TF-IDF method
is applied.

So here, to calculate the value of TF(d, t), all tweets published on a specific day
are considered as the document d, and to calculate the value of IDF(D, d, t), the
tweets of that day and those of the 29 days before it are considered as the corpus D.
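This day-as-document, sliding-corpus variant can be sketched as follows (a simplified illustration with hypothetical data structures and function names, not the authors' code):

```python
import math

def trending_keywords(tweets_by_day, day, window=30, top=3):
    """Score the terms of a given day with TF-IDF, where the day's tweets
    form the document d and the day plus the preceding (window - 1) days
    form the corpus D (one pseudo-document per day)."""
    days = [d for d in sorted(tweets_by_day) if day - window < d <= day]
    corpus = {d: [w for tw in tweets_by_day[d] for w in tw.split()]
              for d in days}
    doc = corpus[day]
    scores = {}
    for t in set(doc):
        df = sum(1 for d in days if t in corpus[d])
        scores[t] = doc.count(t) * math.log(len(days) / df)
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Days are indexed by integers; "neda" bursts on day 9 only.
tweets_by_day = {
    7: ["rally in tehran", "election results"],
    8: ["rally again", "election protest"],
    9: ["neda neda neda", "rally for neda", "election news"],
}
print(trending_keywords(tweets_by_day, day=9)[0])  # neda
```

The bursty term dominates, while words used on every day of the window (here "rally" and "election") get an IDF of zero and drop out, which is exactly the behavior the windowing is meant to achieve.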
By using the explained method, TF-IDF values are calculated for all terms that
appear in the tweets published in a day. And, terms with highest value of TF-IDF are
considered as the trending keywords of that day. Table 1 shows the trending keywords
of the important days mentioned earlier (Fig. 4). A description about what happened
on those days explains the relation between the keywords and the events happened
that day.
5 Influential Websites
In this section, around 400,000 URLs, which were cited in the tweets, are analyzed. As
mentioned, tweets can link to other websites and online content, e.g., news agencies
and social media. Because of the restriction on the number of characters in each
tweet, URLs are usually shortened by URL shortening services like bit.ly or
tinyurl.com. We tried to resolve the main URLs in these cases; however, some of them
were no longer valid. Here we only report the valid ones.
Table 2 shows the 25 top referenced websites along with the number of tweets and
the number of users who mentioned them. The first rows of the table are occupied
by important websites like YouTube (for coverage of videos from the events), Twitter
and related sites like TwitPic, Twubs, and TwitLonger (for their rapid information
diffusion potential), Google (for its news services), and Facebook and Formspring
(for social networking).
The first website that belongs to a Persian group is Foozools, followed by
HamsedayeIran, a Persian weblog. It is interesting to note that these sites have been
used by a small number of users. This means that a small group of users tried to
exploit the situation and advertise their favorite websites by abusing the iran-
election tag (these kinds of websites are highlighted). HamsedayeIran is similar to
Solaleh7 in its content, and they both belong to an armed terrorist group called
Monafeqin.

The top websites that specifically address the green movement are HelpIran-
Election.com, Iran.WhyWeProtest.net, and RaheSabz.net. Finally come the news agencies
and newspapers like CNN, BBC, Reuters, NY Times, and Guardian.
Table 3 shows the most popular websites (according to the number of users who
cited them in their tweets) among English-speaking users, along with the number
of users who have used them. Highlighted rows show websites that are popular
only among English-speaking users.

HelpIranElection.com is a website that encouraged Twitter users to change their
avatar to carry a green overlay or green ribbon (green was the official color of the
movement).
Table 4 shows the most popular websites (according to the number of users who referred to
them in their tweets) among Persian users, along with the number of English-speaking
users who referred to them. Highlighted rows show websites that are popular only
among Persians.

The top-most websites in this table are almost the same as in Table 3, except
Rahesabz.net. It is one of the most popular news websites related to the green move-
ment; it started its work a few days after the presidential election (June 20, 2009).
This website is written in Persian, so it is not surprising that English-speaking users
did not refer to it.
6 Follower-Followee Network
Users of Twitter can follow other users and may in turn be followed by others.
The graph in Fig. 5 shows the follower-followee relationships among the users in our
dataset. Nodes correspond to users. The size of a node is proportional to the number of
followers the user has, and its color shows the user's language. Nodes that link to
each other are placed closer together. To visualize this network, we used the ForceAtlas2
(Ref. [5]) layout of the Gephi6 (Ref. [3]) open-source software.
It is clear from the graph that users are divided into two big communities according
to their language. This is not surprising. The users who are placed between the
Fig. 5 Follower-followee network: Persian- and English-speaking users are shown in green and
blue, respectively
the spread of news to the outside world. In the next section we will take a deeper
look into the core communities.

In this section, we take a look at the users who have somehow supported three
political groups: Monafeqin, Jebheh Mosharekat, and Mojahedine Enqelab. Monafe-
qin, as explained earlier, is an armed terrorist group based in Iraq. Jebheh Mosharekat
is a reformist group very close to Khatami, the former president of Iran. Mojahedine
Enqelab is another reformist group.
In this subsection, we find the users who support these political groups and compare
the communities of their supporters to each other. To do this, we found all users
who had published at least one tweet in support of one of those three groups. In
order to specify which tweet is in support of which group, we first collected the
keywords related to those groups (e.g., the name of the group and the names of its
famous members). Then we found the users who had used those keywords in their
tweets. Finally, to clarify the opinion (positive or negative) of a user about the group,
we read some of the tweets of the users that contained the mentioned keywords.
If a user had published at least one tweet in support of a group, we marked him/her
as a supporter.
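The labeling procedure above can be sketched as follows. The group names, keywords, and tweets are placeholders, and the stance value stands in for our manual reading of tweets, so this is a hedged sketch of the logic rather than the actual annotation code:

```python
# Keyword lists per group (placeholders, not the real keyword sets).
group_keywords = {
    "group_a": {"keyword1", "leader_a"},
    "group_b": {"keyword2", "leader_b"},
}
tweets = [  # (user, lowercased tweet tokens, manually judged stance)
    ("u1", {"keyword1", "protest"}, "positive"),
    ("u2", {"keyword1"}, "negative"),
    ("u3", {"leader_b", "rally"}, "positive"),
]

# A user is a supporter of a group if at least one of their tweets
# matches the group's keywords and was judged positive.
supporters = {g: set() for g in group_keywords}
for user, tokens, stance in tweets:
    for group, kws in group_keywords.items():
        if tokens & kws and stance == "positive":
            supporters[group].add(user)

print(supporters)  # {'group_a': {'u1'}, 'group_b': {'u3'}}
```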
Fig. 6 a The core communities: green, Persian-speaking users; blue, English-speaking users;
b supporters of Monafeqin; c supporters of Jebheh Mosharekat; d supporters of Mojahedine Enqelab
For better visualization, the core community of Fig. 5 is magnified in Fig. 6a.
Figure 6b–d show supporters of Monafeqin, Jebheh Mosharekat, and Mojahedine
Enqelab, respectively. It can be seen that supporters of Monafeqin are a few small
nodes congregated in one place. However, supporters of the two other groups are
scattered across the whole Persian-speaking user community and include many impor-
tant (big-size) nodes. These two groups have many supporters in common; both
supported Mousavi (the defeated candidate) in the election.
7 Conclusion
In this paper we studied the social network among users of Twitter who were
interested in the Iranian presidential election and its aftermath. By analyzing the num-
ber of users who signed up to Twitter in different months and days, we saw that
the restrictions that the Iranian government put on the media during the protests moved
interested people to online social media and social networks in order to diversify their
sources of information. Some activists used these media to organize their protests
and to communicate with the outside world for help and sympathy. Meanwhile, some
small groups tried to abuse this opportunity and advertise their websites by sending
spam.
On the other hand, using a customized version of the TF-IDF method, we extracted
the trending keywords from the tweets published on each day. The results showed
a strong relationship between the published tweets and the events occurring on
each day.
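As a hedged sketch of such per-day trending-keyword extraction (not the exact customization used in the paper), one can treat each day's tweets as a single document and score a term by its frequency that day against its rarity across days:

```python
import math
from collections import Counter

# Toy corpus: each day's tweets concatenated into one document.
days = {
    "jun-20": "protest protest election square",
    "jun-21": "election results results",
}

tf = {d: Counter(text.split()) for d, text in days.items()}
df = Counter(t for counts in tf.values() for t in counts)  # document frequency
n_days = len(days)

def tfidf(day, term):
    # Smoothed idf, so terms seen on every day still get a finite score.
    return tf[day][term] * math.log((1 + n_days) / (1 + df[term]))

trending = max(tf["jun-20"], key=lambda t: tfidf("jun-20", t))
print(trending)  # protest
```

Terms that occur on every day ("election" here) score zero, so day-specific vocabulary surfaces as trending.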
The top URLs that appeared in the tweets showed that social networking and social
media websites were the most influential websites. We also observed that the two big
communities (Persian- and English-speaking users) helped communicate the news,
events, and messages abroad, and vice versa. We also took a look at some
sub-communities and found that some of them, although a small minority, were
very prolific yet less influential, while other sub-communities were dispersed in the
network, were followed by many other users, and were more influential. In the future,
we would like to investigate the spread of information in the network and find out
how content might affect the rate at which tweets spread.
Acknowledgments We would like to thank Kaveh Ketabchi for his help in collecting the dataset.
References
1. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they
blog. In: Proceedings of the 3rd international workshop on link discovery. ACM, pp 36–43
2. Albrecht S, Lübcke M, Hartig-Perschke R (2007) Weblog campaigning in the German Bun-
destag election 2005. Soc Sci Comput Rev 25(4):504–520
3. Bastian M, Heymann S, Jacomy M (2009) Gephi: an open source software for exploring
and manipulating networks. ICWSM 8:361–362
4. Drezner DW, Farrell H (2008) Introduction: blogs, politics and power: a special issue of Public
Choice. Public Choice 134(1–2):1–13
5. Jacomy M, Heymann S, Venturini T, Bastian M (2011) ForceAtlas2, a graph layout algorithm
for handy network visualization. Paris, p 44. http://www.medialab.sciences-po.fr/fr/
publications-fr
6. Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval.
J Doc 28(1):11–21
7. Kelly J, Etling B (2008) Mapping Iran's online public: politics and culture in the Persian
blogosphere. Berkman Center for Internet and Society and Internet and Democracy Project,
Harvard Law School
8. Ketabchi K, Asadpour M, Tabatabaei SA (2013) Mutual influence of Twitter and postelection
events of Iranian presidential election. Procedia Soc Behav Sci 100:40–56
9. Khonsari KK, Nayeri ZA, Fathalian A, Fathalian L (2010) Social network analysis of Iran's
green movement opposition groups using Twitter. In: 2010 International conference on
advances in social networks analysis and mining (ASONAM). IEEE, pp 414–415
10. Koop R, Jansen HJ (2009) Political blogs and blogrolls in Canada: forums for democratic
deliberation? Soc Sci Comput Rev 27(2):155–173
11. McKenna L, Pole A (2008) What do bloggers do: an average day on an average political blog.
Public Choice 134(1–2):97–108
12. Qazvinian V, Rassoulian A, Shafiei M, Adibi J (2007) A large-scale study on Persian weblogs.
In: Proceedings of LinkKDD
13. Tumasjan A, Sprenger TO, Sandner PG, Welpe IM (2010) Predicting elections with Twitter:
what 140 characters reveal about political sentiment. ICWSM 10:178–185
14. Zhou Z, Bandari R, Kong J, Qian H, Roychowdhury V (2010) Information resonance on
Twitter: watching Iran. In: Proceedings of the first workshop on social media analytics. ACM,
pp 123–131
Entanglement in Multiplex Networks:
Understanding Group Cohesion
in Homophily Networks
Abstract The analysis and exploration of a social network depends on the type of
relations at play. Homophily (similarity) relationships form an important category
of relations linking entities whenever they exhibit similar behaviors. Examples of
homophily networks examined in this paper are: co-authorship, where homophily
between two persons follows from having co-published a paper on a given topic;
movie actors having played under the supervision of the same movie director; and
members of an entrepreneur network having exchanged ideas through discussion threads.
Homophily is often embodied through a bipartite network where entities (authors,
movie directors, members) connect through attributes (papers, actors, discussion
threads). A common strategy is then to project this bipartite graph onto a single-type
network. The resulting single-type network can then be studied using standard tech-
niques such as community detection or by computing various centrality indices. We
revisit this type of approach and introduce a homogeneity measure inspired by past
work by Burt and Schøtt. Instead of considering a projection of the bipartite network, we
consider a multiplex network which preserves both entities and attributes as our core
object of study. The homogeneity of a subgroup depends on how intensely and how
equally interactions occur between layers of edges giving rise to the subgroup. The
measure thus differentiates between subgroups of entities exhibiting similar topolo-
gies depending on the interaction patterns of the underlying layers. The method
is first validated using two widely used datasets. A first example looks at authors
of the IEEE InfoVis Conference (InfoVis 2004 Contest). A second example looks
at homophily relations between movie actors that have played under the direction of
the same director (IMDB). A third example shows the capability of the methodology
to deal with weighted homophily networks, pointing at subtleties revealed from the
analysis of weights associated with interactions between attributes.
1 Introduction
The analysis and exploration of a social network depends on the type of relations at
play. Borgatti [7] proposed a taxonomy organizing relations into four possible
categories, among which homophily (also referred to as similarity) links actors
exhibiting similar attributes such as membership in a club or interest group [28].
These types of ties do not represent actual social ties themselves, but might lead
to a higher probability that a tie develops between members sharing similar
attributes. Examples are networks of co-authors, where homophily between two per-
sons follows from co-authorship; networks of movie actors having played under the
supervision of the same director; or networks of members having exchanged ideas
through discussion threads, for instance.
The second type of ties are social relationships (such as friendship), usually spanning
over time. The third type captures joint interactions observed through discrete events,
such as calling each other or travelling together. The last type of ties describes flows
(tangible or intangible) between entities (migrants moving between places, air traffic
passengers between airports, etc.). This paper focuses on networks induced from
homophily relations.
Homophily is often embodied through a bipartite network where entities (authors;
movie actors; members) connect through attributes (papers; directors; discussion
threads). Guillaume and Latapy [17] advocate bipartite graphs as being universal
models for complex networks, hence offering additional motivations to use of these
graphs to describe homophily relations. Indeed attributes of different natures can
be also seen as another type of entities interacting together across the edges of the
homophily network.
When dealing with bipartite graphs, a common strategy is to project them onto a
single-type network with entities of the same type. Edges are sometimes weighted based
on how much entities interact through attributes. The resulting single-type network
often tends to have high edge density, with a propensity to contain cliques (depending
on the affiliation data used to build the bipartite graph) [17]. It may nevertheless be
studied using standard techniques such as community detection using edge density,
or the computation of various centrality indices.
Such a study of the bipartite projection can however hide subtle characteristics
of the original data, since the projection can create relationships that do not exist
(Fig. 1), hence inducing many cliques that may not be relevant. Many different
attributes can also generate such cliques, as illustrated in Fig. 3. One option, from [29],
is the computation of a one-mode projection from the most significant edges, but it
still presents a loss of information.
Entanglement in Multiplex Networks . . . 91
Fig. 1 A side effect of the bipartite projection: we start from a multiplex network (on the left)
associating entities (nodes) through different attributes (colored edges), then convert it into a bipartite
network (middle) with the right-hand (round) nodes corresponding to adjacent edges in the first
network, and finally project the bipartite network onto another network (right). We can observe the
appearance of new edges. Note that the right multiplex network could be considered as an entity-
similarity multiplex network
Our paper contributes an approach designed to help users evaluate the reliability
of a proposed group structure. Because similarity between entities is most often mea-
sured based on co-occurrences of attributes, we provide a means to simultaneously
work on two networks derived from the original homophily multiplex network or
bipartite graph: one directly linking entities, and the other directly linking attributes.
The notion of a group we consider here depends on the context: it may be a cluster
computed from any algorithm, a subset of entities selected by a user, or the result of
a query on a network, for instance.
This paper extends our previous ASONAM publication [34]. Our work
contributes one node index and two multiplex network measures, computed on
any group of entities, indicating the overall cohesion of the group measured through
the intensity and homogeneity of the interactions of their co-occurring attributes (that
is, the entanglement of the multiplex network). We extend this approach to weighted
interactions. By exploring the network, selecting a group or a subset of co-occurring
attributes, and getting feedback on internal entanglement, analysts can validate the
model implicitly supported by the grouping procedure.
Our method has been validated based on three different datasets, among which
the first two are widely known and used. A first example looks at authors of the
IEEE InfoVis Conference (InfoVis 2004 Contest) [19]. A second example looks at
homophily relations between movie actors that have played under the direction of
the same director (IMDB) [40]. Our third example examines the Edgeryders community
forum [39] where homophily emerges from discussion threads. This last example
shows the capability of our methodology to deal with weighted homophily networks,
pointing at subtleties revealed from the analysis of weights associated with interac-
tions between attributes.
Related work. Bipartite graphs form an important modeling tool in social network
analysis, supporting two-mode concepts [5]. They form an important analytical arti-
fact to study homophily relations [13], and were even claimed as universal mod-
els for complex networks [17]. The literature covers a wide variety of approaches
dealing with different properties of bipartite graphs and homophily networks. An
optional but common strategy consists in projecting the graph, inducing relation-
ships between entities of the same type (see [6, 20, 30, 33, 36, 42], for instance),
with the obvious disadvantage of producing lots of cliques, the relevancy of which
can be questioned [14]. Neal [29] recently introduced an approach computing a
one-mode projection from the most significant edges based on local likelihood. Latapy
et al. [25] propose to study, in a bipartite network, the neighborhood overlaps of a node
so that the network would stay connected even without it. Fujimoto et al. [16] studied
network autocorrelation in bipartite networks as a way to measure the influence of
nodes of one mode on the formation of edges in the opposite mode. Other research
also focuses on finding bicliques (such as in [5, 32]), which can be suspected to form
cohesive subgroups. Little work has yet been proposed for the study of multiplex
networks; we can mention the efforts of [10, 24] to bring a mathematical
formulation of multiplex networks with tensors, although this effort is not focused
on direct use and applications of multiplex networks.
Because of their wide applicability and because they also offer a straightforward
graphical representation of the data, bipartite graphs have been recently used in the
design of a website traffic analysis system [11]. Finally, Kaski et al. [22] studied
homophily in gene networks (similarity in gene expressions) in bio-informatics with
emphasis on the trustworthiness of similarities, which places it close in spirit to our
work.
This section takes a closer look at homophily networks and describes the general
framework we use.
As we shall see, cohesion of a group is easier to achieve with smaller groups.
Inspecting a group, in an effort to understand why and how cohesion is embodied
in the group, certainly requires validation based on user knowledge. This only
makes sense when conducted on small-scale groups, gathering hundreds of nodes at
most.
Simple questions come to mind when inspecting a group, such as: How can
we assess that a group really forms a cluster? How can we make sure all entities of
a cluster really belong to it? Should we suspect the group to contain marginal
(outlier) entities? What are the attributes that tie the entities together?
A central ingredient we use to answer these questions is a set of metrics that
capture the homogeneity and intensity of interactions between attributes associated
with entities. These metrics can be viewed as an aid to assess the internal cohesion
of a group.
Fig. 2 The initial data in this example is formed of authors with associated keywords A, B, C, . . .
(a) (e.g. keywords indexing papers). This situation is modeled as a bipartite network linking authors
to keywords (b) (authors having published papers with given keyword, see Sect. 3.3). We then
consider the projected author interaction network with keywords as multiple edges (c) from which
we derive a keyword interaction network (d)
secondary node set represents the different layers of interactions. Hence, two other
networks are derived from this entity-attribute network, namely an entity interaction
network G_A and an attribute interaction network G_B. The entity network is usually
built from the entity-attribute network by projecting paths a–b–a′ (linking entities
a, a′ ∈ A through attribute b ∈ B) onto an edge a–a′ directly linking entities.
We also need to store the attribute b as a label for the edge a–a′. Edges in G_A are
thus labelled by subsets of attributes (all attributes b, b′, . . . collected from triples
a–b–a′, a–b′–a′, . . .).
Because we are focusing on entity group cohesion and on attribute co-occurrence,
we filter out some of the edges. Loops are discarded to obtain the entity interaction
network G_A = (A, E_A). The resulting network is shown in Fig. 2c.
Note that, in the case of a multiplex network such as an author co-publication
network, the entity interaction network is defined by the multiple relationships across
authors. Going through the bipartite model would imply direct relationships across
authors that are not expected as detailed in Fig. 1. The construction of the entity
interaction network remains the same.
Links in the attribute interaction network G_B = (B, E_B) are built from attributes
b that co-occur at least once with another attribute b′ (through at least two entities).
That is, there must exist at least two paths a–b–a′ and a–b′–a′ to infer the edge
b–b′ in E_B. Note that this network is not obtained by projecting paths b–a–b′
onto edges b–b′. For instance, E_B does not contain edges connecting attributes that
only concern a single entity. The resulting network is shown in Fig. 2d. The attribute
interaction network is a central artifact in studying group cohesion.
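The construction of G_A and G_B can be sketched as follows, on a toy entity-attribute list (the entity and attribute names are illustrative, not those of Fig. 2):

```python
from itertools import combinations

# Toy entity-attribute pairs (a bipartite graph as an edge list).
pairs = [("a1", "C"), ("a1", "D"), ("a2", "C"), ("a2", "D"), ("a3", "E")]

attrs = {}  # entity -> set of its attributes
for a, b in pairs:
    attrs.setdefault(a, set()).add(b)

# Entity interaction network G_A: an edge a--a' labeled by the attributes
# the two entities share; loops never arise since a != a'.
E_A = {}
for a, a2 in combinations(sorted(attrs), 2):
    shared = attrs[a] & attrs[a2]
    if shared:
        E_A[(a, a2)] = shared  # edge label = subset of attributes

# Attribute interaction network G_B: b--b' whenever both attributes
# label the same entity edge (i.e., co-occur through at least two entities).
E_B = set()
for labels in E_A.values():
    E_B |= {tuple(sorted(p)) for p in combinations(labels, 2)}

# E_A has one edge a1--a2 labeled with {'C', 'D'}; E_B has one edge C--D.
# Attribute E concerns a single entity, so it induces no edge in E_B.
```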
Figure 3 underlines the nuance we wish to bring into the analysis of homophily
networks. Consider entities (depicted here as pale blue squares) with attributes
A, . . . , E; entities are linked by an edge whenever they share an attribute. Observe
that in both situations the pairwise distance between entities is the same (any
two entities share either one or two attributes), ending in identical topologies of the
entity network G_A. As a consequence, based on pairwise distance, these two
groups are somehow equivalent.
Now, consider the attribute networks (with circle nodes) derived from these two
situations. In the first situation (Fig. 3a), all entities having attribute A gives this
attribute a central position: if there were a reason explaining why these people form
a group, it would certainly rely on the group gathering around A, the other attributes
being somehow accessory. The second situation (Fig. 3b) is much more balanced
(although attributes do not mix as intensely as they could). This small example points
at situations where the analysis may be misled when solely inspecting the single-type
people network. The attribute interaction network is actually key to understanding
how attributes interact within a group.
As these simple examples show, the inspection of a group of entities with
associated attributes raises several questions. It might be important to know whether
attributes equally map to all entities in the group, for instance. Conversely, a
misleading transitivity effect may be suspected to take place. Indeed, we may
have attributes b, b′ co-occurring between entities a and a′, and attributes b′, b″
co-occurring between entities a′ and a″, which may lead one to believe that b, b′, b″
96 B. Renoust et al.
simultaneously co-occur between all three of a, a′, a″. Although such a case can easily
be spotted when only considering a few entities and attributes, the transitivity effect
rapidly becomes confusing as the number of entities and attributes increases.
We address this issue by looking at how well attributes mix within a group. This is
accomplished using the entanglement index introduced in the forthcoming sections.
This index is computed for each attribute (or layer) b, measuring how homogeneously
and intensely an attribute co-occurs with all other attributes in a group of entities. As
we shall see, global entanglement homogeneity and intensity at the group level can
then be computed from the individual attribute entanglement indices. The definition
of the entanglement index makes it so that optimal homogeneity is reached whenever
attributes have the same entanglement index, that is when all entities have the exact
same associated attributes, and that all attributes equally co-occur within entities;
and the optimal intensity is reached whenever all entities share exactly all attributes.
Denote by n_{b,b′} the number of edges in E_A carrying both attributes b and b′,
and by n_{b,b} the number of edges in E_A carrying the attribute b. The matrix N_B
collecting all these n_{b,b′} entries gives rise to another matrix C_B filled with ratios
c_{b,b′} = n_{b,b′} / n_{b′,b′}. The value c_{b,b′} may be viewed as the (conditional)
frequency that an edge is of type b given that it is of type b′. We give c_{b,b} another
definition, namely c_b, the proportion of edges carrying attribute b among all N edges
in G_A = (A, E_A), such that c_b = n_{b,b} / N.
Consider the example in Fig. 2. Starting from authors a ∈ A having published
papers with keywords b ∈ B (attributes), we build a bipartite graph where authors
a, a′ link through keywords b whenever a and a′ have co-authored a paper with
keyword b (Fig. 2b). A single-type graph is obtained by inducing edges between
authors labeled with keywords (Fig. 2c). The resulting keyword interaction network
is shown in Fig. 2d. The matrices N_B and C_B (built over keywords C, D, E and L)
then read:

        | 3 3 1 0 |          | 0.75 1.00 0.33 0.00 |
  N_B = | 3 3 1 0 |    C_B = | 1.00 0.75 0.33 0.00 |
        | 1 1 3 1 |          | 0.33 0.33 0.75 1.00 |
        | 0 0 1 1 |          | 0.00 0.00 0.33 0.25 |
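A minimal sketch of the computation of N_B and C_B from the labeled edges of G_A, following the definitions above (the toy edge labels below are illustrative, not the data of Fig. 2):

```python
# Attribute labels carried by each edge of the entity network G_A (toy data).
edges = [{"C", "D"}, {"C", "D"}, {"C"}, {"E"}]
B = sorted({b for labels in edges for b in labels})
N = len(edges)  # total number of edges in E_A

# n[b][b2]: number of edges carrying both b and b2 (n[b][b] = edges with b).
n = {b: {b2: sum(1 for ls in edges if b in ls and b2 in ls) for b2 in B}
     for b in B}

# c[b][b2] = n[b][b2] / n[b2][b2] off the diagonal; c[b][b] = n[b][b] / N.
c = {b: {b2: (n[b][b] / N if b == b2
              else (n[b][b2] / n[b2][b2] if n[b2][b2] else 0.0))
         for b2 in B} for b in B}

print(n["C"]["D"], c["C"]["D"], c["C"]["C"])  # 2 1.0 0.75
```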
We now wish to compute the entanglement index for each attribute, measuring
how much an attribute b contributes to the overall cohesion of an entity group. This
notion of cohesion is inspired by Burt and Schøtt's work on relation content in
multiple networks [9].
Denote by λ the maximum value among the entanglement indices λ_b of attributes
b ∈ B. In other words, the entanglement index of attribute b is a fraction of λ,
namely λ_b = γ_b λ with γ_b ∈ [0, 1]. The entanglement value of an attribute b
is reinforced through interactions with other highly entangled attributes. Having a
probabilistic interpretation of the matrix entries c_{b,b′} in mind, we can thus postulate
the following equation, which defines the values γ_b:

  λ γ_b = Σ_{b′ ∈ B} c_{b′,b} γ_{b′}     (1)

The vector γ = (γ_b)_{b ∈ B}, collecting the values for all attributes b, thus forms a right
eigenvector of the transposed matrix C_B^T, as Eq. (1) gives rise to the matrix equation
λ γ = C_B^T γ. The maximum entanglement index λ thus equals the maximum
eigenvalue of the matrix C_B.
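Since λ and γ are the leading eigenvalue and eigenvector of the transposed matrix, they can be approximated by power iteration. The sketch below uses a small illustrative matrix, not the paper's example:

```python
# Toy c[b][b'] matrix over two attributes (symmetric, entries in [0, 1]).
C = [[0.75, 1.00], [1.00, 0.75]]
Ct = [[C[j][i] for j in range(len(C))] for i in range(len(C))]  # transpose

# Power iteration: repeatedly apply Ct and renormalize by the largest entry,
# so gamma converges to the leading eigenvector and lam to the eigenvalue.
gamma = [1.0] * len(C)
for _ in range(100):
    nxt = [sum(Ct[i][j] * gamma[j] for j in range(len(C)))
           for i in range(len(C))]
    lam = max(nxt)
    gamma = [x / lam for x in nxt]

print(round(lam, 2), [round(g, 2) for g in gamma])  # 1.75 [1.0, 1.0]
```

Here the two attributes play symmetric roles, so γ has identical entries and λ < |B| = 2 reflects a non-clique situation.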
The actual entanglement index values λ_b are of lesser interest; we are actually
interested in the relative γ_b values. Furthermore, we shall see how the entanglement
vector γ and eigenvalue λ can be translated into network measures to help understand
entanglement in a group of entities. Computing the entanglement indices for our
example's attributes, notice that two of the indices are equal, corresponding to
keywords C and E.
Optimal homogeneity is reached when all the values γ_b coincide. That is, all
attributes indeed contribute, and they all contribute equally to the overall entity
group cohesion. The Perron-Frobenius theory of nonnegative matrices [12, Chap. 2]
further shows that λ = |B| is the maximum possible value for an eigenvalue of a
non-negative matrix with entries in [0, 1].
The Perron-Frobenius theorem holds for irreducible matrices, that is, when the graph
G_B is connected. Hence, the connected components of G_B = (B, E_B) must be
inspected independently. When the matrix C_B is irreducible, the theory of non-negative
matrices tells us that it has a maximal real positive eigenvalue λ ∈ ℝ, and that the
corresponding eigenvector has non-negative real entries [12, Theorem 2.6]. We hereafter
assume G_B is connected so that C_B is irreducible.
Inspired by the clique archetype of an optimally cohesive entity group, we
wish to measure the entanglement at the entity group level. We already know that the
eigenvalue λ is bounded above by |B|, so the ratio I = λ / |B| ∈ [0, 1] measures how
intensely interactions take place within the entity group. This ratio thus provides a
measure for the entanglement intensity I among all entities with respect to the attributes
in B. From our previous example, I = 0.31, denoting a low interaction across attributes.
We also know that the clique situation with equal c_{b,b′} matrix entries leads to an
eigenvector with identical entries. This eigenvector thus spans the diagonal space
generated by the diagonal vector 1_B = (1, 1, . . . , 1). This motivates the definition of
a second measure providing information about how homogeneously entanglement
distributes among attributes. We may indeed compute the cosine similarity

  H = ⟨1_B, γ⟩ / (‖1_B‖ ‖γ‖) ∈ [0, 1]

to get an idea of how close the entity group is to being optimally cohesive. We will
refer to this value as the entanglement homogeneity H. From our previous example,
H = 0.91, denoting a relatively homogeneous but not optimal distribution of
entanglement indices.
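Given λ and γ, the two group-level measures follow directly. The λ and γ values below are hypothetical (chosen only to illustrate the computation, not taken from the paper's example):

```python
import math

# Hypothetical eigenvalue and eigenvector over |B| = 4 attributes.
lam, gamma = 1.24, [1.0, 0.9, 0.9, 0.2]

# Intensity: eigenvalue relative to its upper bound |B|.
I = lam / len(gamma)

# Homogeneity: cosine similarity between gamma and the all-ones vector.
H = sum(gamma) / (math.sqrt(len(gamma)) *
                  math.sqrt(sum(g * g for g in gamma)))

print(round(I, 2), round(H, 2))  # 0.31 0.92
```

The low fourth entry of γ pulls H below 1: one attribute contributes much less than the others to the group's cohesion.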
A thorough study of the entanglement indices and of the homogeneity and intensity
network indices is outside the scope of this paper (see [35]). Other measures, including
Shannon entropy [38] and Guimerà's participation coefficient [18], offer interesting
alternatives to cosine similarity.
That is, n_{b,b} equals the sum of the weights of the edges e ∈ E_A bearing attribute
b ∈ B, and n_{b,b′} equals the sum of the weights of the edges bearing both attributes
b and b′. Because we need to preserve the probabilistic interpretation of the c_b and
c_{b,b′} values, we further set:

  c_b = n_{b,b} / w(E)     (5)

As a consequence, Eq. (5) may be interpreted as the probability that an edge bears
attribute b, and Eq. (4) may be interpreted as the conditional probability that an edge
carries b knowing that it already bears b′. Observe that considering equal weights
w_e = 1 for all edges e ∈ E coincides with the non-weighted version introduced in
the previous section. Using the newly defined quantities c_{b,b′}, we may still define the
entanglement index through the matrix equation (Eq. (1)).
Note that, unless we filter out edges using a threshold on weights, the shape of
the attribute interaction network remains the same in both situations, weighted and
non-weighted.
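A sketch of this weighted variant, with edge counts replaced by weight sums as above (the weighted edges are toy data):

```python
# (attribute labels, weight) for each edge of the entity network G_A.
edges = [({"C", "D"}, 2.0), ({"C"}, 1.0), ({"D"}, 1.0)]
wE = sum(w for _, w in edges)  # total weight w(E)

def n(b, b2):
    # Sum of weights of edges bearing both b and b2 (b == b2: edges with b).
    return sum(w for labels, w in edges if b in labels and b2 in labels)

c_C = n("C", "C") / wE                    # probability an edge bears C
c_C_given_D = n("C", "D") / n("D", "D")   # edge carries C knowing it bears D

print(c_C, round(c_C_given_D, 2))  # 0.75 0.67
```

Setting every weight to 1.0 recovers the unweighted counts of the previous section.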
3 Case Studies
The case studies we describe in this section aim at showing how the entanglement
indices, and the homogeneity and intensity indices of networks help users explore
social networks and reason about the homophily content. Navigating the network
and getting feedback about these indices, users can question the structure of the
space that binds entities together. The examples are designed to highlight different
aspects of the exploration, each time underlining how the indices contribute to a better
understanding of the group structure of the homophily network. As the examples will
show, the entanglement methodology was embedded in a visual analytics environment
providing sound interactions to help users flexibly select subgroups. While users get
immediate visual feedback about the entanglement values at play, the environment
also allows them to explore the networks and enquire about homogeneity by easily
hopping between the entity and attribute networks.
Roughly speaking, the knowledge users gain after applying a grouping proce-
dure (clustering, community detection) is that a group of entities shares a list of
attributes. This is where the entanglement index enters the scene. What does a list
of attributes really mean? Do all entities share all attributes? Do entities more or
less split between attributes? What particular attribute(s) make(s) the split explicit?
In other words, users must be able to elucidate to what extent, and possibly how/why,
the group of entities form a more or less cohesive unit.
Our first use case focuses on the IMDB network [40] gathering movie directors
linked through the movie actors they have directed. Our second use case focuses on an
author/keyword network extracted from the InfoVis 2004 Contest [23]. Our third use
case introduces a user/topic network from a study of the Edgeryders community [39].
All use cases illustrate how the entanglement index, together with the network
homogeneity and intensity, can be used in a visual social network analytics context.
3.1 IMDB
This first use case is built from the Internet Movie DataBase, a largely used
dataset [40]. Auber et al. [2] had visualized a small world subset of the IMDB
co-acting graph. Starting from a small set of star movie actors, we have extracted
the corresponding movie directors to form a bipartite network where movie directors
connect to movie actors they have directed. Applying our methodology we compute
(i) a movie director network (entities), where two directors connect when the set of
movie actors they have directed (attributes) share at least two actors, together with
(ii) the corresponding movie actor interaction network. The data may thus be used
to find cohesive subgroups of movie directors, those whose artistic signature rely on
similar movie casts.
This first example gathers 15 actors and 16 directors (see Fig. 4). A low inten-
sity and a medium homogeneity, together with a loosely connected actor interaction
network topology, suggest that actors and directors roughly split into two commu-
nities. The director network has a medium homogeneity that corresponds to a quite
balanced distribution of actors among them. Homogeneity is not optimal: the direc-
tors did not individually direct each of these actors although, as a group, they did
direct all of them. The low values of the network-level measures readily indi-
cate the need to dig further into the network and try to nuance the cohesion of
this group. Roughly speaking, the low intensity follows from the fact that most directors
have directed only a small number of actors relative to the whole set.
As can be seen from Fig. 4 (bottom), the two communities of actors are connected
through Robert Duvall, and the two communities of directors are connected through
Sidney Lumet. Apart from Robert Duvall, the bottom right community of actors is
formed around Marlon Brando, Al Pacino, Jeremy Irons, Jack Nicholson, etc. The
top left community of actors is formed around Sharon Stone, Harvey Keitel, Samuel
L. Jackson, Leonardo DiCaprio, Meryl Streep, etc. Clearly, there is a generation
gap between these two communities of actors, with Robert Duvall filling the gap, just
as Sidney Lumet does in the director network.
The community of actors located in the top left part of the panel corresponds to a
different group of directors (connecting to the previous group through Sidney Lumet).
It gathers Spike Lee, Jim Jarmusch, Martin Scorsese, Woody Allen and others. This
community has a similar intensity but a higher homogeneity when compared to the
overall network. This means these actors have equal influence within this group and
better capture altogether the artistic signature of these directors as a group.
The upper left subgroup in the director network (see Fig. 5) actually divides into
three overlapping cliques. Two cliques reach maximal homogeneity and intensity
(the exact same actors have all played under their direction). The third clique (Bruce
Beresford, Jim Jarmusch, Barry Levinson, and Sidney Lumet), selected in the top
panel of Fig. 5, focuses on Ellen Barkin and Sharon Stone. It has lower homogeneity
and intensity indices: they don't mix that well with the other actors.
This use case thus underlines that, although a group involves a well-identified
and distinct set of attributes (movie actors), the cohesion of the considered
group may rely on only a subset of these attributes. Additionally, group cohesion
must not solely rely on the topology of the projected single-type network obtained
from the original bipartite network.
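To illustrate the last point, a minimal sketch of the one-mode projection shows how the projected actor network discards which directors support each edge; the tiny incidence data below is hypothetical, not taken from the IMDB dataset.

```python
from itertools import combinations

# Hypothetical director -> actors incidence (a tiny stand-in for the IMDB data).
incidence = {
    "director_1": {"actor_a", "actor_b"},
    "director_2": {"actor_b", "actor_c"},
    "director_3": {"actor_a", "actor_c"},
}

def project_entities(incidence):
    """One-mode projection: connect two actors whenever at least one
    director has directed both.  The projection keeps the topology but
    forgets WHICH directors (attributes) supported each edge."""
    edges = set()
    for actors in incidence.values():
        for u, v in combinations(sorted(actors), 2):
            edges.add((u, v))
    return edges

projected = project_entities(incidence)
print(projected)
# The projection is a full triangle on the three actors, yet no single
# director directed all three: exactly the nuance the projected
# single-type network cannot express.
```

This is why the cohesion analysis keeps the attribute side of the bipartite network instead of relying on the projection alone.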
Fig. 4 IMDB: directors appear on top; the actors' interaction network is displayed at the bottom.
Selecting a group of directors highlights the corresponding actors, with node size mapped to their
entanglement index. This group of directors shows low homogeneity and intensity. We can clearly
see that the distribution of actors is unbalanced, partly because Sharon Stone plays by far the most
central role in the interactions between directors: the directors have all, at some point, directed her
Fig. 5 A group of directors (top) and the corresponding actors they co-directed (bottom, highlighted),
with node size mapped to their entanglement index. This clique of four directors shows higher
homogeneity and intensity than the group selected in Fig. 4
The second question often helps to narrow down results from the first question.
Given these questions, it is then straightforward to propose the two corresponding
Boolean operators:
\[
\mathrm{OR} : V_B \to V_A \quad\text{with } B' \subseteq V_B,\quad \mathrm{OR}(B') = \bigcup_{b \in B'} \mu^{-1}(b) \subseteq A,
\]
\[
\mathrm{AND} : V_B \to V_A \quad\text{with } B' \subseteq V_B,\quad \mathrm{AND}(B') = \bigcap_{b \in B'} \mu^{-1}(b) \subseteq A,
\]
where \(\mu^{-1}(b)\) denotes the set of entities linked to attribute \(b\).
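Under the reading that OR selects the entities linked to at least one attribute of the selection, while AND selects those linked to all of them, the operators reduce to plain set operations; the sketch below is illustrative and the names are not the authors' implementation.

```python
# Minimal sketch of the OR / AND selection operators, assuming the bipartite
# data is given as an "inverse image" map from each attribute b to the set of
# entities linked to it (data and names here are hypothetical).
inverse = {
    "topic_1": {"user_4", "user_10", "user_64"},
    "topic_2": {"user_10", "user_64", "user_857"},
}

def op_or(attrs, inverse):
    """Entities linked to AT LEAST ONE selected attribute (union)."""
    result = set()
    for b in attrs:
        result |= inverse[b]
    return result

def op_and(attrs, inverse):
    """Entities linked to ALL selected attributes (intersection)."""
    sets = [inverse[b] for b in attrs]
    return set.intersection(*sets) if sets else set()

print(op_or(["topic_1", "topic_2"], inverse))   # union of both user sets
print(op_and(["topic_1", "topic_2"], inverse))  # users present in both
```

The AND operator is the one used later in the Edgeryders use case to retrieve users active on every topic.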
Our second example concerns data of a different nature, where keywords (attributes)
link to authors (entities), showing that the notion of entanglement can actually apply
to a wide variety of application domains.
We selected a subset of the InfoVis 2004 Contest dataset gathering papers
published at the IEEE InfoVis symposium over the period 1994–2004 [23]. The
data we consider are authors indexed by keywords gathered from papers they pub-
lished. We thus compute a bipartite graph where authors link to keywords. To some
extent, with respect to Borgatti's taxonomy of relations [7], this network could be
Entanglement in Multiplex Networks . . . 105
Fig. 6 The InfoVis 2004 Contest data gives rise to a keyword interaction network (bottom) coupled
with an author social network (top). The three selected authors hold a central position in the social
network (top). Their co-publications cover a wide spectrum of topics, as shown by the clique of keywords
in the bottom image. Entanglement measures, although good, are however not optimal: they did
not pairwise co-publish on all these topics. We may indeed suspect that each of them has distinct
co-authors in the network
sub-communities seem to address the topics portals and data visualization located at
the bottom left of the keyword network. Grasping these two keywords, we find that
they solely concern Woodruff and Olston. Leapfrogging the selection to Woodruff
and Olston, we then see the additional topics these two authors have in common.
Observe that, logically, these topics are marginally positioned with respect to the
main clique (Fig. 7, top).
This second use case pointed at fully cohesive subgroups where authors have
co-published papers on the exact same topics. This also suggests that the analysis
may be conducted either from the actor (author) network or from the attribute (keyword)
network. Going back and forth between these two perspectives seems a fruitful
strategy to get the most out of the entanglement index and the dual GA/GB
representation.
A full comparison with the results of the InfoVis 2004 Contest would require an
extended study of the whole dataset. Many of the presented results emphasized
trends over the 10-year period observed, which is why we focused here on
a smaller excerpt from the results of [23]. In our use case, instead of presenting
quantitative results over the different authors, we have presented specificities of the
authors' relationships.
We also applied the widely used Louvain clustering algorithm [4] to the excerpt,
which returned three communities (see Fig. 8). The first community regroups Kuchinsky,
Landay, Wang Baldonado and Woodruff; it clearly presents two disconnected
components in the attribute interaction graph, suggesting two sub-communities
within.
within. The second community regroups Allen, Chen, Paxson, Su, Taylor and Wis-
novsky, with I = 0.82 and H = 0.91 suggesting unbalanced collaborations as
we discussed previously. The third community regroups Chu, Ercgovac, Lin, Olston,
Spaldin and Stonebraker, with optimal values I = 1 and H = 1, confirming
the cohesion of this community. Finally, even if Louvain has returned fairly cohesive
communities, the entanglement analysis suggests digging for more specific interactions,
particularly in the case of disconnected components across attribute relationships.
Comparing entanglement measures with known measures can also be challenging.
Since they are computed on a multiplex network, they do not directly correspond to
measures on either traditional single-type networks or bipartite networks. We will assume that we
have the two separate entity interaction and attribute interaction networks.
Hence, we can only compare entanglement intensity (I = 0.33) and homogeneity
(H = 0.72) with global entity interaction network measures such as density
(d = 0.48) and average clustering coefficient (cc = 0.91). A proper evaluation
would compare those measures over a large number of different networks with varied
characteristics. More interestingly, we can compare the entanglement indices with
node measures on the attribute interaction network as in Fig. 9, and confirm the
differences among these statistics.
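Two of the node measures compared in Fig. 9, degree and the local clustering coefficient, can be computed on a plain adjacency dictionary; the toy graph below is illustrative only and is not the InfoVis data.

```python
# Degree and local clustering coefficient on an undirected graph stored
# as an adjacency dict (node -> set of neighbours).  Illustrative data.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def degree(node, adj):
    return len(adj[node])

def clustering(node, adj):
    """Fraction of pairs of neighbours that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

print(degree("b", adj))      # 3
print(clustering("b", adj))  # neighbours a, c, d: only the pair (a, c) is linked
```

Betweenness centrality and PageRank, the two other panels of Fig. 9, follow the same per-node logic but require path enumeration and iteration respectively, which is why graph libraries are typically used for them.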
108 B. Renoust et al.
Fig. 7 Browsing around obvious sub-communities of authors, the keywords portals and data
visualization never pop up. Directly selecting them in the keyword network brings two co-authors
up front: Woodruff and Olston (top). Selecting these authors shows their common topics of interest
to be marginally positioned with respect to the main clique (bottom)
Fig. 8 Top: the three communities identified by the Louvain community detection algorithm. Bottom: the
disconnected attribute interaction network corresponding to the community in orange (Kuchinsky,
Landay, Wang Baldonado and Woodruff), suggesting that two sub-communities correspond to this
group
Fig. 9 Comparisons of the entanglement indices with traditional measures on the attribute interaction
network; for a better comparison, the different values have been normalized. Top left: betweenness
centrality. Top right: degree. Bottom left: PageRank. Bottom right: clustering coefficient. While no clear
correlation can be observed on this excerpt, the measures clearly display many differences
Although the above results do not qualify as a full scale quantitative evaluation of
the results of the entanglement analysis, they illustrate how the entanglement index,
homogeneity, and intensity stand out from traditional network measures (Fig. 9).
3.5 Edgeryders
This last use case presents a situation where our weighted model is particularly relevant,
and also brings forward how we can take advantage of the AND and OR operators.
We study here the Edgeryders community [39]. The data represents users
participating in discussion threads on various topics. Each topic corresponds to a
participation campaign led by the Edgeryders leaders; campaigns took place one
after the other. The topic 0 (Undefined) has been used for preliminary or out-of-scope
discussions. During each campaign (topics 1–9), the Edgeryders leaders designed
and implemented different policies to engage users in the debate.
Within the network, opinion leaders accordingly promote participation in the topics.
Participation in a topic is weighted for each user in terms of effort, measured as
the length of the text (number of words) produced in one piece of conversation. A topic
never closes, and users can participate in every topic by either starting a new thread
Fig. 10 The user interaction network (left): node size on users is mapped to their degree; notice that
a few nodes have very high degree (opinion leaders) while other nodes have very low degrees. The
topic interaction network (right): the network forms a clique, meaning that all topics pairwise inter-
act. The entanglement indices indicate, however, that topics 1, 2 and 4 concentrate most interactions
while topics 0, 5, and 8 only marginally interact with other topics
much effort has been mutually spent on different topics. Obviously, not considering
weights in this network leads to an incorrect interpretation of the network activity.
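The weighting scheme described above, where each user's participation in a topic is weighted by effort measured as the number of words contributed, can be sketched as follows; the posts below are hypothetical.

```python
from collections import defaultdict

# Hypothetical posts: (user, topic, text of the contribution).
posts = [
    ("user_4",  "topic_1", "a fairly long and engaged contribution here"),
    ("user_4",  "topic_2", "short reply"),
    ("user_10", "topic_1", "another thoughtful piece of conversation text"),
]

# Accumulate the effort weight (word count) per (user, topic) pair.
effort = defaultdict(int)
for user, topic, text in posts:
    effort[(user, topic)] += len(text.split())

print(effort[("user_4", "topic_1")])   # 7 words
```

These per-pair weights are what distinguish the weighted entanglement indices of Fig. 11 from their unweighted counterparts.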
We can easily retrieve the five leaders (the entity nodes of highest degree) by looking
at the collaborations that concern all topics (i.e. by selecting all topics with the
AND operator): users 4, 10, 64, 468, and 857. Leapfrogging to this selection
of users (see Fig. 12), we can take a deeper look at their mutual efforts. Intensity
and homogeneity are very high (0.76/0.95, against 0.14/0.94 in an unweighted
context), which we could expect from opinion leaders. They have worked together
Fig. 11 The two bar charts above help compare the entanglement indices from the weighted network
(right) and the non-weighted network (left). The comparison emphasizes how considering the
weights or not can have a strong impact on reading the relative entanglement indices. As can be seen, all
topics are assigned a different entanglement value (except for the topics with extremal values,
topics 1 and 5). The balance between entanglement indices does not radically change, but the
participation of each topic to the network's cohesion radically differs
homogeneously on all topics, except for topics 0 (Undefined, which is marginal) and 8
(Resilient, which was a concluding debate). Notice from the topic interaction network
in Fig. 12 that no interaction between these two topics emerged from the leaders, most
probably because those topics are indeed marginal.
Using the same process, we can now answer the Edgeryders leaders' questions.
We may process one topic at a time. Selecting a topic t, we retrieve the subset of
users who have participated in t. We may then identify other topics they have mutu-
ally participated in (which could be related to the corresponding policy campaign).
A variety of facts can be extracted:
topic 3 and topic 7 clearly dominate the mutual efforts of contributors;
closer examination reveals strong ties between topic 1 and 2;
topics 0 and 8 gather a majority of users who have also pairwise co-participated
in other topics;
users who participated in topic 5 developed similar efforts across all other topics.
The use case we have just presented thus demonstrates how weights can be integrated into
our framework to offer a finer interpretation of cohesion and entanglement indices. It
also highlights how the use of the OR and AND operators between the two networks
GA and GB can help narrow reasoning over the network when the topology is not
sufficient to understand its structure.
This paper addressed the issue of assessing cohesion in groups from homophily
networks mixing entities and attributes into a multiplex view of a bipartite net-
work. Our approach considers splitting the multiplex network into two single-type
Fig. 12 A first selection of all topics (left) highlights the five most influential users (middle).
Leapfrogging to these users lets us understand how they have been mutually collaborating on the
different topics (right). Note that the first selection, made using the AND operator, returns the lowest
intensity and homogeneity values (0/0), since no pair of users has contributed together to all topics.
This underlines the need to leapfrog the selection, since we still have five users who have each contributed
to all topics. Notice that, except for topics 0 and 8, they have all contributed equally. Notice also the
absence of a highlighted edge between topics 0 and 8, indicating that no pair of the selected users have
both contributed to those topics together
networks used in conjunction when analyzing the homophily relations between entities.
To answer this question, we have defined entanglement, a notion of how
attributes intertwine entities' edges. We have measured entanglement indices on
attributes, together with the homogeneity and intensity indices computed on any
subset of entities.
These attributes can be used to question the cohesion of a group of entities,
where optimal cohesion requires that entities simultaneously involve the exact same
attributes, and maximum intensity occurs when entities cover all available attributes.
A group of lower or unbalanced entanglement indeed requires more careful analysis,
and typically leads to the discovery of subgroups or regions locally showing higher
entanglement. An entanglement-based search of the networks often leads to the identification
of outlier entities that can then be discarded or, on the contrary, brought forward
to understand the network activity. A close examination of the attribute interaction
network also helps the identification of core attributes from which entities form a
cohesive unit.
The case studies clearly show the relevance of questioning the attribute
entanglement of entities to potentially confirm the community structure derived from
edge density, for instance. They focused on small-size examples for the sake of readability.
This limitation is only apparent, as using the interaction network occurs after
entities have been indexed and grouped. Although a query might return hundreds (or
thousands) of entities, we may expect the grouping procedure to form much smaller
groups before closer examination occurs. We also suspect that larger samples gather
larger attribute sets, typically leading to less tangled attribute interactions and less
cohesive entity groups.
Our second case study suggests our approach applies to other types of networks
modeled using a bipartite graph, namely interaction relations. The initial comparative
results encourage us to extend our approach to the study of multivariate networks.
Indeed, the entanglement measurement actually considers a multiplex network
of interacting entities A, with attributes B corresponding to families of edges.
Our third use case has brought forward the important nuance of taking weighted
entity interactions into account. We are exploring possibilities to further extend the
ways we can incorporate weights in our model, and then fully embrace the weighted
multiplex model, possibly with the help of De Domenico et al.'s formulation [10]. For
example, entities of type B may not be equal (some may weigh more than others),
and the interaction through a same entity of type B across two different pairs of
entities of type A may weigh differently. These are design choices we suspect may
depend on the nature and/or on the size of the dataset and the questions our users are
seeking answers for.
These structures being rather complex to manipulate, the use cases we have shown
underline the increase in usability when our approach is embedded in a visual and
interactive environment. The interactions we have used enable a quick back-and-
forth search in the data, putting users as close as possible to their own questions on
the original data.
Further studies would cover optimized implementation and performance studies,
with comparative results on a larger number of networks and measures. Further work
also includes examining strategies to automatically identify entity and attribute subsets
with optimal (or maximum) homogeneity and/or intensity, suggesting potential areas
of interest in the network under study. These problems, however, will inevitably bring
us to combinatorial optimization problems, and we may expect to have no choice but
to rely on heuristics to avoid typical algorithmic complexity issues.
Acknowledgments We would like to thank the European project FP7 FET ICT-2011.9.1
Emergence by Design (MD), Grant agreement no. 284625.
Abstract New analysis dimensions in social network analysis tend towards more
realistic social graph models, feeding new studies and interesting phenomena. Based
on a dynamic or semantic dimension, more meaningful and informative results can be
harvested. A social network can be dominated by a core region, depending on centralized
or decentralized information sharing, the lifetime of social interactions, and even the
orientations developed by network actors. This underlying social structure is
addressed by the question raised in this paper, which aims to strengthen the significance
of a core identity through the dynamic behavior or the semantic character of
collectivities. The temporal dynamic aspect is formalized first, through a topological
dynamic model seen as an evolutionary process. The aim is to find a resistant grouping
playing a central role, describing a first identity for a core's infrastructure in time.
The semantic aspect is proposed as a strengthening element for this identity. In this
study, we propose that the feeling of belonging, issued topologically from the durability
of such a grouping, allows an implicit semantic nature to be deduced. However,
the study shows that the diversity of interactions or the interests of actors in a richer
static semantic model make the semantic character of such a region more explicit.
In this paper, we address the identity of a core structure significantly expressed
through an elite grouping of individuals, between the topological dynamic and the
static semantic: internally, through the durability of the collectivity and its common
implicit or explicit semantic character; externally, from its strategic positioning on
the communication flows in time, or by semantically approximating different
semantic regions in the network.
1 Introduction
in time by creating or deleting relations with others. This has a direct impact on its
positioning in the network and equally on its probable affiliation to one or more
social groupings. This fact is one of the reasons why the overall structure of a social
graph is determined by structures at the local level. The temporal change is in
fact driven by many factors influencing the corresponding actor's behavior. Such
factors may have semantic origins (a semantic dimension), including the causality
of connections, the inclination to socialize (influenced by social media tools),
relationship types, interests, etc. Accordingly, the temporal dynamic behavior of
social entities, or the semantics they involve, constitute an informational richness
that our contribution exploits in order to significantly characterize and strengthen
a possible underlying core identity.
A core region composed of a subset of individuals will be considered in this
paper as a structure inherited from a grouping of individuals. The internal cohesion
of the group is inherited first. Topologically, a core identity will capture a particular
dynamic behavior of a group in time. On the other hand, a common and salient semantic
character should be expressed explicitly by such a region in a static semantic
configuration. The study of this underlying structure requires meta-models of SN(s),
in which the dynamic and semantic aspects are processed separately (a dynamic
model and a semantic model). First, the temporal dynamic is modeled on a topological
map so as to identify the infrastructure of a core region in time. Thus, an SN sample
evolving in time, linking a set of company employees (the Enron Company), is
modeled in the form of a development process of groups. It links SN imprints (modular
configurations) through parameters of composition stability and centrality of groups
across time steps. By finding a covering path, we target a durable (resistant)
grouping playing a central role in time, encapsulated inside the process. It constitutes
a first structural identity significantly characterizing a core grouping during
the observation period. On the other side, with the aim of characterizing a
semantically strengthened core identity, we believe that this particular internal dynamic
of such a grouping of individuals can be implicitly issued from a semantic orientation
of the individuals in this topological model. Here, our attention is focused on the
durability phenomenon of a collectivity and how it can reflect a deep feeling of
belonging among its members. In a second step, however, we adopt a higher
abstraction level in another social representation in order to investigate an explicit
semantic character for a collectivity, and then for a probable core region. In this
context, complexity reasons require us, for now, to process the explicit semantics
without the temporal dynamic aspects. Therefore, a richer static semantic model
(an RDF graph) of an SN is addressed, based on some ontological conceptions. Our
semantic considerations are based on the expressivity degree, which depends on the
networked environment and the availability of information. Two case studies will be
considered through richer explicit representations modeling static imprints of two
different SN(s). In the first case, the semantic information focuses on the relationship
types within a collaborative learning environment. Here, the conception of a semantic
graph model (RDF graph) is carried out. Thereafter, we propose a mapping approach
showing how to exploit the expressivity degree without increasing the computational
cost on this RDF social
122 B. Hamadache et al.
2 Related Work
(group centrality). However, some group centralities [11] are intuitively computed
based on external individuals, while notions like the boundary (faster information
sharing inside) cannot be considered without, for example, a modular configuration
of the network.
These are static analytical studies, in which false or misleading information can be
harvested due to an underestimation or overestimation of the cohesion or centrality of
groupings qualified as a core structure. Social interactions change continuously
and consequently generate a natural temporal dynamicity of an SN, in the form of a
development process in time. This can be caused by an endogenous dynamic context
resulting from the simultaneous influence between behavior and relation changes
among network actors [3, 7, 13, 19, 24]. The observed changes can equally be provoked
by some external events: for instance, the change in Twitter (an increase in the number
of new accounts) during the 2009 election events in Iran [19]. Accordingly, an individual's
affiliation, its role, and then that of its group are affected over time (chronological
affiliation to groups [23]). In this paper, the identity of a core grouping can be more
significantly expressed on a temporal dynamic dimension focusing on the evolution
of collectivity behavior. This calls for studying the network configuration in groups
over time, and requires a proper formalization in order to reveal some interesting
phenomena (group durability and development). Different partitioning derivatives of
a dynamic network are essentially obtained by threading a community discovery over
a sequence of network imprints in time [7, 27, 31]. It should be noted that there are
different interpretations of the group or community concept in time (a latent concept);
even in the literature there is no complete agreement on its definition. In addition,
many related measures, notably modularity (high internal connectivity in a group
versus low external connection), have been extended in time. However, even in recent
efforts, a core structure is considered neither in terms of collectivity behavior nor in
its role as a grouping of individuals within the temporal dynamicity of an SN.
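The modularity notion mentioned above (high internal connectivity in a group versus low external connection) can be made concrete with Newman's formula, sketched here in plain Python on a toy undirected graph; the edges and community assignment are illustrative only.

```python
# Newman's modularity: Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * delta(c_i, c_j),
# computed on an edge list with a node -> community assignment.
def modularity(edges, community):
    """edges: list of undirected (u, v) pairs; community: node -> group id."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    # Fraction of edges that fall inside communities...
    for u, v in edges:
        if community[u] == community[v]:
            q += 1.0 / m
    # ...minus the fraction expected if edges were wired at random.
    for u in degree:
        for v in degree:
            if community[u] == community[v]:
                q -= degree[u] * degree[v] / (4.0 * m * m)
    return q

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
print(round(modularity(edges, community), 3))  # 0.22
```

Temporal extensions of modularity, as surveyed above, apply this kind of quality score to each network imprint of the sequence.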
On the other side, social structures are more and more complex, evolving within
multiple contexts. A social entity can develop different relationships, activities, roles,
identities across multiple applications, and interests. This is a context
where heterogeneity is generated (heterogeneous SNs) [10]. For example, social
tagging is a phenomenon resulting from labeling activities that use tags to express
interests (folksonomies, another source of SNs: interest networks). However, the
analytical studies in SNA are generally structural, applied to simple non-typed graph
representations, and studies surrounding core structures are no exception.
The informational richness in SNs can be exploited to obtain generally more significant
results (a semantization of SNA). It is in this sense that we target in this paper a significant
core identity, based not only on the temporal dynamic but also giving a semantic
dimension to this identity. Initially, a semantic SN model is required for exploiting
the expressed richness, in order first to give a semantic dimension to a grouping
of individuals. Semantic web technologies are currently seen as well adapted as
an additional step for improving the quality of SN representations. Depending
on the expressivity degree, the social data can be semantically structured using typed
graphs: the Resource Description Framework (RDF). These are descriptive graphs based
on concepts defined as primitives of ontological models. According to the information
availability, the expressivity degree can be increased. Using, for example, the FOAF
(friend of a friend) ontology, primitives can describe the user account (social entity)
and its basic relations (foaf:knows). The RELATIONSHIP, SIOC and SKOS concepts
are more extended and expressive, describing respectively more specialized relations
(rel:worksWith, rel:friendOf), published contents, and social tagging (tags, with specialization
or generalization relations between tags: skos:narrower, skos:broader).
Thereby, the analytical studies can be enriched on richer models by parameterizing
statistical and individual measures (centralities, diameter, geodesics, etc.). On the
other hand, it will be very interesting to find the semantic nature of a group, from which the
semantic character of a core grouping can be inspired in this paper. For example, when
additional information such as tags is available, a typed graph (RDF) can semantically
model relations between users, user-tag, and tag-tag (a structured folksonomy) [9].
This has been based on several ontological models used together in [9]. Accordingly,
group connectivity has been proposed to be strengthened by the same tag
shared between its members (a labeled community or interest community) [9], through a
proposed iterative approach, SemTagP (semantic tag propagation) [9]. Moreover,
the collectivity spirit has been expressed by the semantic links among tags [9]. Thus,
increasingly specialized thematic areas have been identified through communities
labeled by tags representing related topics [9].
However, the semantic processing requires exploiting the richness of RDF graphs,
which is itself a challenge in SNA. It should be noted that tools and operators
(e.g. SPARQL, the query language for RDF data [10]) are limited for analyzing RDF graphs
while respecting the analysis requirements and their topological complexity (centrality measures,
community detection, etc.). Even though there are attempts at extensions (by adapting
queries on the path notion [9]), the related resolution (the number of projections on the
graph, i.e. matching on RDF triples) consumes long computational time; treatment
phases (e.g. in the previously cited approach) therefore tend to be more expensive.
Moreover, even if such semantic social representations can enrich static analytical
studies, the dynamic aspect is not considered. Furthermore, if the nature of a collectivity
can be strengthened by a semantic character, this is not yet clear for a grouping of
individuals semantically qualified as a probable core structure inside an SN. The
temporal dynamic behavior must be in the foreground to characterize a core identity in a
topological context, before moving to a higher abstraction level allowing a possible
semantic strengthening afterwards.
network actors, affecting their positioning and affiliations. Thereby, the dynamic of
the collectivity behavior is more or less influenced. At this level, the pace of change
is not uniform, but it is proposed to be captured through parameters that
should be significantly ordered (Fig. 2).
Persistence is established when the network observation period is covered by
a stable composition. When a group preserves its composition of individuals during
a time period, its role in centrality terms will be more realistic, based on the links of
boundary individuals with the outside. A core region should be the resistance point,
embodied by a persistent stable group retaining all of its composition (a subset of linked
individuals) against the network's temporal dynamicity. Once a stable structure is
conserved, its collective influence in terms of group centrality can be investigated on
the sequence of traces in time. A central role played by a larger stable composition in time
will be a first good characterization of the identity of a core structure inside an SN
on a dynamic dimension. This should be supported by an explicitly formalized
network model.
During an observation period divided into time points, the network connectivity and its
centralization or decentralization vary. Therefore, different network modular configurations
can be obtained in time. Thereby, the SN imprints in time will be considered
in the proposed model as a structure of groups at each time step. Hence, a temporal
weighted graph (TWG) is formalized by linking a sequence of these network
imprints. This is an evolutionary process model [12] in which the vertices are cohesive
groups resulting from the network partitioning at each time point. The model arcs
are locally created to link the imprints between two successive time points. Each
arc links exclusively two groups A, B belonging respectively to two
successive partitions (PTi, PTi+1) and having a non-empty overlap. It is based on
a kind of successive temporal overlap, considered as a grouping of individuals
locally retaining its composition between two successive time points. Thereby,
the weight of each arc is defined as
An Elite Grouping of Individuals for Expressing a Core Identity . . . 127
Fig. 3 Layered architecture deeply encapsulating a characterized core identity inside a modeled
evolutionary process of an SN
\[
W(A, B) = |A \cap B| \cdot \frac{GC_{T_i}(A \cap B) + GC_{T_j}(A \cap B)}{2}, \qquad j = i + 1,\; i = 1 \ldots t - 1 \tag{1}
\]
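As a concrete illustration, Eq. (1) can be sketched in code; note that the group-centrality function `gc` below is a hypothetical stand-in (normalized group size), not the GC measure used in this chapter, and the two groups are illustrative.

```python
# Sketch of Eq. (1): two groups A and B from successive partitions are linked
# when they overlap, and the arc weight combines the overlap size with the
# average group centrality of the overlap at the two time points.
def gc(group, network_size):
    # HYPOTHETICAL stand-in for GC_Ti: normalised group size.
    return len(group) / network_size

def arc_weight(A, B, network_size):
    overlap = A & B
    if not overlap:
        return None        # no arc without a successive temporal overlap
    # |A n B| * (GC_Ti(A n B) + GC_Tj(A n B)) / 2, with gc standing in for GC
    return len(overlap) * (gc(overlap, network_size) + gc(overlap, network_size)) / 2

A = {"u1", "u2", "u3"}     # group in partition P_Ti
B = {"u2", "u3", "u4"}     # group in partition P_Ti+1
print(arc_weight(A, B, network_size=10))   # 2 * (0.2 + 0.2) / 2 = 0.4
```

In the actual model, the two GC terms would generally differ, since the overlap's centrality is evaluated in each imprint separately.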
The covering sequences in this model will be targeted. The aim is to find the heaviest
sequence of groups (a critical path) covering the observation time points. It is a narrower
context where the weights W(A, B) are generally maximized between successive
time points. In other words, it includes a succession of temporal overlaps
locally maximizing the combination of local stable composition and centrality:
a succession of larger and more central overlaps. In this sequence, persistence
is only ensured if an overall stable composition is encapsulated inside. This
configuration is schematized in a layered architecture, in which the deepest level is
expressed by a persistent grouping of individuals in time. Accordingly, a core character
is determined by this persistent structure, with a particular identity deeply inherited
from the higher layers: from the larger and central overlaps expressed on the heavy arcs
of the model. Thus, the core region is clearly identified as an underlying structure,
deeply determined by a central and persistently stable grouping of individuals in
time according to this architecture.
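The covering-path search described above can be sketched as a dynamic program over the layered sequence of imprints; the layers, groups, and weight function below are illustrative stand-ins for the chapter's actual data and arc weights.

```python
# Heaviest covering path through a layered sequence of group partitions.
def heaviest_path(layers, weight):
    """layers: list of lists of groups (one list per time point);
    weight(a, b): arc weight between groups of successive layers, or None.
    Groups are sets (unhashable), so the DP table is keyed by object id."""
    best = {id(g): (0.0, [g]) for g in layers[0]}
    for prev, cur in zip(layers, layers[1:]):
        nxt = {}
        for g in cur:
            candidates = []
            for p in prev:
                w = weight(p, g)
                if w is not None and id(p) in best:
                    total, path = best[id(p)]
                    candidates.append((total + w, path + [g]))
            if candidates:
                nxt[id(g)] = max(candidates, key=lambda t: t[0])
        best = nxt
    return max(best.values(), key=lambda t: t[0]) if best else None

# Hypothetical three-imprint example: the persistent overlap {1, 2} wins.
layers = [[{1, 2, 3}, {4, 5}], [{1, 2}, {4, 6}], [{1, 2, 7}]]
w = lambda a, b: float(len(a & b)) if a & b else None
total, path = heaviest_path(layers, w)
print(total)   # 2 + 2 = 4.0
```

The persistent grouping, here the pair {1, 2} present in every group of the winning path, is what the layered architecture extracts as the core's infrastructure.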
The resistant character and the strategic role played are used to draw a core identity
inside a topological representation of collectivities' behavior in an SN evolving in time.
This is the infrastructure of a core region in time, whereas its corresponding semantic
character is not yet addressed. Usually, the semantics are related to an informational
richness issued from the actors animating the SN and the context in which they
evolve. Although the temporal information on the SN dynamic is topologically
represented, it can be equally important for giving a semantic signification to
some phenomena, particularly when it is well formalized in a temporal dynamic
model (such as the TWG). The semantic orientation of collectivities will be addressed in
the next phase. We show how it can be implicitly inspired from the internal dynamic,
or made explicit following a higher expressivity degree in a richer SN representation. This
will be beneficial for semantically strengthening the signification of a core identity
at a higher abstraction level.
Fig. 5 Mapping from a semantic model (RDF graph) to a directed labeled graph preserving the same
expressivity
level of the learner is the primary objective, while the positivity of an actor (a learner) to
collaborate (to socialize), and then its cognitive level, are influenced.
In a graph mining context, SNA studies usually face complexity
problems: the computation of path-based centralities and the discovery of communities
and underlying structures are already complicated on topological representations.
Thereby, by directly analyzing this RDF social graph, the analysis complexity will
probably be increased. In addition, it should be noted that tools treating the RDF
graph while meeting complex analysis requirements are limited. Accordingly, a
mapping approach is proposed towards an equivalent graph representation (directed
labeled graph) preserving the same expressed semantic richness. The type of relations
(CS/CA) in the RDF graph will be preserved by the labeling function on arcs in
the target representation. Between two actors, the arc orientation is exploited to
distinguish the domain and the range (the trigger of the collaboration) of the RDF property
(describing the collaborative interaction). The aim of such a mapping is to reduce
the complexity of the following studies (e.g. on the semantics of groups). We target a less
expensive processing depending on the expressed richness degree (Fig. 5). Thereby,
the individual analysis measures will be parameterized and different strategic positions
can be detected according to the relation type. A semantic detection of groups
becomes possible. Each collectivity can be distinguished by ensuring an internal
connectivity through the same link type. At the same time, individuals can be affiliated
to one, two or more different groups (each one having a different type of connectivity),
consequently creating overlapping zones. This intersection grouping semantically
reflects a kind of core region having not only an intermediary central role but also a
semantic positioning between various links (approximating different communities).

Fig. 6 A grouping of individuals sharing the same tag that is semantically the most related with
other tags
On the other side, the collectivity spirit cannot be based only on the connectivity
and the distinction of its type between a subset of individuals. The collectivity spirit
in a group can be more explicitly strengthened through orientations expressed by the
social entities. This will require a higher expressivity degree in a SN representation
(semantic model). A richer model will be based on the available informational richness:
on these orientations and activities. In OSN applications, the network is more
explicit and the actor orientations can be announced as interests by tags. This is the
social tagging phenomenon, which means that a set of actors describe a set of objects
with a set of tags.
The tags can be semantically related. In this case, the semantic information is not
limited to the diversity of relationships between actors but equally concerns the tag
use (user-tag) and the links between tags. A more enriched semantic model of the SN will
be required for structuring not only the individual-individual relations but also,
semantically, the resulting individual-tag and tag-tag links. Intuitively, an interest
community can be formed by actors sharing the same tag (Fig. 6). The collectivity identity
acquires a semantic character (a common interest) but it should not be topologically
deprived of its connectivity. The internal cohesion between a (densely linked) subset of
individuals is primordial for qualifying it as a semantic group (sharing the same tag).
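The idea of keeping a shared tag as an interest community only when its members are also densely linked can be sketched as follows. The function name, the density threshold and the data layout are illustrative assumptions, not the chapter's method.

```python
# Hypothetical sketch: forming interest communities from social tagging.
# Actors sharing a tag are candidate members; the group is kept only if its
# members are also densely linked (internal cohesion).

from collections import defaultdict
from itertools import combinations

def interest_communities(tagging, edges, min_density=0.5):
    """tagging: dict actor -> set of tags; edges: set of frozenset pairs.
    Returns tag -> set of actors, for cohesive groups only."""
    by_tag = defaultdict(set)
    for actor, tags in tagging.items():
        for tag in tags:
            by_tag[tag].add(actor)
    groups = {}
    for tag, actors in by_tag.items():
        if len(actors) < 2:
            continue
        pairs = list(combinations(sorted(actors), 2))
        linked = sum(1 for a, b in pairs if frozenset((a, b)) in edges)
        if linked / len(pairs) >= min_density:  # internal edge density
            groups[tag] = actors
    return groups
```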
4 Experimental Results
points, this sequence is generally formed by groups linked by heavy arcs. Between
82 and 99 % of the heaviest arcs are covered (Fig. 8).
In other words, it covers a succession of temporal overlaps among these groups
(A_i ∩ A_j, ∀ A_i, A_j), within which a persistent structure should be deeply encapsulated:
N ⊆ A_i ∩ A_j (j = i + 1, i = 1 . . . 11). The succession contains heavily
weighted arcs maximizing, locally in time, the combination of parameters expressed in these
weights. We have found that when a subset of individuals survives (persists)
inside such a context (larger and more central overlaps), it can imitate its characteristics
with 95–97 % of subordination. This is a larger stable composition (deeper layer
(Fig. 9)) having a central role. Indeed, the groups forming the sequence are generally
the most central structures at each time point, and the centralities of their successive
overlaps are close. Consequently, the centrality of this persistent inner grouping
is generally higher in time (Fig. 9).
Therefore, beyond the internal cohesion, the crossing between the durability of a
larger collectivity and a strategically played role on the network communication flows
leads to an interesting identity. It deeply and significantly characterizes an infrastructure
of a possible underlying core region inside this topological temporal dynamicity
of the SN.
Secondly, a higher expressivity degree will be adopted in a second dataset in
order to show an illustrative semantic dimension without multiplying the analysis
complexity. We target a simple semantic character feeding the collectivity spirit and then
a possible semantization of a core identity inside a static picture of an emergent SN.

Fig. 8 Arc weights variation on the critical sequence compared to the heaviest arc between each
two successive time points

Fig. 9 Larger persistent grouping of individuals having a central role deeply imitated
A collaborative learning environment is another new source of computer-mediated
social interactions, because it tends to adopt a social collaborative mentality between
learners: Computer Supported Collaborative Learning (CSCL) [1, 2]. Here, increasing
the cognitive level of learners is the common individual objective. Thereby, the
collaborative social interactions within the learning communities are more oriented
than other social relationships (in social platforms). We can talk about a deeper
semantic aspect behind such interactions and explain it by the fact that the collaborative
act is acquired and constrained by the social skills of the learner and its positivity towards
collaboration. These elements are equally influenced by the nature of the collaborative
tools used. This means that the collaborative act is also semantically affected through
these tools (social media). Two types of interactions are distinguished in this paper:
synchronous and asynchronous collaboration. Accordingly, a semantic model of a
SN of collaborating learners is generated in the form of a typed RDF graph linking
20 learners. This RDF model is based on a simple ontological model describing
the nature of relationships (synchronous or asynchronous collaboration: CS or CA).
An experimental prototype is established (using the Java language) for a less
expensive semantization of some analytical studies in front of the expressed richness.
This is an analysis parameterization on these new traces. The prototype is
intended at first to apply the proposed mapping schema from the RDF data (collaborations
of learners) to a directed labeled graph. It must preserve and transmit the
same semantic information. By using some programming interfaces, RDF relational
data are extracted (using the JENA API) and regenerated through nodes and labeled
arcs. The nodes and arcs are in the form of objects (using JUNG: Java Universal
Network Graph API) able to capture the same expressivity (user profile, labels and
orientation of arcs: CS or CA) and forming the target graph to analyze (Fig. 10).
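The chapter's prototype performs this mapping in Java with the JENA and JUNG APIs; the sketch below only illustrates the idea on plain triples in Python. The property names CS/CA follow the chapter, while the function itself and the triple layout are hypothetical.

```python
# Hypothetical sketch of the mapping step: RDF-style triples describing
# collaborations become a directed labeled graph.  The property name (CS or
# CA) becomes the arc label; the RDF domain (the collaboration initiator)
# becomes the arc source and the range becomes the target.

def triples_to_labeled_graph(triples):
    """triples: iterable of (subject, property, object).
    Returns a list of arcs (source, target, label)."""
    arcs = []
    for subject, prop, obj in triples:
        if prop in ("CS", "CA"):  # keep only collaboration relations
            arcs.append((subject, obj, prop))
    return arcs
```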
Thereby, the analysis measures (centrality measures and even global indicators:
density, diameter, etc.) are parameterized, allowing the detection of different individual
strategic positions according to the relation type (Fig. 11).
For example, by normalizing the individual centralities (e.g. betweenness), the
most central actor on the synchronous collaborative communication flows, learner 18,
is not the same in the asynchronous case, where another actor (learner 14) plays the
most intermediary role. However, different central positions (learner 12 or learner
7) can be identified when the nature of interactions is not distinguished (non-typed
graph). The collaboration is a symmetric social interaction between two nodes, but
it is initiated by a collaboration request. This is supplementary semantic information
added to the relation semantics. It is modeled by the arc orientation, deduced from the
asymmetric RDF properties (domain/range). Hence, the initiator of the collaborative
interaction between two nodes (transmitter) can be identified (and even the receiver of this
request). Thereby, the analysis can be further enriched by refined measures. We can compute
the node prestige depending on the receptions or sendings of the collaboration requests which
are occurring.
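Such a relation-typed refined measure can be sketched very simply. Here a node's "prestige" is approximated by the number of collaboration requests it receives for a given relation type — an illustrative simplification, not the chapter's exact measure.

```python
# Hypothetical sketch: a measure parameterized by relation type.  In-degree
# per label acts as a crude "prestige": the number of CS (or CA)
# collaboration requests a learner receives.

from collections import Counter

def prestige_by_type(arcs, label):
    """arcs: list of (source, target, label) tuples."""
    return Counter(t for s, t, l in arcs if l == label)
```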
If the structural potentiality of a social entity varies depending on its relations and
orientations nature, the network connectivity is globally affected. The affiliation to a
Fig. 10 A directed labeled graph describing the nature of relationships (from a semantic social network)

Fig. 12 Semantic character of a group described by the same relation type linking its members, and
overlap with groups of a different type
5 Discussion
Fig. 13 Road to characterizing significantly an identity describing internally and externally a core
region through an elite grouping of individuals by bringing closer the structural dynamicity and the
static semantic richness of emergent SN models
level (e.g. the abstraction of the ontological model). The feeling of belonging can
be strengthened when individuals are involved with the same relation nature inside
a collectivity, or when the same interest (tag) is shared. The character of a core is
semantically proposed to be manifested by a region situated as an intersection zone
approximating different semantic identities of groups (e.g. between different relations
or interests): e.g. an interests center. Here, the topological internal connectivity
and central positioning of such a region is not preserved as an infrastructure. In contrast,
regardless of the expressed semantic richness, how to exploit it and the related
complexity, these semantic models are static social representations aggregating all
network links in a single representation exactly as they appear at the same time.
The temporal information, like the time ordering [25] of links and then their lifetime,
is not considered. Accordingly, a misleading identification of a core region can be
produced, following over-/underestimated parameters of connectivity, collectivity
spirit, group centrality quantification, etc. (Fig. 13).
6 Conclusion
Beyond the static and structural analysis framework, a possible core structure inside
a SN surviving in dynamic and richer contexts can be qualified as an elite grouping
of individuals. It should express an identity distinguished by two sides, significantly
characterized on new analysis dimensions: an internal identity based on an internal
cohesion between a subset of individuals evoking a particular dynamic behavior of
the collectivity (durability in time). This internal identity can be strengthened by a
united semantic orientation. The external face of this identity is topologically
determined from a strategic positioning in time or semantically by the crossing between
different semantic regions in the network.
It appears informative for feeding business strategies and decisions, homeland
security, for example studies on P2P networks [16], political networks, social movements,
epidemiology [20], and even investigations on illegal SNs hiding fraudulent
behaviors, crime, terrorism [20], etc.
In fact, the temporal and semantic aspects are separately modeled, through a
structural dynamic or a semantic static model respectively. A larger composition, deeply
resistant and playing a strategic role on the communication flows, distinguishes a
significant identity for a core's infrastructure. But it could still be refined when other
parameters (the centrality stability) are considered in time. It may even be informative
for answering the SN fragility issues in a dynamic context. However, such a
structure can be determined in a richer static representation, through a collectivity
sharing the same semantics, situated semantically as an overlapping zone between
different regions. This is one of the orientations towards a semantic core of a SN. On
the other hand, we can deduce some rapprochement signs between the two models.
The durability explains a particular internal dynamic guided by a feeling of belonging,
which illustrates some semantic character. In addition, the external identity is
imitated from overlaps, either inside larger and central successive temporal overlaps
or inside semantic overlaps. Such an analytical study on networks at larger scale, for
longer observation periods and a higher abstraction level, will be another challenge
in front of an increased complexity. Nonetheless, meta-models based on the fusion
between semantic and dynamic aspects lead to more expressive dynamic
models. This can be a further step towards characterizing a more significant identity
of an underlying core's structure inside a SN.
References
7. Berger-Wolf TY, Saia J (2006) A framework for analysis of dynamic social networks. In:
Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and
data mining, Philadelphia, pp 523–528
8. Borgatti SP, Everett MG (2000) Models of core/periphery structures. Soc Netw 21(4):375–395.
Elsevier
9. Ereteo G, Gandon F, Buffa M (2011) SemTagP: semantic community detection in folksonomies.
In: Proceedings of the 2011 IEEE/WIC/ACM international conferences on web intelligence
and intelligent agent technology, WI-IAT '11, vol 1, pp 324–331. ISBN: 978-0-7695-4513-4
10. Ereteo G, Gandon F, Buffa M, Corby O (2009) Semantic social network analysis. In: Proceedings
of WebSci '09: society on-line, 18–20 Mar 2009, Athens, Greece
11. Everett MG, Borgatti SP (1999) The centrality of groups and classes. J Math Sociol
23(3):181–201
12. Hamadache B, Seridi-Bouchelaghem H, Farah N (2013) Toward characterizing a more significant
identity of core structure within dynamic social network. In: 2013 IEEE/ACM international
conference on advances in social network analysis and mining (ASONAM 2013), Niagara Falls,
Canada, 25–27 Aug 2013
13. Jamali M, Haffari G, Ester M (2011) Modeling the temporal dynamics of social rating networks
using bidirectional effects of social relations and rating patterns. In: International world wide
web conference committee (IW3C2), WWW 2011, session: temporal dynamics, 28 Mar–1
Apr 2011, ACM, Hyderabad, India. ISBN: 978-1-4503-0632-4/11/03
14. Karolewski IP (2009) Citizenship and collective identity in Europe. Routledge advances in
European politics, Kindle Edition, pp 83–85. (Routledge, 24 August 2009, p 260)
15. Klimt B, Yang Y (2004) Introducing the Enron corpus. In: CEAS conference
16. Lathia N, Hailes S, Capra L (2008) kNN CF: a temporal social network. In: RecSys '08: proceedings
of the 2008 ACM conference on recommender systems (23–25 October 2008, Lausanne,
Switzerland). ACM, pp 227–234
17. Leskovec J, Lang K, Dasgupta A, Mahoney M (2009) Community structure in large networks:
natural cluster sizes and the absence of large well-defined clusters. Int Math 6(1):29–123
18. McGlohon M, Faloutsos C (2008) Graph mining techniques for social media analysis. In:
International conference on weblogs and social media (ICWSM), Seattle
19. Meeder B, Karrer B, Sayedi A, Ravi R, Borgs C, Chayes J (2011) We know who you followed
last summer: inferring social link creation times in Twitter. In: International world wide web
conference committee (IW3C2), WWW 2011, session: temporal dynamics, 28 March–1 April
2011, ACM, Hyderabad, India. ISBN: 978-1-4503-0632-4/11/03
20. Memon N, Alhajj R (2011) Introduction to the first issue of social network analysis and mining
journal, published online: 13 Nov 2010. In: SOCNET (2011), vol 1. Springer, New York, pp
1–2. doi:10.1007/s13278-010-0016-2
21. Memon N, Alhajj R (2011) Introduction to the second issue of social network analysis and
mining journal: scientific computing for social network analysis and dynamicity, published
online: 29 Mar 2011. Soc Netw Anal Min, vol 1. Springer, pp 73–74. doi:10.1007/s13278-
011-0022-z
22. Nettleton DF (2013) Data mining of social networks represented as graphs. Comput Sci Rev
7:1–34
23. Reda K, Tantipathananandh C, Berger-Wolf T, Leigh J, Johnson AE (2009) SocioScape:
a tool for interactive exploration of spatio-temporal group dynamics in social networks. In:
Proceedings of the IEEE information visualization conference (INFOVIS '09), 11–16 Oct 2009,
Atlantic City, New Jersey
24. Snijders TAB, Doreian P (2010) Introduction to dynamic social network analysis, introduction
to the special issue on network dynamics. J Soc Netw 32(1):1–3
25. Tang J, Musolesi M, Mascolo C, Latora V (2010) Characterising temporal distance and reachability
in mobile and online social networks. ACM SIGCOMM Comput Commun Rev 40(1):118
26. Tang J, Musolesi M, Mascolo C, Latora V, Nicosia V (2010) Analysing information flows and
key mediators through temporal centrality metrics. In: Proceedings of the 3rd workshop on
social network systems (SNS '10), 13 Apr 2010, ACM, Paris, France
1 Introduction
A social network is typically represented by a graph whose nodes are individuals and whose edges represent a kind of
social relationship. Likewise, a protein–protein interaction network can be modeled
by a graph whose nodes are proteins and whose edges indicate known physical interactions
between proteins.
An important feature of such networks is that they are generally composed of
highly interconnected sub-networks called communities [13, 30]. Communities can
be considered as groups of nodes which share common properties and/or play similar
roles within the graph. The automatic detection of such communities has attracted
much attention in recent years and many community detection algorithms have been
proposed (see [11] for a survey). Most of these algorithms are based on the maxi-
mization of a quality function known as modularity [25], which measures the internal
density of communities. Modularity maximization is an NP-hard problem [4] and
most algorithms use heuristics. However, even if the Newman-Girvan modularity
is predominant in the context of community detection, other quality functions have
been proposed over the years (see for example [20, 31, 35]) but they have been less
studied in this context.
In random graphs, however, links appear independently of each other, so a strong
inhomogeneity in the density of links on these graphs is not expected. Therefore,
random graphs should not have communities according to the previous definition. As shown
in [15], due to fluctuations, it is possible to find partitions with significantly high
modularity in random networks. A good community detection algorithm should
therefore be able to find communities when they are relevant, but also to indicate the absence
of community structure.
2 Consensual Communities
Following the works of Diday [8, 9] on consensual clustering of vectors, different
studies have proposed to adapt this method to graphs and to combine different partitions
into consensual communities. The common features of these methods consist in
(i) computing different partitions and (ii) combining these partitions to find similarities.
A consensual community is therefore a set of nodes which are frequently classified
in the same community through multiple computations. We will give a more formal
definition later on, mainly to specify the meaning of "frequently". The main reason
for using consensual communities rather than classical communities comes from the
fact that most techniques used to compute communities can usually provide more than
one solution. This may come from initial conditions of the algorithms, for instance
the random seed which is generally used in non-deterministic algorithms, or from
the fact that algorithms can depend on the numbering of the nodes, for instance if
they consider nodes in a given order. The landscape of the optimized function can
also be highly non-convex, leading to many local maxima. Given that there are many
local maxima which can be very similar in quality, even if they are structurally very
different, there is no reason to prefer one above another since they all can equally
measure the structure of the network. In the absence of a good way to choose one
partition among all, finding a consensual partition therefore seems to be the good
compromise.
Consensual communities can also provide a deeper insight into the structure of
the network, since they summarize many partitions and encode more information on
the structure. They can also erase the defects of each single partition. The classical
example consists of two cliques (complete graphs) C1 and C2 overlapping on some
nodes C = C1 ∩ C2. Any single run will classify the overlapping nodes of C either
with the nodes of C1 or with the nodes of C2, and none of these choices is better than
the other. However, when combining multiple executions, the fact that the nodes
of C belong both to C1 and C2 will clearly appear. For this reason, consensual
communities have already been used in the context of overlapping communities,
for instance in [33]. It has also been shown that consensual communities are more
resilient to modifications of the networks [28] and could therefore be more suitable
to study evolving communities in graphs.
Two main approaches are used to obtain different partitions. The first one consists
in disturbing a given network by rewiring a small fraction of links [17] or changing
148 R. Campigotto and J.-L. Guillaume
slightly the weights on links [12, 27]. The second one, that we are going to use
hereafter, consists in using the non-determinism of some algorithms to obtain differ-
ent partitions. For instance, the Louvain method [3] (among others) can give different
results depending on the order in which nodes are considered by the algorithm. This
has been used in [19, 29] to compute consensual communities and in [32] to com-
pute overlapping ones. A generic version of Louvain is under development, in which
different quality functions can be plugged [5].
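The combination step common to these approaches can be sketched as follows: compute the co-classification frequencies p_ij over several partitions, keep the pairs above a threshold as links of a "virtual" graph, and return its connected components as consensual communities. Function names and data layout are illustrative assumptions, not the exact method of the papers cited above.

```python
# Hypothetical sketch of the consensus step: given several partitions of the
# same node set, compute the co-classification frequency p_ij, keep the
# virtual links with p_ij >= threshold, and return the connected components
# of that virtual graph as consensual communities.

from itertools import combinations

def consensual_communities(nodes, partitions, threshold):
    p = {}
    for part in partitions:  # part: dict node -> community id
        for i, j in combinations(sorted(nodes), 2):
            p[(i, j)] = p.get((i, j), 0) + (part[i] == part[j])
    runs = len(partitions)
    # virtual graph: keep pairs classified together often enough
    adj = {n: set() for n in nodes}
    for (i, j), count in p.items():
        if count / runs >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    # connected components by depth-first search
    seen, comms = set(), []
    for n in nodes:
        if n not in seen:
            stack, comp = [n], set()
            while stack:
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    stack.extend(adj[v] - comp)
            seen |= comp
            comms.append(comp)
    return comms
```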
2.1 Definitions
2.2 Experiments
For our experiments and the proof hereafter, we will use three different quality
functions. First, the classical Newman-Girvan modularity function Q [25], which is
defined by
$$Q = \sum_{i,j \in V} \left( A_{ij} - \frac{k_i k_j}{2m} \right) X_{ij}, \qquad (1)$$
The Power of Consensus: Random Graphs Still Have No Communities 149
where $A_{ij}$ represents the weight of the edge between $i$ and $j$ (0 if $ij \notin E$),
$k_i = \sum_{j \in V} A_{ij}$ is the sum of the weights of the edges attached to node $i$, $X_{ij} = 1$ if $i$ and
$j$ are in the same community and 0 otherwise, and $m = \frac{1}{2} \sum_{i,j \in V} A_{ij}$.
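A direct transcription of this modularity function, using the definitions of A_ij, k_i, X_ij and m given in the text, can look as follows. The function name and data layout are illustrative; the sum is left unnormalized, which does not affect maximization.

```python
# Minimal sketch of the modularity function: sum over all node pairs of
# (A_ij - k_i k_j / 2m) for pairs in the same community.

def modularity(A, community):
    """A: dict-of-dicts adjacency weights; community: node -> community id."""
    nodes = list(A)
    k = {i: sum(A[i].values()) for i in nodes}  # weighted degrees
    m = sum(k.values()) / 2.0                   # total edge weight
    q = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] == community[j]:    # X_ij = 1
                q += A[i].get(j, 0.0) - k[i] * k[j] / (2.0 * m)
    return q
```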
Then, the balanced modularity function B [7], which takes into account both the
links inside communities and the non-links between communities, is defined as

$$B = \sum_{i,j \in V} \left( A_{ij} - \frac{k_i k_j}{2m} \right) X_{ij} + \sum_{i,j \in V} \left( \bar{A}_{ij} - \frac{(n-k_i)(n-k_j)}{n^2 - 2m} \right) \bar{X}_{ij}, \qquad (2)$$

with $\bar{A}_{ij} = W - A_{ij}$ the non-link between nodes $i$ and $j$ (where $W = \max_{i,j \in V} A_{ij}$)
and $\bar{X}_{ij} = 1 - X_{ij}$.
Finally, the deviation to indetermination function D [1, 16, 21] is defined as

$$D = \sum_{i,j \in V} \left( A_{ij} - \frac{k_i}{n} - \frac{k_j}{n} + \frac{2m}{n^2} \right) X_{ij}. \qquad (3)$$
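The deviation to indetermination can be transcribed in the same way (an illustrative sketch, with the same conventions and the same hypothetical data layout as the modularity sketch above):

```python
# Minimal sketch of the deviation to indetermination: sum over all node pairs
# of (A_ij - k_i/n - k_j/n + 2m/n^2) for pairs in the same community.

def deviation_to_indetermination(A, community):
    """A: dict-of-dicts adjacency weights; community: node -> community id."""
    nodes = list(A)
    n = len(nodes)
    k = {i: sum(A[i].values()) for i in nodes}
    m = sum(k.values()) / 2.0
    d = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] == community[j]:
                d += A[i].get(j, 0.0) - k[i] / n - k[j] / n + 2.0 * m / n ** 2
    return d
```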
1 Also, an execution takes less than one hour on a network with more than one billion nodes and
links.
Fig. 1 Consensual communities for Zachary's network using three different thresholds with
Louvain-Modularity. The shape of the nodes (circle/square) is the manual classification made
by W. Zachary. a Threshold = 0.32. b Threshold = 0.62. c Threshold = 1.00

node. On the contrary, with a threshold equal to zero, we have a single consensual
community (if the original graph is connected), and with a threshold below 0.5, we generally
have a giant consensual community containing the majority of nodes. When
Fig. 2 Average (left) and maximal (right) size of consensual communities versus threshold.
a Using Louvain-Modularity. b Using Louvain-Balanced. c Using Louvain-Deviation
Table 1 Number of nodes and number of links of the four networks used in this paper
Network Karate club Email Collaboration Internet
Number of nodes 34 1,133 13,861 22,963
Number of links 78 5,451 44,619 48,436
the threshold increases, this giant consensual community splits into smaller
consensual communities. But in the Internet or email network, even with a threshold equal to
1, we still have a large consensual community containing approximately 10 % of the
nodes (see Fig. 2). However, the decrease after the splitting of the single consensual
community up to a threshold of 1 is smooth.
This smooth decrease can also be understood through the study of the distribution
of the values inside the Pij matrix. Figure 3 shows the pij distributions for three
networks. We observe that while most pairs are nearly always separated and a fair
amount are always grouped together, there are also some pairs of nodes which are
sometimes together and sometimes separated. This explains why significant consensual
communities appear for a wide range of threshold values.

Fig. 3 pij complementary cumulative distribution for three real networks using Louvain-
Modularity
These results show that the notion of consensual communities makes
sense and that they can be used to detect different levels of communities with different
quality functions. We will now show that they can also be used to show the absence
of a real community structure in random graphs.
In random graphs, all pairs of nodes have the same probability of being connected.
Hence, they should not exhibit preferential binding inducing specific and identifiable
groups of nodes. Therefore, we could conclude that there is no community structure
in random graphs. However, several studies show that it is possible to find partitions
with high modularity in random graphs [15, 26]. Indeed, the concentration of links
fluctuates in generated graphs, which means that subsets of nodes with a density
larger than the global density can appear. The phenomenon is even more pronounced in
regular or quasi-regular graphs, like trees, torus or grid graphs, in which community
detection algorithms can also find partitions with good modularity [23].
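This fluctuation of link concentration is easy to observe directly. The hypothetical sketch below draws an Erdős–Rényi graph G(n, p) and measures the link density of random node subsets, which scatters around the global density p; subset sizes, seeds and function names are illustrative choices.

```python
# Hypothetical illustration: in an Erdős–Rényi G(n, p) graph every pair is
# linked independently, yet the density of random node subsets fluctuates
# around p, which modularity heuristics can mistake for communities.

import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Return the edge set of a G(n, p) graph as frozenset pairs."""
    return {frozenset(e) for e in combinations(range(n), 2) if rng.random() < p}

rng = random.Random(42)
edges = erdos_renyi(200, 0.1, rng)
densities = []
for _ in range(50):
    subset = rng.sample(range(200), 20)
    pairs = list(combinations(subset, 2))
    densities.append(sum(frozenset(e) in edges for e in pairs) / len(pairs))
# the subset densities scatter around the global value p = 0.1
```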
A good algorithm for community detection should indicate the presence or
absence of a community structure and recognize that in random graphs, the commu-
nities which are obtained are not real communities.
We will now show that random graphs do not exhibit any non-trivial consensual
community structure. For that, we will use two different random graph models:
the classical Erdős–Rényi model [10], which is used to mimic the number of nodes
and links only, and the configuration model [2, 22], which also respects the full
degree distribution. We will conclude this section with random graphs with known
community structure generated using the LFR benchmark [18].
First of all, Fig. 4 shows the distribution of pij values for Erdős–Rényi random graphs
with different values of the number of nodes and the average degree. We observe
a high concentration of pij at an average value (around 0.1 for large graphs using
realistic values of the average degree), which is very different from the distributions
observed on real graphs, where the maximum of the distribution is at the zero value
(see Fig. 3). We further observe in Fig. 4b that large values of pij appear. However,
the concentration of values increases both with the size of the network and with the
average degree, and these large values are therefore less and less frequent.
This concentration of values implies that even if partitions with a good modularity
can be found in random graphs, these partitions are very different from one another
since most pairs are classified in the same community only once every ten runs.
Therefore, no real similarities can be found.
To compare real and random networks more precisely, we generated random graphs
from the Erdős–Rényi model (resp. configuration model) that have the same size and
the same average degree (resp. the same degree distribution) as two real networks.
In Fig. 5, the Erdős–Rényi model shows no pair of nodes with pij = 0, which means
that all pairs of nodes have been grouped together at least once during 1,000 runs
of the Louvain algorithm, regardless of their position in the network. The same is
observed for the configuration model.
Conversely, there is nearly no pair of nodes which are always grouped together,
except for the leaves (nodes of degree 1) of the network, which are always grouped
with their only neighbor. This presence of nodes of degree 1 is very common with
the configuration model, since the real networks' degree distributions are power-law
shaped and therefore contain many nodes of degree 1. The same is observed for the
Erdős–Rényi model, since the real average degree is small and nodes of degree 1 are
not so uncommon in generated graphs. This explains the small increase observed
for the pij values around 1.
Furthermore, as predicted by the experiments on Erdős–Rényi random networks
(Fig. 4), the maximum of the values is around 0.1.
There are two direct consequences of this distribution: (i) for very low values of the
threshold, there is a single consensual community comprising all nodes, since there
is no value close to zero and therefore the virtual graph contains all links, and (ii) for
large values of the threshold, the virtual graph contains almost no links and therefore
high-threshold consensual communities are reduced to single nodes. Interestingly, in
random networks there is a sharp transition (see Fig. 6), at a threshold value around
0.4, between the situation where one single consensual community is present and the
intermediate threshold values where several consensual communities are present;
this transition is not present in real networks.

Fig. 4 Distribution of the pij averaged over 100 random Erdős–Rényi graphs (n is the number of
nodes). a Average degree 20 and different values for n. b n = 1,000 and different values of the
average degree
Fig. 5 pij distribution for two real networks together with Erdős–Rényi and configuration model
random graphs of the same size. a Email network. b Collaboration network
This phase transition cannot be directly deduced from the previous remarks;
we will provide more arguments hereafter to prove its existence.
Fig. 6 Average size of consensual communities versus threshold for a real network and two
random networks generated with the Erdős–Rényi and configuration models. a Using Louvain-
Modularity. b Using Louvain-Balanced. c Using Louvain-Deviation
To observe the transition from a graph with clear communities towards a random
graph, we used the four-groups test, a random graph with 4 communities
of 32 nodes [13], generated using [18]. Each node has 16 − x links towards its
community and x links outside. For x = 0, the graph is composed of 4 independent
random graphs with high density. Then, as x grows, the communities are less and
less defined and, for x ≈ 11.7, the graph is purely random. Above this value,
each node has fewer links towards its community than outside. Classical community
detection algorithms are very successful at identifying communities for small values
of x, up to 6 in general. Above 6, they start to fail in identifying the groups.
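As an illustration, the four-groups test can be approximated by a planted-partition random graph: wiring each node to its own group with probability (16 − x)/31 and to the rest of the graph with probability x/96 yields the expected intra- and inter-community degrees described above. The sketch below (plain Python; function and parameter names are our own, not from the authors' generator) builds such a graph:

```python
import random
from itertools import combinations

def four_groups_graph(x, groups=4, size=32, degree=16, seed=0):
    """Planted-partition sketch of the four-groups test: each node gets,
    in expectation, (degree - x) links inside its group and x outside."""
    rng = random.Random(seed)
    n = groups * size
    p_in = (degree - x) / (size - 1)   # 31 possible intra-group partners
    p_out = x / (n - size)             # 96 possible inter-group partners
    edges = set()
    for u, v in combinations(range(n), 2):
        same_group = (u // size) == (v // size)
        if rng.random() < (p_in if same_group else p_out):
            edges.add((u, v))
    return n, edges

n, edges = four_groups_graph(x=5)
avg_degree = 2 * len(edges) / n        # should be close to 16
```

For x = 0 this degenerates into 4 dense independent blocks, and increasing x gradually blurs the planted groups, matching the behaviour described in the text.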
Figure 7a shows the significance of consensual communities using Louvain-
Modularity. As we can see:
for x = 5, 4 groups of nodes are clearly identified in the range [0.16, 0.87[ and a
partition into 3 communities (one of 64 nodes and two of 32 nodes) is found in the
range [0.02, 0.16[;
for x = 6, a grouping into two communities (each containing 64 nodes) is obtained
in [0.05, 0.3[, then one of them is split into three communities in [0.3, 0.55[, and four
groups are obtained in [0.55, 0.6[;
for x = 7, three communities are identified in [0.26, 0.33[ and four are identified
in [0.33, 0.67[;
for x = 8, two groups are found in [0.44, 0.45[, three in [0.45, 0.5[ and four in
[0.5, 0.57[.
Note that these groups are not always the correct groups, since a few nodes can
be misclassified. We can see in Fig. 7b, c that these phenomena are similar for
Louvain-Balanced and Louvain-Deviation.
The main conclusion is that, as the graph becomes more and more random, the intervals
in which the communities (or merged communities) are found become narrower.
We recall that for a given threshold τ, τ-cores are defined as the connected components
of the weighted graph G whose adjacency matrix is P, in which we have deleted
weighted links with a value less than this threshold τ. In random graphs, we observe
that a small threshold gives one consensual community containing all the nodes of
the graph. Then, after a rapid phase transition (based on the choice of τ), we obtain
only trivial consensual communities, each containing a single node.
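This definition reduces to thresholding the matrix of p_ij values and taking connected components. A minimal sketch (plain Python, with the matrix stored as a list of lists; the toy values are illustrative only):

```python
def tau_cores(p, tau):
    """Connected components of the virtual graph whose weighted adjacency
    matrix is p, after deleting links with weight below the threshold tau."""
    n = len(p)
    seen, cores = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:                       # depth-first search
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if v != u and v not in seen and p[u][v] >= tau:
                    seen.add(v)
                    stack.append(v)
        cores.append(sorted(component))
    return cores

# Nodes 0 and 1 are grouped together in 90 % of the runs, node 2 rarely:
p = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.05],
     [0.1, 0.05, 0.0]]
```

With this example, a moderate threshold such as 0.5 isolates node 2, while a very low threshold such as 0.05 merges all three nodes into one core, mirroring the two regimes described in the text.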
We now give arguments to show the existence of this phase transition.
Throughout the proof, we use extensively the fact that the graphs are random and thus all
connections appear independently. The assumptions made in some cases may be related
to classical mean-field assumptions in statistical physics.
Fig. 7 Average size of consensual communities for a random network with 4 communities of 32
nodes, 16 links per node on average and a variable number of links pointing out of the community.
a Using Louvain-Modularity. b Using Louvain-Balanced. c Using Louvain-Deviation
Since we are considering random graphs, we can suppose that the nodes (and their
neighbors) in the input graph are similar. Thus, regardless of the community
detection algorithm used, a node will in expectation be in the same community
as a proportion p of its neighbors. Moreover, the random aspect of the graph
implies that this proportion p concerns neighbors which are chosen randomly and
independently at each run of the algorithm. Equivalently, all
p_ij are approximately equal to p.
Of course, this argument holds only if we assume that all elements in the graph
are random. Indeed, the existence of correlations or specific properties of nodes can
undermine it. This is for instance the case for modularity applied to graphs with very low
average degree. In particular, a node of degree 1 is always placed in the community
of its unique neighbor, and the above-mentioned argument cannot be applied. The
assumption of a complete absence of correlations is therefore only valid for large networks
with a sufficiently large average degree.
Figure 8a shows an experiment on a 10,000-node random Erdős–Rényi graph
with different average degrees. We can observe that, as the average degree
increases, the effects of low-degree nodes disappear and the distribution of p_ij
becomes much more concentrated.
4.2 Values of p_ij for Two Connected Nodes Are Higher than Those
of Two Non-connected Nodes
In Fig. 8a (bottom), we can see that the distribution is in fact composed of two distinct
modes. These two modes correspond respectively to connected pairs of nodes, i.e.
links, and non-connected pairs of nodes. Figure 8b shows the decomposition of these
two distributions. We can see that the p_ij values for connected nodes are higher on
average than for non-connected nodes.
Two nodes i and j that are not connected but have a nonzero p_ij were necessarily
classified at least once in the same community. As communities are necessarily
connected subgraphs of the input graph, there exists a path connecting them
with only nonzero p_uv for all nodes u and v along the path. For instance,
i and j can have a common neighbor k such that p_ik and p_jk are positive.
Let us assume, to simplify, that nodes i and j have a unique common neighbor
k. As the graph is purely random, we can suppose that the probability that i and k
are placed in the same community is p_ik = p, and that the probability that k and j are in the
same community is p_kj = p. We also suppose these events are independent, because the edges
linking i, j and k can lie inside as well as between different communities, without
any correlation. Thus, for i and j to be classified in the same community, these two
events must occur simultaneously. Therefore,

p_ij = p_ik · p_kj = p².
Fig. 8 p_ij distribution for a random graph with different average degrees (5 and 100) and 10,000
nodes. The curve with all pairs is nearly completely overlapped by the two other curves, except for
average degree 5. a Global distribution (all pairs of nodes). b Distinction between connected and
non-connected pairs of nodes
Let us note that these calculations do not hold in complex networks, since the
independence assumption is clearly unfounded, in particular because of the existence
of strong local correlations as measured by the clustering coefficient.
In the case where nodes i and j have no common neighbor but are connected by
a longer path in the input graph, the same reasoning gives

p_ij = ∏_{uv ∈ P} p_uv = p^t,

where P is a shortest path of length t linking i and j. This calculation holds if i and j
are linked by only one such path.
It is easy to compute p_ij in the case where the two nodes have z neighbors in common.
We obtain

p_ij = 1 − (1 − p²)^z,

which corresponds to 1 minus the probability that i and j are not linked through any common
neighbor. However, if we assume that we have large graphs with a low average
degree, the probability of having more than one common neighbor (if we already
have one) is very low.2
For these reasons, we can assume that values of pij are higher for connected pairs
than non-connected pairs.
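The orders of magnitude involved are easy to check numerically. With an illustrative value p = 0.1 (our choice, not a value from the text):

```python
p = 0.1    # probability of sharing a community with a given neighbor

# one common neighbor: both independent events must occur
via_one_neighbor = p ** 2                   # p^2 = 0.01

# z common neighbors: 1 minus the probability that no length-2 path works
z = 3
via_z_neighbors = 1 - (1 - p ** 2) ** z     # about 0.0297
```

For any p < 1 both quantities stay well below p itself, which is why connected pairs (p_ij ≈ p) dominate non-connected ones in the distribution.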
If we suppose that all connected pairs (i, j) have p_ij = p, and that non-connected
nodes u and v have a lower p_uv, then, for a threshold
below p, only pairs of connected nodes provide connectivity, and as all connected
pairs have nearly the same p_ij, we obtain only one consensual community containing
all the nodes of the input graph (for large enough values of the average degree
the graph is connected; otherwise we have as many consensual communities as there are
connected components).
Conversely, since the distribution of p_ij values for connected pairs is strongly
centered on the value p, any value of the threshold above p will destroy the consensual
communities very quickly and we obtain trivial consensual communities, each
containing only one node.
Finally, we can compute the value of this threshold. Let us assume that k% of the links
are intra-community links. Then, for each execution of the algorithm,
a node u will in expectation be grouped with k% of its neighbors, or equivalently each
neighbor will be with the given node u for k% of the executions. This value k is thus
the value of p_ij corresponding to the p that we have used so far.
Computing the exact value of p is an open problem that seems to be difficult [15].
However, numerical studies (see Fig. 9) show that it decreases with the graph density,
but the exact decrease pattern is quite complex.
2 Classical mean-field assumptions make extensive use of the fact that a random graph whose
size tends to infinity is locally a tree.
Fig. 9 Proportion of internal links for a random graph. a With 1,000 nodes. b With 10,000 nodes
5 Conclusion
We have shown here that consensual communities make it possible to distinguish graphs with
a real community structure from graphs where the community structure arises from
fluctuations. To do so, we have shown that consensual communities in random
graphs are trivial, containing either all the nodes of the graph or one node each.
These observations have been made using different quality functions optimized using
a generic version of the Louvain algorithm.
Some future work remains to further understand the absence of non-trivial
consensual communities in random graphs. First, it is necessary to compute the
exact value of the threshold as a function of the parameters (size and average degree)
of the Erdős–Rényi graph. For graphs generated from the configuration model, the
task is more difficult since there are many degree-one nodes for which the modularity
function requires that they be placed in the community of their only neighbor. Such
local correlations are harder to take into account.
Another perspective would be to conduct a similar study on regular graphs, which
are known to have no community structure. In particular, for regular grids
and tori, previous studies have shown that a high-modularity partition can be found,
but the regularity of such networks naturally allows many different partitions which
are simply translations of one another. Intuitively, this means that many high-quality
partitions can be found even though no community structure should exist.
Acknowledgments We would like to thank the anonymous referees for their insightful comments
and suggestions, which have helped to improve the presentation of this paper. This work is partially
supported by the DynGraph ANR-10-JCJC-0202 and CODDDE ANR-13-CORD-0017-01 projects
of the French National Research Agency.
References
13. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc
Natl Acad Sci USA 99(12):7821–7826
14. Guimerà R, Danon L, Díaz-Guilera A, Giralt F, Arenas A (2003) Self-similar community
structure in a network of human interactions. Phys Rev E 68(6):065103
15. Guimerà R, Sales-Pardo M, Amaral LAN (2004) Modularity from fluctuations in random graphs
and complex networks. Phys Rev E 70(2):025101
16. Janson S, Vegelius J (1982) The J-index as a measure of association for nominal scale response
agreement. Appl Psychol Meas 6:111–121
17. Karrer B, Levina E, Newman M (2008) Robustness of community structure in networks. Phys
Rev E 77(4):046119
18. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community
detection algorithms. Phys Rev E 78(4):046110
19. Lancichinetti A, Fortunato S (2012) Consensus clustering in complex networks. Sci Rep 2:336
20. Mancoridis S, Mitchell B, Rorres C (1998) Using automatic clustering to produce high-level
system organizations of source code. In: Proceedings of the 6th international workshop on
program comprehension, pp 45–53
21. Marcotorchino JF (2013) Optimal transport, spatial interaction models and related problems,
impacts on relational metrics, adaptation to large graphs and networks modularity
22. Molloy M, Reed B (1995) A critical point for random graphs with a given degree sequence.
Random Struct Algorithms 6(2–3):161–180
23. de Montgolfier F, Soto M, Viennot L (2011) Asymptotic modularity of some graph classes. In:
ISAAC, pp 435–444
24. Newman M (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci
98(2):404–409
25. Newman M, Girvan M (2004) Finding and evaluating community structure in networks. Phys
Rev E 69(2):026113
26. Reichardt J, Bornholdt S (2006) Statistical mechanics of community detection. Phys Rev E
74(1):016110
27. Rosvall M, Bergstrom C (2010) Mapping change in large networks. PLoS One 5(1):e8694
28. Seifi M, Guillaume JL (2012) Community cores in evolving networks. In: Proceedings of the
mining social network dynamics 2012 workshop (MSND), Lyon, France, pp 1173–1180
29. Seifi M, Guillaume JL, Junier I, Rouquier JB, Iskrov S (2012) Stable community cores in
complex networks. In: 3rd international workshop on complex networks, Melbourne, Florida
30. Seshadhri C, Kolda TG, Pinar A (2012) Community structure and scale-free collections of
Erdős–Rényi graphs. Phys Rev E 85:056109
31. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach
Intell 22:888–905
32. Wang Q, Fleury E (2010) Uncovering overlapping community structure. In: 2nd international
workshop on complex networks, pp 176–186
33. Wang Q, Fleury E (2009) Detecting overlapping communities in graphs. In: European
conference on complex systems (ECCS), Warwick
34. Zachary WW (1977) An information flow model for conflict and fission in small groups. J
Anthropol Res 33:452–473
35. Zahn CT (1964) Approximating symmetric relations by equivalence relations. SIAM J Appl
Math 12:840–847
Link Prediction in Heterogeneous
Collaboration Networks
1 Introduction
In many social media tools, link prediction is used to detect the existence of
unacknowledged linkages in order to relieve the users of the onerous chore of pop-
ulating their personal networks. The problem can be broadly formulated as follows:
given a disjoint node pair (x, y), predict if the node pair has a relationship, or in
the case of dynamic interactions, will form one in the near future [39]. Often, the
value of the participant's experience is proportional to the size of their personal
network, so bootstrapping the creation of social networks with link prediction can lead
to increased user adoption. Conversely, poor link prediction can irritate users and
detract from their initial formative experiences.
Although in some cases link predictors leverage external information from the
user's profile or other documents, the most popular link predictors focus on modeling
the network using features intrinsic to the network itself, and measure the likelihood
of connection by checking the proximity in the network [14, 30]. Generally, the
similarity between node pairs can be directly measured by neighborhood methods
such as the number of shared neighbors [24] or subtly measured by path methods [21].
One weakness with network-based link prediction techniques is that the links
are often treated as having a homogeneous semantic meaning, when in reality the
underlying relationship represented by a given link could have been engendered
by different causal factors. In some cases, these causal factors are easily deduced
using user-supplied meta-information such as tags or circles, but in other cases the
provenance of the link is not readily apparent. In particular, the meaning of links
created from overlapping communities is difficult to interpret, necessitating the
development of heterogeneous link prediction techniques.
In the familiar example of scientific collaboration networks, authors usually have
multiple research interests and seek to collaborate with different sets of co-authors
for specific research areas. For instance, Author A cooperates with author B on
publishing papers in machine learning conferences whereas his/her interaction with
author C is mainly due to shared work in parallel computation. The heterogeneity in
connection causality makes the problem of predicting whether a link exists between
authors B and C more complicated. Additionally, Author A might collaborate with
author D on data mining; since data mining is an academic discipline closely related
to machine learning, there is overlap between the two research communities which
indicates that the linkage between B and D is more likely than a connection between
B and C. In this article, we detect and leverage the structure of overlapping commu-
nities toward this problem of link prediction in networks with multiple distinct types
of relationships.
Community detection utilizes the notion of structural equivalence which refers
to the property that two actors are similar to one another if they participate in equiva-
lent relationships [25]. Inspired by the connection between structural equivalence and
community detection, Soundarajan and Hopcroft proposed a link prediction model
for non-overlapping communities; they showed that including community infor-
mation can improve the accuracy of similarity-based link prediction methods [32].
2 Related Work
The link prediction problem has drawn increased attention over the past few years
[5, 29, 33]. A variety of techniques for addressing this problem have been explored
including graph theory, metric learning, statistical relational learning, matrix fac-
torization, and probabilistic graphical models [17, 18, 35, 39]. This chapter is an
extended version of our prior work on supervised link prediction models [38].
Most link prediction models assume that the links in the network are homogeneous.
In this work, we focus on predicting links in link-heterogeneous networks such as
coauthorship collaboration networks, which can be modeled as networks that contain
different types of collaboration links connecting authors. From a machine learning
point of view, link prediction models can be categorized as being supervised or unsu-
pervised. Hasan et al. studied the use of supervised learning for link prediction in
coauthorship networks [13]. They identify a set of link features that are key to the
performance of their supervised learner including (1) proximity features, such as
keywords in research papers, (2) aggregated features, obtained from an aggregation
operator, and (3) topological features. The combination of these features showed
168 X. Wang and G. Sukthankar
Recently, some researchers started applying random walk models to solve the link
prediction problem. For instance, Backstrom and Leskovec developed a supervised
random walk algorithm that combines the information from the network structure
with node and edge level attributes and evaluated their method on coauthorship net-
works extracted from arXiv. The edge weights are learned by a model that optimizes
the objective function such that more strength is assigned to new links that a random
walker is more likely to visit in the future [3]. However, they only focus on predicting
links to the nodes that are 2 hops from the seed node. Liu et al. proposed a
similarity metric for link prediction based on a type of local random walk, the Superposed
Random Walk (SRW) index [19]. By taking into account the fact that in most real net-
works nodes tend to connect to nearby nodes rather than ones that are far away, SRW
continuously releases the walkers at the starting point, resulting in a higher similarity
between the target node and the nearby nodes. Apparently this assumption is invalid
in DBLP and other scientific collaboration datasets. Similarly Yin et al. estimated
link relevance using the random walk algorithm on an augmented social graph with
both attribute and structure information [41]. Their framework leverages both global
and local influences of the attributes. In contrast to their model, our diffusion-based
techniques LPDP and LPDM rely only on the network's structural information
without considering any node's local (intrinsic) features. Additionally, the experiments
described in [19] and [41] evaluated the problem of recognizing existent links in the
network rather than predicting future ones.
in the node's immediate neighbors, and path methods, such as PageRank, which
predict the links based on the paths between nodes [21]. Essentially, the prediction
score represents the similarity between the given pair of nodes: the higher the score,
the more likely that there exists a connection between them. Using the Common
Neighbors (CN) scoring method, two nodes with 10 common neighbors are more
likely to be linked than nodes with only a single common neighbor.
However, these neighborhood approaches intrinsically assume that the connections
in the network are homogeneous: each node's connections are the outcome of one
relationship. Directly applying homogeneous link predictors to overlapping
communities can cause prediction errors. A simple example is shown in Fig. 1, where two
types of relationships co-exist within the same network. The solid line represents
the coauthorship of a paper in a data mining conference and the dashed line represents
the activity of collaborating on a machine learning paper. Note that the link
types are hidden from the method; only the presence of a link is known. Author 1 is
associated with 2 affiliations since he/she participates in both activities. If all interac-
tions were considered homogeneously, the prediction score for linking authors 2 and
6, CN(2, 6), and that for authors 2 and 3, CN(2, 3), under the Common Neighbors
scoring method would be the same, since both node pairs share only one common
neighbor; yet this is clearly wrong. The question now becomes how can we capture
type correlations between edges to avoid being misled by connection heterogeneity?
In the next section, we describe how edges in the network can be analyzed using
edge clustering [34] to construct a social feature space that makes this possible.
The idea of constructing edge-based social dimensions was initially used to address
the multi-label classification problem in networked data with multiple types of
links [34]. Connections in human networks are often the result of affiliation-driven
social processes; since each person usually has more than one connection, the potential
groups involved in one person's edges can be utilized as a representation of his/her
true affiliations. Because this edge class information is not always readily available
in social media applications, an unsupervised clustering algorithm can be applied to
partition the edges into disjoint sets such that each set represents one potential
affiliation. The edges of actors who are involved in multiple affiliations are likely to be
separated into different sets.
In this article, we construct the nodes' social feature space using the scalable edge
clustering method proposed in [34]. However, instead of using the social feature
space to label nodes, in this article our aim is to leverage this information to reweight
links. First, each edge is represented in a feature-based format, where the indices of
the nodes that define the edges are used to create the features as shown in Fig. 1.
In this feature space, edges that share a common node are more similar than edges
that do not. Based on the features of each edge, k-means clustering is used to separate
the edges into groups using this similarity measure. Each edge cluster represents
Fig. 1 A simple example of a coauthorship network (a). The solid line represents coauthorship
of a paper in a data mining conference and the dashed line represents the activity of collaborating
on a machine learning paper. In edge-based social features (b), each edge is first represented by a
feature vector where the nodes associated with the edge denote the features. For instance, here the edge
1–3 is represented as [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]. Then, the node's social feature (SF) is constructed
based on edge cluster IDs (c). Suppose in this example the edges are partitioned into two clusters
(represented by the solid lines and dashed lines respectively); then the SFs for nodes 1 and 2 become
[3, 3] and [0, 2] using the count aggregation operator. Employing social features enables us to score
2–6 (a cross-affiliation link) lower than 2–3 even though they have the same number of common
neighbors
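The construction sketched in Fig. 1b, c can be written down compactly. In the sketch below (plain Python), the tiny graph and the fixed edge-to-cluster assignment are hypothetical stand-ins for the k-means step:

```python
def edge_features(edges, n_nodes):
    """Fig. 1b: one binary feature vector per edge, with 1s at the
    indices of the edge's two endpoints."""
    vectors = []
    for u, v in edges:
        f = [0] * n_nodes
        f[u] = f[v] = 1
        vectors.append(f)
    return vectors

def social_features(edges, cluster_of, n_nodes, n_clusters):
    """Fig. 1c: count aggregation -- SF[node][c] counts the node's
    edges assigned to edge cluster c."""
    sf = [[0] * n_clusters for _ in range(n_nodes)]
    for (u, v), c in zip(edges, cluster_of):
        sf[u][c] += 1
        sf[v][c] += 1
    return sf

edges = [(0, 1), (0, 2), (1, 2), (0, 3)]
clusters = [0, 0, 0, 1]    # pretend k-means put edge (0, 3) in its own cluster
sf = social_features(edges, clusters, n_nodes=4, n_clusters=2)
```

Node 0 participates in both clusters (SF = [2, 1]) while node 3 participates only in the second (SF = [0, 1]), so similarity measures over these vectors can separate within-affiliation from cross-affiliation pairs.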
In this article, the weights of the links are evaluated based on the users' social
features extracted from the network topology under different similarity measures. For
our domain, we evaluated several commonly used metrics including the inner product,
cosine similarity, and the Histogram Intersection Kernel (HIK), which is used to compare
color histograms in image classification tasks [4]. Since our social features can be
regarded as a histogram of a person's involvement in different potential groups, HIK
can also be adopted to measure the similarity between two people. Given the social
features of person v_i and person v_j, (SF_i, SF_j) ∈ X × X, the HIK is defined as
follows:

K_HI(v_i, v_j) = Σ_{k=1}^{m} min{SF_i(k), SF_j(k)},  (1)
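Equation (1) is a component-wise minimum followed by a sum. Using the social features [3, 3] and [0, 2] of nodes 1 and 2 from Fig. 1 as inputs:

```python
def hik(sf_i, sf_j):
    """Histogram Intersection Kernel over two social-feature histograms."""
    return sum(min(a, b) for a, b in zip(sf_i, sf_j))

weight = hik([3, 3], [0, 2])   # min(3, 0) + min(3, 2) = 2
```

The kernel is maximal when one histogram is contained in the other, which matches its interpretation as the overlap of two people's group involvements.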
In order to investigate the impact of link weights for link prediction in collaboration
networks, we compare the performances of eight benchmark unsupervised metrics
for unweighted networks and their extensions for weighted networks. The prediction
scores from these unsupervised metrics can further be used as the attributes for
learning supervised prediction models. We detail the unsupervised prediction metrics
for both unweighted and weighted networks in the following sections.
Let N (x) be the set of neighbors of node x in the social network and let Dx be
the degree (the total number of neighbors) of node x. Obviously, in an unweighted
network, Dx = |N (x)|. Let w(x, y) be the link weight between nodes x and y in a
weighted network. Note that in our generated weighted network, the weight matrix
W is symmetric, i.e. w(x, y) = w(y, x).
The CN measure for unweighted networks is defined as the number of nodes with
direct connections to the given nodes x and y:

CN(x, y) = |N(x) ∩ N(y)|.  (2)

The CN measure is one of the most widespread metrics adopted in link prediction,
mainly due to its simplicity. Intuitively, the measure simply states that two nodes
that share a high number of common neighbors should be directly linked [24].
For weighted networks, the CN measure can be extended as:

CN(x, y) = Σ_{z ∈ N(x) ∩ N(y)} (w(x, z) + w(y, z)).  (3)
The JC measure assumes that node pairs that share a higher proportion of common
neighbors relative to their total number of neighbors are more likely to be linked.
From this point of view, JC can be regarded as a normalized variant of CN. For
unweighted networks, the JC measure is defined as:

JC(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|.  (4)
The PA measure assumes that the probability that a new link is created from a node x
is proportional to the node degree D_x (i.e., nodes that currently have a high number
of relationships tend to create more links in the future). Newman proposed that the
product of a node pair's numbers of neighbors should be used as a measure for the
probability of a future link between those two nodes [24]. The PA measure for an unweighted
network is defined by:
neighbors that have fewer neighbors. The AA measure for unweighted networks is
defined as:

AA(x, y) = Σ_{z ∈ N(x) ∩ N(y)} 1 / log(|N(z)|).  (8)
The Resource Allocation Index has a formula similar to the Adamic-Adar Coefficient,
but with a different underlying motivation. RA is based on physical processes of
resource allocation [26] and can be applied to networks formed by airports (for
example, the flow of aircraft and passengers) or networks formed by electric power
stations, such as power distribution. The RA measure was first proposed in [42] and
for unweighted networks it is expressed as follows:

RA(x, y) = Σ_{z ∈ N(x) ∩ N(y)} 1 / |N(z)|.  (10)
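The unweighted CN, JC (4), AA (8) and RA (10) measures translate almost directly into code. A sketch with the network stored as a dict mapping each node to its neighbor set (helper names and the toy graph are our own):

```python
import math

def common_neighbors(N, x, y):
    """Number of shared neighbors of x and y."""
    return len(N[x] & N[y])

def jaccard(N, x, y):
    """Shared neighbors normalized by the union of both neighborhoods."""
    return len(N[x] & N[y]) / len(N[x] | N[y])

def adamic_adar(N, x, y):
    """Shared neighbors weighted by 1/log of their degree."""
    return sum(1 / math.log(len(N[z])) for z in N[x] & N[y])

def resource_allocation(N, x, y):
    """Shared neighbors weighted by 1/degree."""
    return sum(1 / len(N[z]) for z in N[x] & N[y])

N = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
```

Here the pair (a, b) shares the single neighbor c, which has three neighbors, so RA(a, b) = 1/3 while AA(a, b) = 1/log 3; AA and RA thus penalize high-degree common neighbors, AA less aggressively than RA.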
The Path Distance measure for unweighted networks simply counts the number of
nodes along the shortest path between x and y in the graph. Thus, when two nodes
x and y share at least one common neighbor, PD(x, y) = 1. In this article, we
adopt the Inverse Path Distance to measure the proximity between two nodes, where
IPD is based on the intuition that nearby nodes are likely to be connected. In a
weighted network, IPD is defined by the inverse of the shortest weighted distance
between two nodes. Since IPD quickly approaches 0 as path lengths increase, for
computational efficiency we terminate the shortest path search once the distance
exceeds a threshold L and approximate IPD for more distant node pairs as 0.
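The cutoff makes IPD cheap to compute with an early-terminating Dijkstra search. A sketch (plain Python; the adjacency format is an assumption, and x ≠ y is assumed):

```python
import heapq

def inverse_path_distance(adj, x, y, cutoff):
    """1 / shortest weighted distance from x to y, approximated as 0
    once the distance exceeds the cutoff. Assumes x != y."""
    dist = {x: 0.0}
    queue = [(0.0, x)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue                  # stale queue entry
        if d > cutoff:
            break                     # every remaining node is too far
        if u == y:
            return 1.0 / d
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return 0.0

adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 2.0}, "c": {"b": 2.0}}
```

With this toy graph, the a-to-c distance is 3, so IPD is 1/3 with a cutoff of 5 but is approximated as 0 with a cutoff of 2.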
4.1.7 PropFlow
PropFlow [18] is an unsupervised link prediction method which calculates the
probability that a restricted random walk starting at x ends at y in L steps or fewer,
using link weights as transition probabilities. The walk terminates when reaching
node y or when revisiting any node, including node x. By restricting its search within the
threshold L, PropFlow is a local measure that is insensitive to noise in the network
topology far from the source node and can be computed quite efficiently. The algorithm
for unweighted networks is identical to that for weighted networks, except that all
link weights are set equal to 1.
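The restricted walk can be computed by breadth-first flow propagation rather than by simulation. The sketch below illustrates this idea (it is our own reading, not the authors' reference implementation, and the dict-of-dicts weight format is an assumption):

```python
def propflow(adj, source, L):
    """Push probability mass outward from the source for at most L steps,
    splitting it proportionally to link weights; already-visited nodes
    absorb incoming flow but are not expanded again."""
    scores = {source: 1.0}
    found = {source}
    frontier = [source]
    for _ in range(L):
        next_frontier = []
        for a in frontier:
            total = sum(adj[a].values())
            for b, w in adj[a].items():
                scores[b] = scores.get(b, 0.0) + scores[a] * w / total
                if b not in found:
                    found.add(b)
                    next_frontier.append(b)
        frontier = next_frontier
    return scores

adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
```

On the toy path a-b-c, the mass reaching c within 2 steps is 0.5 (all flow reaches b, then splits evenly between a and c), and c is unreachable within 1 step.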
4.1.8 PageRank
The PageRank (PR) algorithm of Google fame was first introduced in [6]; it aims
to represent the significance of a node in a network based on the significance of the
other nodes that link to it. Inspired by the same assumption as made by Preferential
Attachment, we assume that the links between nodes are driven by the importance of
the nodes; hence the PageRank score of the target node represents a useful statistic.
Essentially, PageRank outputs the ranking scores (or probabilities) of visiting the target
node during a random walk from a source. A parameter α, which controls the probability
of surfing to a random node, is considered in the implementation. In our experiment, we set
α = 0.85 and perform an unoptimized PageRank calculation iteratively until the
vector that represents the PageRank scores converges.
For weighted networks, we adopted the weighted PageRank algorithm proposed
in [10].
PR_w(x) = α Σ_{k ∈ N(x)} PR_w(k) / L(k) + (1 − α) · w(x) / Σ_{y=1}^{N} w(y),  (12)

where L(x) is the sum of the outgoing link weights from node x, and Σ_{y=1}^{N} w(y) is the
total weight across the whole network.
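Read literally, Eq. (12) can be iterated to a fixed point. The sketch below implements that literal reading (a sketch, not the reference implementation of [10]: the dict-of-dicts format, the iteration count, and taking w(x) to be the node's total incident weight are all our assumptions):

```python
def weighted_pagerank(adj, alpha=0.85, iterations=100):
    """Fixed-point iteration of the reconstructed Eq. (12); adj is a
    symmetric dict-of-dicts of link weights."""
    strength = {x: sum(adj[x].values()) for x in adj}   # L(x), and w(x) here
    total = sum(strength.values())                       # sum of w(y) over all y
    pr = {x: 1.0 / len(adj) for x in adj}
    for _ in range(iterations):
        pr = {x: alpha * sum(pr[k] / strength[k] for k in adj[x])
                 + (1 - alpha) * strength[x] / total
              for x in adj}
    return pr

adj = {"a": {"b": 2.0}, "b": {"a": 2.0, "c": 1.0}, "c": {"b": 1.0}}
scores = weighted_pagerank(adj)
```

On this toy chain the hub b ends up with the highest score, consistent with the importance-driven interpretation above.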
to the scoring function a priori. In other words, the assumption is that both the links
in the existing network and the predicted links score highly on the given measure.
Second, the ranking of node pairs is performed using only a single metric, and hence
the strategy may fail to exploit the different structural patterns contained in the
network. By contrast, supervised link prediction schemes can integrate information
from multiple measures and can usually better model real-world networks. Most
importantly, unlike in other domains where supervised algorithms require access to
appropriate quantities of labeled data, in link prediction we can use the existing links
in the network as the source of supervision. For these reasons, supervised approaches
to link prediction are drawing increased attention in the community [13, 18, 28].
In this article, we follow a standard approach: we treat the prediction scores from
the unsupervised measures as features for the supervised link predictor. We compare
the accuracy of different classifiers on both unweighted and weighted collaboration
networks.
1 http://www.informatik.uni-trier.de/~ley/db/.
Similar DBLP datasets have previously been employed by Kong et al. to evaluate
collective classification in multi-relational networks [15]. In this article, we aim to
predict the missing links (coauthorship) in the future based on the existing connection
patterns in the network.
In this article, the supervised link prediction models are learned from training links
(all existing links) in the DBLP dataset extracted between 2000 and 2008, and the
performance of the model is evaluated on the testing links, new co-author links
generated between 2009 and 2010. Link prediction using supervised learning model
can be regarded as a binary classification task, where the class label (0 or 1) represents
the link existence of the node pair. When performing the supervised classification,
we sample the same number of non-connected node pairs as that of the existing links
to use as negative instances for training the supervised classifier.
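The balanced sampling step can be sketched as follows (a hypothetical helper, not the authors' exact procedure):

```python
# Balanced negative sampling: draw as many non-connected node pairs as there
# are positive (linked) pairs, so the training set has equal class sizes.
import random

def sample_negatives(nodes, positives, k, seed=0):
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < k:
        u, v = rng.sample(nodes, 2)
        pair = (min(u, v), max(u, v))     # canonical order for undirected pairs
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

nodes = ["a", "b", "c", "d", "e"]
positives = {("a", "b"), ("b", "c")}
negatives = sample_negatives(nodes, positives, len(positives))
```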
In our proposed LPSF model, the edge clustering method is adopted to construct
the initial social dimensions. When conducting the link prediction experiment, we use
cosine similarity while clustering the links in the training set. The edge-based social
dimension in our proposed method, LPSF, is constructed based on the edge cluster
IDs using the count aggregation operator, and varying numbers of edge clusters are
tested in order to provide the best performance of LPSF. The weighted network is
then constructed according to the similarity score of connected nodes' social features under the weight measure selected from Sect. 4. The search distance L for
unsupervised metrics Inverse Path Distance and PropFlow is set to 5. We evaluate
the performance of four supervised learning models in this article, which are Naive
Bayes (NB), Logistic Regression (LR), Neural Network (NN) and Random Forest
(RF). All algorithms have been implemented in WEKA [12], and the performance
of each classifier is tested using its default parameter setting.
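The reweighting idea, assigning each existing edge the similarity between its endpoints' social-feature vectors, can be sketched as follows (cosine similarity shown; the feature values and graph are invented):

```python
# Sketch of edge reweighting: each edge's weight is the cosine similarity of
# the endpoints' social-feature vectors (e.g., edge-cluster counts). The other
# similarity measures of Sect. 4 would slot in the same way.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

features = {"a": [2, 0, 1], "b": [1, 0, 1], "c": [0, 3, 0]}
edges = [("a", "b"), ("b", "c")]
weights = {(u, v): cosine(features[u], features[v]) for u, v in edges}
```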
In the DBLP dataset, the number of positive link examples for testing is very
small compared to negative ones. In this article, we sample an equivalent number
of non-connected node pairs as links from the 2009 and 2010 period to use as the
negative instances in the testing set. The evaluation measures for supervised link
prediction performance used in this article are precision, recall and F-Measure.
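For reference, the three evaluation measures computed on toy prediction/label pairs (values invented; 1 = link exists):

```python
# Precision, recall, and F-Measure from binary predictions and labels.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```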
178 X. Wang and G. Sukthankar
4.4 Results
This section describes several experiments to study the benefits of augmenting link
prediction methods using LPSF. First, we compare the performance of different
weighting metrics used in LPSF. Second, we evaluate how the number of social
features affects the performance of LPSF. Finally, we examine how several supervised link prediction models perform on unweighted and weighted networks, and the degree to which LPSF improves classification performance under different evaluation measures.
Fig. 2 Classification performance of LPSF on the DBLP Dataset using different similarity measures on nodes' social features. The number of edge clusters is set to 1,000, and Histogram Intersection Kernel (HIK) performs the best in both datasets. a DBLP-A dataset. b DBLP-B dataset
Fig. 3 Classification performance of LPSF using HIK on the DBLP Dataset with varying number
of social features, using different supervised classifiers. a DBLP-A dataset. b DBLP-B dataset
Here, we evaluate how the number of social features (edge clusters) affects the link
prediction performance of LPSF, and Fig. 3 shows the corresponding classification
accuracy under the F-Measure metric. In the DBLP-A dataset, Naive Bayes and Random Forest are relatively robust to the number of social features, while Logistic Regression and Neural Network perform better with a smaller number of social features (less than 500). Similarly, in the DBLP-B dataset, LPSF demonstrates better performance with fewer social features. Therefore we set the number of social features to 300 and 500 for the DBLP-A and DBLP-B datasets respectively.
Figures 4 and 5 display the comparisons between LPSF and the baseline methods
on the DBLP datasets using a variety of supervised link classification techniques,
against both the unweighted and weighted supervised baselines. The same features
are used by all methods, with the only difference being the weights on the network
links. In this article, we compare the proposed method LPSF with alternate weighting
schemes, such as the number of co-authored papers, as suggested in [9]. We see that in
both DBLP datasets, Unweighted, Weighted and LPSF perform almost equally under
Precision, though LPSF performs somewhat worse for some classifiers (Random
Forest and Naive Bayes). When considering the number of collaborations between
author pairs, the Weighted method slightly improves upon the performance of the
Unweighted method.
The proposed reweighting (LPSF) offers substantial improvement over both the
Unweighted and Weighted schemes on Recall and F-Measure in both datasets. In
the DBLP-A dataset, LPSF outperforms the unweighted baseline the most dramatically on Logistic Regression, with improvements of about 23 % on Recall and 40 % on F-Measure. In the DBLP-B dataset, LPSF shows the best performance
using Neural Network, with accuracy improvements over the baselines of 13 % on Recall and 30 % on F-Measure.
LPSF calculates the closeness between connected nodes according to their social dimensions, which capture the nodes' prominent interaction patterns embedded in the network and better address heterogeneity in link formation. By differentiating between different types of links, LPSF is able to discover possible link patterns between disconnected node pairs that may not be detected by the Unweighted and simple Weighted methods, and hence exhibits great improvement on Recall and F-Measure.
Since LPSF can be directly applied to an unweighted network without considering any additional node information, it is broadly applicable to a variety of link prediction domains.
Figures 4 and 5 compare the performance of different supervised classifiers for link
prediction. We found that the performance of the classifiers varies between datasets.
Logistic Regression, Naive Bayes and Neural Network exhibit comparable performance. Somewhat surprisingly, Random Forest does not perform well with LPSF.
We also observe that LPSF using Naive Bayes will boost the Recall performance
over baseline methods at the cost of lower Precision. Therefore Logistic Regression
and Neural Network are a better choice for LPSF in that they improve the Recall
performance without decreasing the Precision. Using the traditional weighted features [9] does not help supervised classifiers for link prediction to a great extent.
As discussed above, reweighting the unweighted collaboration network using our
proposed technique, LPSF, performs the best.
Traditional unsupervised link prediction methods aim to measure the similarity for
a node pair and use the affinity value to predict the existence of a link between
them. The performance of a link predictor is consequently highly dependent on the
choice of pairwise similarity metrics. Most widely used unsupervised link predictors
focus on the underlying local structural information of the data, which is usually
extracted from the neighboring nodes within a short distance (usually 1-hop away)
from the source. For instance, methods such as Common Neighbors and Jaccard's Coefficient calculate the prediction scores based on the number of directly shared
neighbors between the given node pair. However, a recent study of coauthorship
networks by Backstrom and Leskovec shows that researchers are more interested
in establishing long-range weak ties (collaborations) rather than strengthening their
well-founded interactions [3]. Figure 6 shows the distance distribution of newly collaborating authors between 2009 and 2010 in the DBLP datasets. We discover that in
both datasets the majority of new links are generated by a node pair with a minimal
distance equal to or greater than two. This poses a problem for local link predictors
which ignore information from the intermediate nodes along the path between the
node pair.
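The measurement behind Fig. 6 can be computed with a plain BFS; a sketch using the paper's convention that 0 marks unreachable pairs (the adjacency lists are invented):

```python
# BFS shortest-path distance between the endpoints of a (future) link;
# returns 0 when no connecting path exists, matching Fig. 6's convention.
from collections import deque

def bfs_distance(adj, src, dst):
    seen = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return seen[u]
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return 0  # no connecting path

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
```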
In the past few years, the diffusion process (DP) model has attracted an increasing
amount of interest for solving information retrieval problems in different domains
[11, 36, 40]. DP aims to capture the geometry of the underlying manifold in a
weighted graph that represents the proximity of the instances. First, the data are represented as a weighted graph, where each node represents an instance and edges are weighted according to their pairwise similarity values. Then the pairwise affinities are re-evaluated in the context of all connected instances, by diffusing the similarity values through the graph. The most common diffusion processes are based on random
walks, where a transition matrix defines probabilities for walking from one node to a neighboring one that are proportional to the provided affinities. By repeatedly
making random walk steps on the graph, affinities are spread on the manifold, which
in turn improves the obtainable retrieval scores. In the context of social network
data, the data structure naturally leads to graph modeling, and graph-based methods
have been proven to perform extremely well when combined with Markov chain
techniques. In the following sections, we will explore the effectiveness of diffusion-based methods on solving link prediction problems. The next section introduces the diffusion process model (DP) and an embedding method based on diffusion processes, diffusion maps (DM). Our proposed diffusion-based link prediction models (LPDP and LPDM) are discussed in Sects. 5.1 and 5.2.

Fig. 6 Probability distribution of the shortest distance between node pairs in future links (between 2009 and 2010) in the DBLP datasets. Distances marked as 0 indicate that no path can be found that connects the given node pair. a DBLP-A dataset. b DBLP-B dataset
We begin with the definition of a random walk on a graph G = (V, E), which
contains N nodes vi V , and edges ei j E that link nodes to each other. The
entries in the N N affinity matrix A provide the edge weights between node pairs.
The random walk transition matrix P can be defined as
$$P = D^{-1} A \qquad (13)$$

where D is the diagonal degree matrix,

$$D_{ii} = \deg(i) \qquad (14)$$

and deg(i) is the degree of the node i (i.e., the sum over its edge weights). The transition probability matrix P is a row-normalized matrix, where each row sums up to 1. Assuming f_0, a 1 × N dimensional vector of the initial distribution for a specific node, a single step of the diffusion process can be defined by the simple update rule:

$$f_{t+1} = f_t P \qquad (15)$$

After t steps, the distribution becomes

$$f_t = f_0 P^t \qquad (16)$$

where P^t is the t-th power of the matrix P. The entry f_t(j) in f_t measures the probability of going from the source node to node j in t time steps.
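A minimal numpy sketch of this random-walk diffusion on a toy affinity matrix (the matrix and the choice t = 3 are ours):

```python
# Row-normalize an affinity matrix A into the transition matrix P (Eq. 13)
# and diffuse a one-hot start distribution for t steps (Eq. 16).
import numpy as np

A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
P = A / A.sum(axis=1, keepdims=True)      # P = D^{-1} A
f0 = np.array([1.0, 0.0, 0.0])            # walk starts at node 0
ft = f0 @ np.linalg.matrix_power(P, 3)    # f_t = f_0 P^t with t = 3
```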
The PageRank algorithm described in Sect. 4.1 is one of the most successful
webpage ranking methods and is constructed using a random walk model on the
underlying hyperlink structures. In PageRank, the standard random walk is modified:
at each time step t, a node can walk to its outgoing neighbors with probability α or will jump to a random node with probability (1 − α). The update strategy is as follows:

$$f_{t+1} = \alpha f_t P + (1 - \alpha) y \qquad (17)$$

where y is the restart (teleportation) distribution. Stacking the distributions for all source nodes as the rows of a matrix W_t yields the matrix form

$$W_{t+1} = \alpha W_t P + (1 - \alpha) Y \qquad (18)$$

where W_{ij}^{(t)} is the corresponding (i, j) entry in W_t. Note that W_t is not necessarily a symmetric matrix, meaning W_{ij}^{(t)} ≠ W_{ji}^{(t)}.
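This update can be iterated directly; a sketch with numpy, taking Y = I (each row restarts its own walk) as one common choice and an invented transition matrix:

```python
# PageRank-style diffusion of Eq. (18): with probability alpha follow the
# transition matrix, otherwise restart from Y. Rows of W stay stochastic.
import numpy as np

def diffuse(P, alpha=0.85, t=10):
    Y = np.eye(P.shape[0])                    # per-source restart distributions
    W = Y.copy()
    for _ in range(t):
        W = alpha * W @ P + (1 - alpha) * Y   # Eq. (18)
    return W

P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
W = diffuse(P)
```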
The diffusion maps technique (DM), first introduced by Coifman and Lafon, applies
the diffusion process model toward the problem of dimensionality reduction; it
aims to embed the data manifold into a lower-dimensional space while preserving
the intrinsic local geometric data structure [7]. Different from other dimensionality reduction methods such as principal component analysis (PCA) and multi-dimensional scaling (MDS), DM is a non-linear method that focuses on discovering
the underlying manifold generating the sampled data. It has been successfully used
on problems outside of social media analysis, including learning semantic visual
features for action recognition [20].
As discussed in the previous section, in diffusion models each entry W_{ij}^{(t)} indicates the probability of walking from i to j in t time steps. When we increase t, the diffusion
process moves forward, and the local connectivity is integrated to reveal the global
connectivity of the network. Increasing the value of t raises the likelihood that edge
weights diffuse to nodes that are further away in the original graph. From this point
of view, the Wt in the diffusion process reflects the intrinsic connectivity of the
network, and the diffusion time t plays the role of a scaling factor for data analysis.
Subsequently, the diffusion distance D is defined using the random walk forward probabilities p_{ij}^t to relate the spectral properties of a Markov chain (its matrix, eigenvalues, and eigenvectors) to the geometry of the data. The diffusion distance aims to measure the similarity of two points (N_i and N_j) using the diffusion matrix W_t, and takes the form:

$$[D^{(t)}(N_i, N_j)]^2 = \sum_{q} \frac{\big(W_{iq}^{(t)} - W_{jq}^{(t)}\big)^2}{\phi(N_q)^{(0)}} \qquad (20)$$

where \phi(N_q)^{(0)} is the unique stationary distribution, which measures the density of the data points.
Since calculating the diffusion distance is usually computationally expensive,
spectral theory can be adopted to map the data point into a lower dimensional space
such that the diffusion distance in the original data space now becomes the Euclidean
distance in the new space. The diffusion distance can then be approximated with
relative precision using the first k nontrivial eigenvectors and eigenvalues of W_t according to

$$[D^{(t)}(N_i, N_j)]^2 \approx \sum_{s=1}^{k} (\lambda_s^t)^2 \big(v_s(N_i) - v_s(N_j)\big)^2 \qquad (21)$$

The diffusion map \Psi_t embeds the data into a Euclidean space in which the Euclidean distance approximates the diffusion distance:

$$\Psi_t(N_i) = \big(\lambda_1^t v_1(N_i), \lambda_2^t v_2(N_i), \ldots, \lambda_k^t v_k(N_i)\big)$$
The diffusion maps framework for the proposed method, Link Prediction using Diffusion Maps (LPDM), is summarized in Table 2. LPDM defines the link prediction score for a given node pair (N_i, N_j) as the diffusion distance D^{(t)}(N_i, N_j) between them.
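The spectral shortcut above can be sketched with numpy; the transition matrix below is a toy example with real eigenvalues, and in practice the eigenvectors of the actual diffusion matrix W_t would be used:

```python
# Embed each node using the k leading nontrivial eigenvectors scaled by
# lambda_s^t, so Euclidean distance in the embedding approximates the
# diffusion distance (values below are invented for illustration).
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
t, k = 4, 2
vals, vecs = np.linalg.eig(P)
order = np.argsort(-np.abs(vals))
lam, V = vals[order].real, vecs[:, order].real
# drop the trivial eigenvalue 1 and keep the next k coordinates
embedding = (lam[1:k + 1] ** t) * V[:, 1:k + 1]
d01 = np.linalg.norm(embedding[0] - embedding[1])   # approx. diffusion distance
```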
classifier. In this article, we report the performance of all unsupervised link prediction
methods using AUROC.
5.4 Results
As mentioned before, in diffusion processes the diffusion time t controls how much weight diffuses between distant node pairs. The higher the value of t, the more likely the link weights are to diffuse to nodes that are
further away. Figure 7 shows the effect of varying diffusion time on the LPDP link
prediction accuracy for the DBLP dataset. In this experiment, we fix the value of α to 0.9, which offers LPDP the best performance. We discover that setting t to a higher
value does not guarantee higher link prediction accuracy. LPDP performs best when
Fig. 7 Link prediction performance (AUROC) of LPDP with fixed damping factor α = 0.9 and
varying diffusion time (t) on unweighted DBLP-A and DBLP-B datasets. LPDP performs best on
both datasets when t = 15
Fig. 8 AUROC accuracy of LPDM on DBLP datasets with varying damping factor α and embedded space size. The diffusion time t for LPDM is set to 100 and 60 for the DBLP-A and DBLP-B datasets
respectively. a DBLP-A dataset. b DBLP-B dataset
t = 15, yielding an AUROC accuracy of 85.49 and 84.61 % on the DBLP-A and DBLP-B datasets respectively.
Here, we evaluate how the size of the embedded space and the value of the damping factor α affect the link prediction performance of LPDM. Figure 8 shows the corresponding classification accuracy measured by AUROC. The diffusion time t has an
insignificant effect on the performance of LPDM, and the results we report here are
based on setting t to 100 and 60 for DBLP-A and DBLP-B respectively. In both
datasets, a lower damping factor yields higher accuracy, and LPDM demonstrates the best performance when α equals 0.55 and 0.65 on DBLP-A and DBLP-B respectively. Note that in Eq. 18, a lower α results in a reduced probability of exchanges
between a node and its connected neighbors. Our results reveal that the size of the
embedded diffusion space greatly affects the performance of LPDM. Here we report
experimental results for embedded diffusion space dimensions ranging from 1 to
100. As shown in Fig. 8, the diffusion maps technique is able to identify semantically similar nodes by measuring distance in an embedded space of much smaller dimensionality. LPDM exhibits the best performance (79.61 and 79.08 %) when the
size of the embedded space equals 25 and 15 on DBLP-A and DBLP-B respectively.
In Sect. 4.4.3, we evaluate our supervised link classifier LPSF which employs an
ensemble of unsupervised measures as features. These unsupervised measures can
Table 3 Link prediction accuracy of individual (unsupervised) classifiers on the DBLP-A dataset
AUROC (%) PA AA CN JC RA IPD PropFlow PageRank LPDP LPDM
Unweighted 86.68 50.95 50.95 50.95 50.20 77.46 77.52 82.54 85.49 79.61
Weighted 85.16 50.95 50.95 50.95 50.20 80.06 79.71 85.61 83.08 80.43
Performance is evaluated on both unweighted networks and weighted networks constructed using
social context features. Note that the reweighting scheme does not always improve accuracy at the
individual feature level
Table 4 Link prediction accuracy of individual (unsupervised) classifiers on the DBLP-B dataset
AUROC (%) PA AA CN JC RA IPD PropFlow PageRank LPDP LPDM
Unweighted 87.97 52.15 52.15 52.14 50.66 77.09 76.98 83.60 84.61 79.08
Weighted 87.11 52.15 52.15 52.15 50.66 76.23 76.66 87.14 80.11 80.09
Performance is evaluated on both unweighted networks and weighted networks constructed using
social context features. Note that the reweighting scheme does not always improve accuracy at the
individual feature level
nodes after the diffusion process which results in inferior performance to LPDP.
LPDM's performance is worse than LPDP's by around 5 %, while still performing better than IPD and PropFlow. This might be because the diffusion process after
t diffusion time steps is good enough to capture the underlying similarity between
nodes at farther distances using the node similarity extracted from the final diffusion
matrix.
Third, Tables 3 and 4 also include the comparison results of different unsupervised
link predictors on weighted DBLP networks constructed using edge cluster information. On one hand, we found that in methods such as CN, JC, AA and RA, the
weighting scheme does not affect the corresponding link prediction accuracy much.
On the other hand, the weighting scheme helps to improve the performance of IPD, PropFlow and PageRank, as well as LPDM, by around 2–3 %. On both weighted datasets,
PageRank performs best among all unsupervised features. It is also surprising that
LPDP performs poorly on the weighted network, reducing the accuracy by 2 % on
the DBLP-A dataset and 4 % on the DBLP-B dataset.
In summary, we observe that the reweighting scheme yields dramatic improvements in LPSF, which integrates the first eight features listed in Table 3 in a supervised setting; however, it fails to boost the unsupervised performance of individual features. As mentioned in [22], the utility of using weights in link prediction is a
ual features. As mentioned in [22], the utility of using weights in link prediction is a
somewhat controversial issue. Some case studies have shown that prediction accuracy
can be significantly harmed when weights in the relationships were considered [22].
Our experiments reveal a more nuanced picture: although link weights (using the
proposed approach) may not generate a large improvement for some individual
unsupervised feature-level techniques, employing an appropriate choice of link
weights (e.g., using LPSF) in conjunction with a supervised classifier enables us
to achieve more accurate classification results on the DBLP datasets.
Weights based on node pairs' social features extracted from an unweighted network. higher similarity between the target node and the nearby nodes. Apparently this assumption is invalid in DBLP and other scientific collaboration datasets.
Similarly Yin et al. estimated link relevance using the random walk algorithm on
an augmented social graph with both attribute and structure information [41]. Their
framework leverages both global and local influences of the attributes. Different from their model, our diffusion-based techniques LPDP and LPDM rely only on the network's structural information, without considering any node's local (intrinsic) features. Additionally, the experiments in [19, 41] evaluate the existing links in the network rather than predicting future links.
6 Conclusion
interaction patterns from the network topology and embeds the similarities between connected nodes as link weights. The nodes' similarity is calculated based on social features extracted using edge clustering to detect overlapping communities in the network. Experiments on the DBLP collaboration network demonstrate that a judicious choice of weight measure in conjunction with supervised link prediction enables us to significantly outperform existing methods. LPSF better captures the true proximity between node pairs based on link group information and improves the performance of supervised link prediction methods.
However, the social features utilized effectively by the supervised version of LPSF
are less useful in an unsupervised setting both with the raw proximity metrics and our
two new diffusion-based methods (LPDP and LPDM). We observe that in the DBLP dataset researchers are more likely to collaborate with other highly published authors with whom they share weak ties, which causes the random-walk-based methods (PR, LPDP and LPDM) to generally outperform the other benchmarks. Even though the
reweighting scheme greatly boosts the performance of LPSF, it does not always
have a significant impact on its corresponding unsupervised features. In conclusion, we note that any weighting strategy should be applied with caution when tackling
the link prediction problem.
References
1. Adamic L, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3):211–230
2. Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multi-scale complexity in networks. Nature 466:761–764
3. Backstrom L, Leskovec J (2011) Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the fourth ACM international conference on web search and data mining, pp 635–644
4. Barla A, Odone F, Verri A (2003) Histogram intersection kernel for image classification. In: Proceedings 2003 international conference on image processing, vol 3, pp III-513–16
5. Benchettara N, Kanawati R, Rouveirol C (2010) Supervised machine learning applied to link prediction in bipartite social networks. In: Proceedings of the international conference on advances in social network analysis and mining, pp 326–330
6. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117
7. Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21(1):5–30
8. Davis D, Lichtenwalter R, Chawla NV (2012) Supervised methods for multi-relational link prediction. Social network analysis and mining, pp 1–15
9. de Sá HR, Prudêncio RBC (2011) Supervised link prediction in weighted networks. In: International joint conference on neural networks (IJCNN), pp 2281–2288
10. Ding Y (2011) Applying weighted PageRank to author citation networks. CoRR abs/1102.1760
11. Donoser M, Bischof H (2013) Diffusion processes for retrieval revisited. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 1320–1327
12. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
13. Hasan MA, Chaoji V, Salem S, Zaki M (2006) Link prediction using supervised learning. In: Proceedings of the SDM workshop on link analysis, counterterrorism and security
14. Jin EM, Girvan M, Newman MEJ (2001) The structure of growing social networks. Phys Rev E 64:046132
15. Kong X, Shi X, Yu PS (2011) Multi-label collective classification. In: SIAM international conference on data mining (SDM), pp 618–629
16. Lee JB, Adorna H (2012) Link prediction in a modified heterogeneous bibliographic network. In: Proceedings of international conference on advances in social networks analysis and mining (ASONAM), pp 442–449
17. Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inf Sci Technol 58(7):1019–1031
18. Lichtenwalter RN, Lussier JT, Chawla NV (2010) New perspectives and methods in link prediction. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 243–252
19. Liu W, Lü L (2010) Link prediction based on local random walk. EPL (Europhys Lett) 85(5)
20. Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 461–468
21. Lü L, Zhou T (2011) Link prediction in complex networks: a survey. Phys A 390(6):1150–1170
22. Lü L, Zhou T (2009) Role of weak ties in link prediction of complex networks. In: Proceedings of the ACM international workshop on complex networks meet information and knowledge management, pp 55–58
23. Murata T, Moriyasu S (2007) Link prediction of social networks based on weighted proximity measures. In: Web intelligence, pp 85–88
24. Newman M (2001) Clustering and preferential attachment in growing networks. Phys Rev E 64(2):025102
25. Newman MEJ (2004) Detecting community structure in networks. Eur Phys J B Condens Matter Complex Syst 38(2):321–330
26. Ou Q, Jin YD, Zhou T, Wang BH, Yin BQ (2007) Power-law strength-degree correlation from resource-allocation dynamics on weighted networks. Phys Rev E 75:021102
27. Pan JY, Yang HJ, Faloutsos C, Duygulu P (2004) Automatic multimedia cross-modal correlation discovery. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, pp 653–658
28. Popescul A, Popescul R, Ungar LH (2003) Statistical relational learning for link prediction. In: IJCAI workshop on learning statistical models from relational data
29. Pujari M, Kanawati R (2012) Tag recommendation by link prediction based on supervised machine learning. In: Proceedings of the international conference on weblogs and social media
30. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc, New York
31. Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T (2008) Collective classification in network data. AI Mag 29:93–106
32. Soundarajan S, Hopcroft J (2012) Using community information to improve the precision of link prediction methods. In: Proceedings of the international conference on the world wide web, pp 607–608
33. Sun Y, Barber R, Gupta M, Aggarwal CC, Han J (2011) Co-author relationship prediction in heterogeneous bibliographic networks. In: Proceedings of the international conference on advances in social networks analysis and mining, pp 121–128
34. Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimensions. In: Proceedings of international conference on information and knowledge management (CIKM)
35. Taskar B, Wong MF, Abbeel P, Koller D (2003) Link prediction in relational data. In: Neural information processing systems
36. Wang J, Li Y, Bai X, Zhang Y, Wang C, Tang N (2011) Learning context-sensitive similarity by shortest path propagation. Pattern Recognit 44(10–11):2367–2374
37. Wang X, Sukthankar G (2011) Extracting social dimensions using Fiedler embedding. In: Proceedings of IEEE international conference on social computing, pp 824–829
Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang, Xiaodong Wang,
Cindy Chen, Benyuan Liu and Don Towsley
Abstract Online dating sites have become popular platforms for people to look for
romantic partners, providing an unprecedented level of access to potential dates that is
otherwise not available through traditional means. Characterizing users' online dating behavior helps us to obtain a deep understanding of their dating preferences and make better recommendations on potential dates. In this paper we study user
online dating behavior and preference using a large real-world dataset from a major
online dating site in China. In particular, we characterize the temporal behavior,
message send and reply behavior of users, study how users' online dating behaviors correlate with various user attributes, and investigate how users' actual online dating
behaviors deviate from their stated preferences. Our results show that on average
a male sends out more messages but receives fewer messages than a female. A
female is more likely to be contacted but less likely to reply to a message than a
male. The number of messages that a user sends out and receives per week quickly
decreases with time, especially for female users. Most messages are replied to within
a short time frame with a median delay of around 9 h. Many of the user messaging
behaviors align with notions in social and evolutionary psychology: males tend to
look for younger females while females place more emphasis on the socioeconomic
status (e.g., income, education level) of a potential date. The geographic distance
between two users and the photo count of users play an important role in their dating behavior. We show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a
potential date may largely be a result of random selection. We also find that while
both males and females are more likely to reply to users whose attributes come closest to the stated preferences of the receivers, there is a significant discrepancy between a user's stated dating preference and his/her actual online dating behavior. We further characterize how users' actual dating behavior deviates from their stated preference.
These results can provide valuable guidelines to the design of a recommendation
engine for potential dates.
1 Introduction
1 http://statisticbrain.com/online-dating-statistics.
Characterization of User Online Dating Behavior and Preference . . . 195
younger and younger women. A female in her 20s is more likely to look for older
males, but as a female gets older, she becomes more open towards younger males.
In addition to the above findings, we observe that geographic distance between
two users plays an important role in online dating considerations: 46.5 % of the initial
messages occurred between users in the same city, and for messages that cross the
city boundaries, the volume quickly decreases as users live farther apart. Females
are more likely than males to send and reply to messages between distant big cities.
Profile photos affect males' and females' messaging behaviors differently. Females with a larger number of photos are more likely to invite messages and secure replies from males, but the photo count of males does not have as significant an effect in attracting contacts and replies.
Our results also show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a
potential date may be a result of random selection. For example, while it appears that a
male tends to look for females shorter than he is and a female tends to look for males
taller than she is, the message send and reply behaviors of both genders closely
approximate those resulting from random selection, showing that these behaviors
may result from random selection rather than users true preferences.
Our results also indicate a significant discrepancy between a user's stated dating
preference and his/her actual online dating behavior. A fairly large fraction of mes-
sages are sent or replied to users whose attributes do not match the sender's or
receiver's stated preferences. Females tend to be more flexible than males in deviat-
ing from their stated preferences when sending and replying to messages. For both
males and females, out of the population of users that send messages, replies are
more likely to go to users whose attributes come closest to the stated preferences of
the receivers. We further characterize how users' actual dating behavior deviates from
their stated preferences. For both male and female users, when they send messages
to people who do not satisfy their stated age requirement, younger users are more
likely to send messages to people older than their stated age preference, while users
in older age groups (especially males) become more likely to send messages to people
younger than their stated preference. Similarly, shorter users are more likely
to send messages to people taller than their stated preference, while taller users are
more likely to send messages to people shorter than their stated preference.
In summary, our results reveal how user message send and reply behaviors
correlate with various user attributes, how these behaviors differ from random selec-
tion, and how users' actual online dating behavior deviates from their stated prefer-
ences. These results on users' dating preferences can provide valuable guidelines for
the design of a recommendation engine for potential dates.
The rest of the paper is structured as follows. Section 2 presents an overview of
previous studies on the data analysis of online dating sites. Section 3 describes the
dataset that we obtained from a major online dating site in China. Section 4 describes
the temporal characteristics of users' online dating behavior. Users' message send
and reply behaviors are studied in Sect. 5. We discuss our main results in Sect. 6.
Finally, we conclude the paper in Sect. 7.
2 Related Work
Fiore et al. [6] analyze people's online dating messaging behavior and find it
consistent with predictions from evolutionary psychology: women state more restric-
tive preferences than men and contact and reply to others more selectively. Lin and
Lundquist [11] studied how race, gender, and education jointly shape interaction
among heterosexual Internet daters. They find that racial homophily dominates mate-
searching behavior for both men and women. However, this is not the case for Chinese
online daters, where the overwhelming majority of users are of the same race. Finkel
et al. [5] state that online dating has fundamentally altered the dating landscape by
offering an unprecedented level of access to potential partners and allowing users to
communicate before deciding whether to meet them face-to-face. On the other hand,
the authors also argue that there is no strong evidence that matching algorithms pro-
mote better romantic outcomes than conventional offline dating. Part of the problem
is that the main principles underlying these algorithms (typically similarity but also
complementarity) are much less important to relationship well-being than online
sites are willing to assume. He et al. [7] propose two rules (potentials-attract and
likes-attract) to predict user mate choice, and their results imply that the likes-attract rule
(based on users' actual behavior) works better than potentials-attract (based on users'
stated preferences), which is consistent with our observations to some extent. Interest-
ing on-the-fly statistics of OkCupid users can be found at the OkTrends blog [14].
Hitsch et al. [9] show that in online dating there is no evidence of users strategically
shading their true preferences. Both male and female users have a strong
preference for similarity along many (but not all) attributes. US users display strong
same-race correlations. There are gender differences in mate preferences; in particu-
lar, women have a stronger preference than men for income over physical attributes.
In their follow-up work [8], they show that stable matches obtained through the
Gale-Shapley algorithm are similar to the actual matches achieved by the dating site,
which are also approximately efficient.
Collaborative filtering has proved to be an effective approach for building
recommendation systems based on users' activity history. Zhao et al. [21] and Cai
et al. [2] take into account the matching of both tastes and attractiveness between two users,
and show that the method can effectively improve the performance of user
recommendation in online dating. Learning users' actual dating preferences based on
their attributes has become a popular methodology in recent studies of reciprocal
recommendation systems. Pizzato et al. [16] propose a content-based algorithm to
calculate compatibility scores between two users based on their attributes and activity
history for recommendation on online dating sites. Li and Li [10] consider both local
utility (users' mutual preference) and global utility (the overall bipartite network), and
propose a generalized framework for reciprocal recommendation on online dating
sites. Tu et al. [18] propose a two-sided matching framework for online dating recom-
mendations and design a Latent Dirichlet Allocation (LDA) model to learn user
preferences from the observed user messaging behavior and user profile features.
In [19], Xia et al. extract user-based features from user profiles and graph-based
198 P. Xia et al.
features from user interaction history, and use a machine learning framework to
predict user replying behavior in an online dating network.
In a recent study [20], we investigated how users' online dating behavior correlates
with various user attributes. In this paper, we further extend our previous work by
studying how users' online dating behavior deviates from random selection as well
as from their stated preferences.
3 Dataset Description
We report on a dataset taken from baihe.com, a major online dating site in China.
It includes the profile information of 200,000 users uniformly sampled from users
registered in November 2011. For each user, we have his/her message sending and
receiving traces (who contacted whom at what time) in the online dating site and the
profile information of the users that he or she has communicated with from the date
that the account was created until the end of January 2012.
A user's profile provides a variety of information, including the user's gender, age,
current location (city and province), home town location, height, weight, body type,
blood type, occupation, income range, education level, religion, astrological sign,
marriage and children status, number of photos uploaded, home ownership, car own-
ership, interests, smoking and drinking behavior, and a self-introduction essay, among
others. Each user also provides his/her preferences for potential romantic partners in
terms of age, location, height, education level, income range, marriage and children
status, etc.
Of the 200,000 sampled users, 139,482 are males and 60,518 are females,
constituting 69.7 and 30.3 % of the total number of sampled users, respectively.
The dataset includes people from 34 countries and from all of the provinces, munic-
ipalities (cities directly under the jurisdiction of the central government, including
Beijing, Shanghai, Tianjin, and Chongqing), and special administrative regions (Hong
Kong and Macau) in China. Figure 1 illustrates the user geographical locations (at the city
level) within China and the inter-city communications between users. Intra- and
inter-city messages constitute 46.5 and 53.5 % of the total message volume in our
data, respectively.
To give a sense of the main user demographic attributes, we plot the distributions of
user-reported age, height, education level, monthly income range, and marriage status
in Fig. 2a–e, respectively.
The youngest user is 19 years old and the largest fraction of users are in their early
20s. While there is a larger fraction of male users than female users below age 25,
the fraction of female users starts to match that of male users for the age range 25–35,
and exceeds that of male users after age 35. The median ages of male and female
users are 25 and 26, respectively.
The height distributions of males and females exhibit a bell shape. The median
heights of males and females are 172 and 162 cm, with a standard deviation of 5.4
and 4.7 cm, respectively.
The fraction of female users is larger than that of male users for low income
ranges (less than 3,000 Chinese Yuan per month). For higher income ranges, the
trend is the opposite. In general, males have larger incomes than females in our
dataset. The median income ranges of male and female users are 3,000–4,000 and
2,000–3,000 Chinese Yuan, respectively.
With respect to users' education level, females' stated education levels tend to be
higher than males'. About 66.5 % of females state that they have at least a community
college degree, in contrast with only 53.2 % of the males. The fraction of users with
stated doctoral and post-doctoral degrees is 0.61 %.
As shown in Fig. 2e, the majority of users in their early 20s are single. As
user age increases, the ratio of single users decreases while the ratio of widowed
users increases. The ratio of divorced users first increases with user age until the
mid-40s and then starts to decrease. In general, the ratios of widowed and divorced
female users are larger than those of male users.
Unlike online dating behaviors in the US, where race plays an important role when it
comes to finding potential romantic partners [11, 14], most of the users (98.9 %) in our
dataset are Han (the ethnic majority in China), and all other ethnic groups comprise 1.1 %
of the users. Moreover, the majority of the users (97.0 %) claim to be non-religious.
Those claiming a religion (Buddhism, Taoism, Catholicism, Islam, etc.) constitute
only 3.0 % of users. Note that the race and religion compositions in our dataset are
Fig. 2 a Age distribution of the male and female users. b Height distribution of the male and female
users. c Cumulative distribution function of users' reported monthly income. d Education level
distribution of the male and female users. e Marriage status distribution of the male and female
users
significantly different from those of online dating sites in the US where there is more
diversity [9, 11].
For each user in our sample, we have the time stamps of the messages as well as
the profile information of the users that this user has communicated with. In this paper
we focus on the initial messages exchanged between users. Subsequent messages
between the same pair of users do not represent a new sender-receiver pair and
cannot be used as the only indicator of a continuing relationship, as users may choose
Fig. 3 a Fraction of users who sent out at least one message during a week. b Average number of
messages a user sent out each week given that a user sends at least one message
to go off-line from the site and communicate via other channels (e.g., email, phone,
or meet in person).
4 Temporal Behaviors
We are interested in how a users online dating activity level changes over time after
he or she registers an account on the online dating site. Since we only have eight
full weeks' worth of online dating data for users who joined on November 30, 2011,
we only consider the activities of each user during the first eight weeks of his/her
membership. The following analysis is based on the activities of the 200,000 users
in the dataset described in Sect. 3.
During the eight-week period, 2,089,029 initial messages were sent by 76,654 males
(55.0 % of the males in the dataset) to 508,118 unique females, which in turn gener-
ated 156,774 replies (a reply rate of 7.5 %). During the same time period, 1,217,672
initial messages were sent by 29,535 females (48.8 % of the females in the dataset)
to 440,714 unique male users, which in turn generated 112,696 replies (a reply rate
of 9.3 %).
The fraction of users from the dataset that sent out at least one message and the
average number of messages sent by each user are shown in Fig. 3a, b, respectively.
We observe that while a considerable fraction of users (51.2 % of males and 43.0 %
of females) sent out at least one message during the first week of their memberships,
the fraction decreases sharply in the second week (down to 11.3 % for males and
12.8 % for females) and further decreases in subsequent weeks. Except for the first
Fig. 4 a CCDF of the number of messages a user sent out during the first eight weeks of his/her
membership. b CCDF of the number of messages a user received during the first eight weeks of
his/her membership
week, females are slightly more likely to send out a message than males on average.
The average number of messages a male sends out each week given that he sends
at least one message lies between 15 and 20 messages per week. While the average
number of messages a female sends given that she sends at least one message is more
than twice that of a male in the first week, it decreases sharply in the second week
and remains relatively stable at a much lower level than that of a male over the next
seven weeks.
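The weekly-activity statistics above (the fraction of users who send at least one message during each week of their membership) can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' code: the record layout, with (sender, timestamp) pairs and a registration-time table keyed by user id, is our own.

```python
from collections import defaultdict

def weekly_activity(messages, registration, n_weeks=8):
    """Fraction of registered users who sent at least one message in each
    of their first n_weeks of membership.

    messages:     iterable of (sender_id, timestamp) pairs, timestamps in days
    registration: dict mapping user_id -> registration timestamp (days)
    """
    active = defaultdict(set)  # week index -> set of users active that week
    for sender, ts in messages:
        week = int((ts - registration[sender]) // 7)
        if 0 <= week < n_weeks:
            active[week].add(sender)
    n_users = len(registration)
    return [len(active[w]) / n_users for w in range(n_weeks)]

# toy example: two users, three messages
reg = {"u1": 0.0, "u2": 0.0}
msgs = [("u1", 1.0), ("u1", 8.0), ("u2", 2.0)]
print(weekly_activity(msgs, reg, n_weeks=3))  # [1.0, 0.5, 0.0]
```

The same grouping, with counts instead of sets, yields the per-week average message volumes of Fig. 3b.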
For both males and females, we obtain the distribution of the number of messages
sent by each user per week, given that the user sends at least one message during the
week, and plot its complementary cumulative distribution function (CCDF) in Fig. 4a.
We observe that the distributions exhibit heavy tails. Most users only sent out a small
number of messages: 94.6 % of males and 96.5 % of females sent out fewer than 100
messages during the first eight weeks of their membership. On the other hand, there
are small fractions of users that sent out a large number of messages. According to
the online dating site, most of these highly active users are likely to be fake identities
created by spammers, and their accounts have been quickly removed from the site.
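An empirical CCDF like the one in Fig. 4 can be computed directly from the per-user message counts: for each value x, it reports the fraction of users with at least x messages. The sketch below is our own illustration; the function name and input format are assumptions.

```python
import bisect

def ccdf(samples):
    """Empirical complementary cumulative distribution function:
    for each distinct value x, the fraction of samples >= x
    (typically plotted on log-log axes to expose heavy tails)."""
    s = sorted(samples)
    n = len(s)
    # bisect_left finds how many samples are strictly below x
    return [(x, (n - bisect.bisect_left(s, x)) / n) for x in sorted(set(samples))]

# toy message counts: most users send few messages, one heavy sender
print(ccdf([1, 1, 2, 5]))  # [(1, 1.0), (2, 0.5), (5, 0.25)]
```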
Turning to the messages received by the sampled users: during the same time period,
328,645 initial messages were sent by 94,179 females
to 44,509 males, which in turn generated 58,946 replies (a reply rate of 17.9 %), and
1,586,059 initial messages were sent by 288,602 males to 45,623 females, which
in turn generated 150,917 replies (a reply rate of 9.5 %). Note that males are more
likely to initiate contact than females, while messages from females are more likely
to generate replies than those from males.
The fraction of users from the dataset that received at least one message and
the average number of messages received by each user during the first eight weeks
of his/her membership are shown in Fig. 5a, b, respectively. We observe that the
Fig. 5 a Fraction of users who received at least one message during a week. b Average number of
messages a user received each week given that a user received at least one message
fractions of both males and females that receive at least one message each week
gradually decrease over time, and that females are much more likely to receive
messages than males. Also, the average number of messages a user receives per week
generally decreases over time for both genders, and the number of messages
received by a female each week is much larger than that received by a male.
For those users that received at least one message during the first eight weeks
of their membership, we show the complementary cumulative distribution function
(CCDF) of the number of messages received by each user for both males and females
in Fig. 4b. We observe that the distributions for both male and female users exhibit a
log-normal-like behavior, and that females tend to receive more messages than male
users.
To investigate how long it takes a user to reply after receiving a message, we
define the reply delay of a message as the time elapsed from when the message
is sent until the corresponding reply is generated, when there is a reply. The reply
delay may have certain psychological implications for some people and hence affect
the progress of the communication, so it is an important metric to study.
We obtained the reply delay distribution for the 209,863 messages replied to by
users within the dataset and plot it in Fig. 6. The reply delay distribution exhibits a
log-normal behavior with a cut-off point around 79,424 min (approximately 56 days,
or 8 weeks). Note that the cut-off point is due to the fact that we only have the com-
munication record for each user during the first eight weeks of his/her membership,
so the obtained distribution is limited by this factor.
There is little difference in the reply delay distribution for male and female users.
The median reply delays of males and females are 8.9 and 9.0 h, respectively. Most
messages were replied to within a short time frame. Around 23.0 % of the messages
were replied to within 1 h, and 72.6 % of the messages were replied to within 24 h.
On the other hand, there is a small fraction of the messages with a long reply delay
of tens of days. For example, about 6.3 % of the messages required a week or more
to generate a reply.
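The reply-delay metric defined above (reply time minus send time, computed only for messages that received a reply) can be sketched as follows. The dictionaries keyed by message id are hypothetical, chosen for illustration.

```python
import statistics

def reply_delays(messages, replies):
    """Reply delay in minutes for each replied-to message.

    messages: dict msg_id -> send time (minutes)
    replies:  dict msg_id -> reply time, only for messages that got a reply
    """
    return [replies[m] - messages[m] for m in replies]

sent = {"a": 0.0, "b": 100.0, "c": 200.0}
got_reply = {"a": 60.0, "c": 1640.0}   # "b" was never replied to
delays = reply_delays(sent, got_reply)
print(statistics.median(delays))  # 750.0 minutes
```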
Fig. 6 Distribution of reply delay (in minutes) for messages replied to within the dataset
5 User Message Send and Reply Behaviors

After a user creates an account on the online dating site, he/she can search for potential
dates based on information within the profiles provided by other users, including user
location, age, etc. Once a potential date has been discovered, the user then sends
a message to him/her, which may or may not be replied to by the recipient. The
message sending and replying behaviors of a user are strong indicators of what he/she
is looking for in a potential partner and reflect the user's actual dating preferences.
In this section, we first present the correlation of user send and reply
behaviors with various user attributes, including age, height, income, education level,
distance, and photo count. We further examine how actual user behavior deviates
from random selection, where the user attributes (e.g., age, height, income, etc.) of the
recipient of a message are randomly drawn from their respective distributions. When
appropriate, error bars are provided with a 95 % confidence interval.
On the online dating site, a user can provide his/her preferences for potential
dates in terms of age, location, height, education level, income range, marriage and
children status, etc. In the design of a recommendation algorithm for potential dates, it
is important to know whether and to what extent users follow their stated preferences
in actual dating. The discrepancy between a user's stated preference and his or her
actual dating behavior is often referred to as dissonance in social psychology, and has
been previously observed [4]. In this section, we examine the degree of dissonance
of online dating in our dataset. In particular, we study to what extent users adhere to
their stated preferences and how reply probability varies as a function of the number
of user attributes that match the receiver's stated preference.
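The random-selection baseline used throughout this section can be approximated by re-pairing senders with receivers at random: shuffling preserves both marginal attribute distributions while destroying any preference signal linking a sender to who they actually contacted. The sketch below is our own construction under that assumption, illustrated with ages.

```python
import random
import statistics

def random_selection_baseline(sender_ages, receiver_ages, n_rounds=100, seed=0):
    """Age-difference distribution under random selection: receivers are
    re-paired with senders at random, so any systematic sender preference
    disappears while both marginal age distributions are preserved."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_rounds):
        shuffled = list(receiver_ages)
        rng.shuffle(shuffled)
        diffs.extend(s - r for s, r in zip(sender_ages, shuffled))
    return diffs

senders = [25, 30, 35, 40]
receivers = [22, 28, 33, 38]
base = random_selection_baseline(senders, receivers)
# the baseline mean equals mean(senders) - mean(receivers) by construction
print(round(statistics.mean(base), 6))  # 2.25
```

Comparing the observed age-difference histogram against this shuffled baseline is what reveals whether a behavior reflects true preference or mere availability.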
Fig. 7 a Distribution of age difference between senders and receivers. b Reply probability for users
with different age difference
5.1 Age
Figure 7a shows the distribution of the age difference between the sender and receiver
of all messages sent by the sample users in the dataset. The age difference is computed
as the sender's age less the receiver's age. While the age difference between senders
and receivers covers a wide range, the preferences of males and females are opposite
of each other. Males tend to look for younger females, and the distribution is skewed
towards much younger females. On the other hand, females tend to look for older
males, and the distribution is skewed toward older males. The median age difference
is 2 for messages sent from males to females and 4 vice versa. Male and female
preferences are not random; they look for potential dates with a smaller age difference
than predicted by random selection.
Figure 7b plots the reply probability as a function of the age difference between the
sender and receiver of a message. For both males and females, the reply probability
deviates significantly from the result of random selection, exhibiting a bell shape
with modes at an age difference of ten years older and eight years younger, respectively.
Males tend to reply to younger females while females tend to reply to older males,
within a certain range of age difference.
Figure 8 depicts the heatmaps of the fraction of messages and the reply probabilities
between users of different ages. As a male gets older, he searches for and replies to
relatively younger females. A female in her 20s is more likely to communicate with
older males, but as a female gets older, she becomes more open towards younger
males. This is the cause of the reply probability increase in the age difference range
from 3 to 10, as shown in Fig. 7b. These results are consistent with observations
made in [6].
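Curves like Fig. 7b, binned reply probabilities with 95 % error bars, can be sketched as below using the normal-approximation confidence interval p ± 1.96·sqrt(p(1−p)/n). The input format, (age difference, replied) pairs, is our own assumption about the record layout.

```python
import math
from collections import defaultdict

def reply_probability_by_age_diff(messages):
    """messages: iterable of (age_difference, replied) pairs, replied a bool.
    Returns {age_diff: (p, half_width)} with a 95 % normal-approx. CI."""
    counts = defaultdict(lambda: [0, 0])  # age diff -> [replies, total]
    for diff, replied in messages:
        counts[diff][1] += 1
        counts[diff][0] += int(replied)
    out = {}
    for diff, (k, n) in counts.items():
        p = k / n
        out[diff] = (p, 1.96 * math.sqrt(p * (1 - p) / n))
    return out

msgs = [(2, True), (2, False), (2, False), (2, False), (-4, True), (-4, True)]
print(reply_probability_by_age_diff(msgs)[2][0])  # 0.25
```

For small bins a Wilson interval would be a more robust choice than the normal approximation used here.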
Fig. 8 Heat map of the fraction of messages and reply probabilities between users of different ages:
a fraction of messages sent from males to females, b fraction of messages sent from females to
males, c reply probability of males to females, d reply probability of females to males
5.2 Height
Figure 9a shows the distribution of the height difference between the sender and receiver
of all messages sent by sample users. The height difference is computed as the
sender's height less the receiver's height. We observe that users' message sending
behaviors with respect to height closely match those resulting from random selection.
While it appears that a male tends to look for females shorter than him and a female
tends to look for males taller than her, this is likely to be a result of random selection
rather than user preference.
Figure 9b plots the message reply probability as a function of the height difference
between senders and receivers. Similarly, user message reply behavior with
respect to height closely matches that of random selection, and is thus likely to be the
result of random selection rather than user preference.
Fig. 9 a Distribution of height difference between senders and receivers. b Reply probability for
users with different height difference
Fig. 10 a Distribution of income difference between senders and receivers. b Reply probability
for senders with different incomes
5.3 Income
Figure 10a shows the distribution of the income difference between senders and receivers.
A user reports monthly income within a range such as below 2,000, 2,000–3,000 (all
in Chinese Yuan), etc. We take the median value of the reported income range as a
user's income, and the income difference between the sender and receiver of a message
is computed as the difference between the sender's income and the receiver's income. We observe that
user message sending behavior with respect to income closely matches that resulting
from random selection. While it appears that males tend to send messages to females
with lower incomes and females tend to send messages to males with higher incomes,
this is likely to be a result of random selection and of the fact that male incomes are
larger than female incomes, rather than of user preference.
Figure 10b shows how reply probability varies with sender income. The reply
probability of female recipients increases with male sender income, deviating
Fig. 11 a Fraction of messages sent to users of different education levels. b Reply probability for
messages from users of different education levels
significantly from the flat line of random selection. There is a strong correlation,
with a coefficient of 0.90, between the reply probability and male sender income. On the
other hand, the income of a female does not have as significant an effect on the like-
lihood of her messages being replied to: the reply probability fluctuates around the
line of random selection, and the correlation between the reply probability and female
sender income is much weaker, with a correlation coefficient of 0.50.
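A correlation coefficient such as the 0.90 reported above is the ordinary Pearson coefficient between the per-bin incomes and the corresponding reply probabilities. A self-contained sketch follows; the bin medians and probabilities below are made-up values for illustration only, not figures from the dataset.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

income_bin_medians = [2500, 3500, 4500, 6000, 8500]  # hypothetical bin medians
reply_prob = [0.10, 0.12, 0.15, 0.18, 0.22]          # hypothetical probabilities
print(round(pearson(income_bin_medians, reply_prob), 2))
```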
5.4 Education Level

Figure 11a shows the fractions of messages sent to users of different education levels.
We observe that male behavior closely matches that of random selection, while
female behavior deviates considerably from random selection towards higher
education levels.
Figure 11b shows how reply probabilities vary with sender education level for
males and females. The higher the education level of a male sender, the more likely
his messages will be replied to; the reply probability of female users deviates
significantly from that of random selection. On the other hand, the education level of a
female does not have as significant an effect on the likelihood of her messages being
replied to: the reply probability of male users stays relatively flat across different
education levels, similar to that resulting from random selection.
Fig. 12 a Fraction of messages sent between users at different distances (km). b Reply probability
for users at different distances (km)
5.5 Distance

Since a significant fraction of messages are exchanged between users in different cities,
we further study how message sending behavior and reply probability vary with the
distance between users (computed as the straight-line distance between the two cities).
As shown in Fig. 12a, in general the fraction of messages decreases as the distance
between users increases. Messages between users at least 1,000 km apart
constitute only a small fraction (11.7 %) of the total number of messages. Note
that there is a small increase in the fraction of messages at distances between 800 and
1,400 km for female senders.
Figure 12b depicts how the reply probability varies with the distance between a sender
and receiver. When a male receives a message from a female, the reply probability
generally decreases with the distance between them. For females, the reply probability
first decreases with distance but increases in the range from 800 to 1,400 km.
The increase in the initial message ratio and reply probability of females for the
distance range from 800 to 1,400 km is due to the following. There are a number
of big cities (Shanghai, Beijing, Hong Kong, Chongqing, Guangzhou, Xi'an,
etc.) between many of which the distance falls into this range, and unlike males,
females are more likely to send and reply to messages between these cities.
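The straight-line distance between two cities can be computed as the great-circle distance via the haversine formula; the paper does not specify its exact method, so the sketch below is one reasonable implementation, assuming city coordinates in decimal degrees.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle ("straight-line") distance in km between two points,
    via the haversine formula; coordinates in decimal degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Beijing -> Shanghai, roughly 1,070 km
print(round(great_circle_km(39.90, 116.41, 31.23, 121.47)))
```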
5.6 Photo Count

On the dating site, a user can post photos on his/her profile page. Figure 13a plots
the distribution of the number of photos posted by a user. A large fraction of users
did not post any photos or posted only a small number. In our dataset, about 69 % of
male users and 59 % of female users did not post any photos. Female users tend to
post more photos than male users.
As shown in Fig. 13b, a user tends to receive more messages if he/she has posted
more photos online, with the trend being more pronounced for females than for
Fig. 13 a Distribution of users' photo counts. b Average number of messages received during the
first eight weeks of membership for users with different photo counts. c Reply probability for users
with different photo counts
males. The number of messages received by a male user starts to level off after some
point.
Figure 13c shows how message reply probability varies with the number of photos
posted by the sender. We observe that male reply probability tends to increase with
the number of photos posted by the female sender. Interestingly, when a female
receives a message, the reply probability remains relatively stable as the number of
photos of the male sender increases.
On the online dating site in our study, a user can specify a set of attributes that he/she is
looking for in a date, including age range, geographic location, height range, marriage
status (never married, divorced), education level, income range, house ownership,
and children status (no children, children living with user, children not living with
user).
Characterization of User Online Dating Behavior and Preference . . . 211
Fig. 14 Fraction of reply messages that violate the sender's stated preference as a function of time
(curves: male receive from female without replying, male reply to female, female receive from male
without replying, female reply to male)
Fig. 15 Unmatch ratio of different user attributes (age, height, location, income, education level,
marriage status, children status, house ownership) in reply messages sent by a male users and
b female users
Fig. 16 Unmatch ratio of different user attributes in each week sent by a male users and b female
users
For male users, the unmatch ratios for most of the attributes remain relatively stable
over time, including income, house ownership, marriage status and children status.
The unmatch ratio for age decreases from 23.0 to 16.9 % during the eight-week
period, while the unmatch ratios for height and education increase by a small amount
over the same period. For females, the unmatch ratio for age remains relatively stable,
while for the other attributes they are more likely to follow their stated preference in
the first week and become more flexible afterwards.
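The unmatch ratio used here can be computed as the fraction of messages whose receiver falls outside the sender's stated preference range for an attribute. A minimal sketch over hypothetical message records:

```python
def unmatch_ratio(messages, attribute):
    """Fraction of messages whose receiver falls outside the sender's
    stated [min, max] preference range for the given attribute."""
    violations = 0
    for msg in messages:
        lo, hi = msg["pref"][attribute]          # sender's stated range
        value = msg["receiver"][attribute]       # receiver's actual value
        if not (lo <= value <= hi):
            violations += 1
    return violations / len(messages)

# Hypothetical messages: each sender states an age preference of 25-30.
msgs = [
    {"pref": {"age": (25, 30)}, "receiver": {"age": 27}},
    {"pref": {"age": (25, 30)}, "receiver": {"age": 33}},  # violates
    {"pref": {"age": (25, 30)}, "receiver": {"age": 29}},
    {"pref": {"age": (25, 30)}, "receiver": {"age": 22}},  # violates
]
print(unmatch_ratio(msgs, "age"))  # 0.5
```

The same function, applied per attribute and per week, yields the curves of Figs. 15 and 16.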
At the online dating site in our study (baihe.com), a user can specify his or her age
and height preference by setting the maximum and minimum values. In Fig. 17, we
plot the difference between the age of receivers who do not meet the age requirement
of the sender and the age range specified by the sender for different sender age
groups. For both male and female users, when they send messages to people who do
not satisfy their stated age requirement, younger users (2029 years old) are more
likely to send messages to people older than their stated age preference, while users
of older age group (especially males) become more likely to send messages to people
younger than their stated preference.
Fig. 17 The age difference between users' actual behavior and their stated preference for different
age-group senders: a 20–29 b 30–39 c 40–49
Similarly, as shown in Fig. 18, users of lower height (150–159 cm) are more likely
to send messages to people taller than their stated preference, while taller users are
more likely to send messages to people shorter than their stated preference.
Figure 19 shows how male and female reply probabilities vary as a function of the
number of sender attributes that match the receiver's stated preference. The margin
of error is provided at a 95 % confidence level. We observe that, except for the case
where there is no matching attribute, the reply probability increases with the number
of matched user attributes, indicating that both males and females tend to reply to
senders whose attributes best match their stated preferences. Note that although the
reply probability for zero matching attributes is larger than that for one matching
attribute, the sample size of users with zero matching attributes is rather small, and
thus the corresponding margin of error is too large for this comparison to be
statistically sound.
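The 95 % margin of error for a reply probability can be reproduced with the standard normal-approximation binomial interval; the counts below are hypothetical, not from the paper's dataset:

```python
import math

def reply_probability_ci(replies, received, z=1.96):
    """Reply probability with a normal-approximation margin of error
    at the 95 % confidence level (z = 1.96)."""
    p = replies / received
    margin = z * math.sqrt(p * (1 - p) / received)
    return p, margin

# Hypothetical counts: 130 replies out of 500 received messages.
p, moe = reply_probability_ci(130, 500)
print(round(p, 3), round(moe, 3))  # 0.26 0.038
```

Because the margin grows as the sample size shrinks, a small group (such as the zero-matching-attribute senders above) yields an interval too wide to support a statistical conclusion.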
Figure 20a, b compare the reply probabilities of two scenarios where a sender's
attribute matches or does not match the receiver's stated preference, respectively.
As expected, we observe that for both males and females, the reply probability is
larger when the sender's attribute matches the receiver's stated preference.
Fig. 18 The height difference between users' actual behavior and their stated preference for different
height-group senders: a 150–159 cm b 160–169 cm c 170–179 cm
6 Discussion
Part of our results on user messaging behavior align with notions in social and
evolutionary psychology [1, 3, 12]. Males tend to look for younger females but do
not seem to care much about the socioeconomic status such as income and education
level of a potential date. On the other hand, females tend to look for older males
and place more emphasis on the socioeconomic status of a potential date. Moreover,
we observe that as a male gets older, he searches for relatively younger females. A
female in her 20s is more likely to look for older males, but as a female gets older,
she becomes more open towards younger males.
Online dating sites significantly increase access to potential dates across geographic
locations compared with traditional means. In our dataset, a considerable
fraction (53.5 %) of the initial messages traversed across city boundaries while the
remaining 46.5 % occurred between users in the same city. Users still prefer dates in
close proximity. For inter-city messages, the sending volume and reply rate quickly
decrease as users live farther apart. Compared to male users, females are more likely
to send and reply to messages between distant big cities (e.g., Beijing, Shanghai,
Hong Kong, Guangzhou, Xi'an, etc.).
Fig. 19 Reply probability as a function of the number of user attributes that match the receiver's
stated preference
Fig. 20 Reply probabilities for scenarios where the sender's attribute matches or does not match
the a male and b female receiver's stated preference
On the online dating site, a user can post his/her own photos and view other users'
photos. But profile photos affect males' and females' messaging behaviors differently.
Females with a larger number of photos are more likely to invite messages and secure
replies from males, but the photo count of males does not have as significant an effect
in attracting contacts and replies.
In the analysis of users' dating preferences, our results show that it is important
to differentiate between user dating preferences and the results of random selection.
Some user behaviors in choosing attributes in a potential date may be largely explained
by random selection. For example, while it appears that a male tends to look for
females shorter than him and a female tends to look for males taller than her, the
message sending and replying behaviors of both genders closely approximate those
resulting from random selection, suggesting that these patterns may be partly due to
random selection rather than users' true preferences. Similar observations have been
made for the behaviors of male users in choosing the income and education level
of a potential date, whereas the corresponding female behaviors deviate significantly
from random selection and thus reflect their true preferences.
Our results also show that there is a significant level of discrepancy between a user's
stated dating preference and his/her actual online dating behavior. A fairly large
fraction of messages are sent or replied to users whose attributes do not match
the sender's or receiver's stated preferences. Females tend to be more flexible than
males in following their stated preferences when sending and replying to messages.
Both male and female users share the same top-three most violated attributes: age,
location and height. For male users, the unmatch ratios of the other attributes are all
very low (below 5 %), while female users are most strict about the marriage and
children status, as well as the education level, of male senders. For both males and
females, among the users who send messages, replies are more likely to go to users
whose attributes come closest to the preferences of the receivers.
7 Conclusion
We study how people's online dating behaviors correlate with various user attributes,
using a large real-world dataset from a major online dating site in China. Many of
our results align with notions in social and evolutionary psychology. In particular,
males tend to look for younger females, while females put more emphasis on the
socioeconomic status (e.g., income, education level) of a potential date. Moreover,
the geographic distance between two users and their photo counts play important
but different roles in the dating behaviors of males and females. Our results show
that it is important to differentiate between users' true preferences and the results of
random selection: some user behaviors in choosing attributes in a potential date may
be a result of random selection. Our results also show that there is a significant
discrepancy between a user's stated dating preference and his/her actual online
dating behavior. Our study provides a firsthand account of online dating behaviors
in China, a country with a large population and a unique culture. These results on
users' dating preferences can provide valuable guidelines for the design of
recommendation engines for potential dates.
Acknowledgments This work was supported by the NSF grant CNS-1065133 and ARL Coop-
erative Agreement W911NF-09-2-0053. The views and conclusions contained in this document
are those of the authors and should not be interpreted as representing the official policies, either
expressed or implied, of the NSF, ARL, or the US Government.
References
1. Buss DM (1989) Sex differences in human mate preferences: evolutionary hypotheses tested
in 37 cultures. Behav Brain Sci 12:1–49
2. Cai X, Bain M, Krzywicki A, Wobcke W, Kim YS, Compton P, Mahidadia A (2011) Collab-
orative filtering for people-to-people recommendation in social networks. Adv Artif Intell
476–485
3. Eagly AH, Wood W (1999) The origins of sex differences in human behavior: evolved dispo-
sitions versus social roles. Am Psychol 54:408–423
4. Eastwick PW, Finkel EJ (2008) Sex differences in mate preferences revisited: do people know
what they initially desire in a romantic partner? J Pers Soc Psychol 94:245–264
5. Finkel EJ, Eastwick PW, Karney BR, Reis HT, Sprecher S (2012) Online dating: a critical
analysis from the perspective of psychological science. Psychol Sci Public Interest 13:3–66
6. Fiore AT, Taylor LS, Zhong X, Mendelsohn GA, Cheshire C (2010) Who's right and who writes:
people, profiles, contacts, and replies in online dating. In: Proceedings of the Hawaii interna-
tional conference on system sciences
7. He Q, Zhang Z, Zhang J, Wang Z, Tu Y, Ji T, Yi T (2013) Potentials-attract or likes-attract in
human mate choice in China. PLoS ONE 8(4):e59457
8. Hitsch GJ, Hortaçsu A, Ariely D (2010) Matching and sorting in online dating. Am Econ Rev
100:130–163
9. Hitsch GJ, Hortaçsu A, Ariely D (2010) What makes you click? Mate preferences in online
dating. Quant Mark Econ 8:393–427
10. Li L, Li T (2010) MEET: a generalized framework for reciprocal recommender systems. In:
Proceedings of the ACM international conference on information and knowledge management
11. Lin K-H, Lundquist J Mate selection in cyberspace: the intersection of race, gender, and edu-
cation. Am J Sociol (forthcoming)
12. Luo S, Klohnen EC (2005) Assortative mating and marital quality in newlyweds: a couple-
centered approach. J Pers Soc Psychol 88:304–326
13. Match.com, Bailey CM (2010) Recent trends: online dating
14. OkTrends. http://www.okcupid.com
15. Online dating statistics. http://www.statisticbrain.com/online-dating-statistics/
16. Pizzato L, Rej T, Chung T, Koprinska I, Kay J (2010) RECON: a reciprocal recommender for
online dating. In: Proceedings of the ACM conference on recommender systems
17. Slater D (2013) Love in the time of algorithms. Penguin Group, New York
18. Tu K, Ribeiro B, Jensen D, Towsley D, Liu B, Jiang H, Wang X (2014) Online dating recom-
mendations: matching markets and learning preferences. In: Proceedings of the 5th interna-
tional workshop on social recommender systems, in conjunction with the 23rd international
world wide web conference
19. Xia P, Jiang H, Wang X, Chen C, Liu B (2014) Predicting user replying behavior on a large
online dating site. In: Proceedings of the 8th international AAAI conference on weblogs and
social media
20. Xia P, Ribeiro B, Chen C, Liu B, Towsley D (2013) A study of user behaviors on an online
dating site. In: Proceedings of the IEEE/ACM international conference on advances in social
networks analysis and mining
21. Zhao K, Wang X, Yu M, Gao B (2014) User recommendation in reciprocal and bipartite social
networks: a case study of online dating. In: Proceedings of intelligent systems. IEEE
Latent Tunnel Based Information Propagation
in Microblog Networks
1 Introduction
With the rapid growth of social network services and applications such as Facebook,
Twitter and Weibo, research on social networks and social media has become a hot
area. Microblogging services offer a real-time platform to update one's personal status
and share information with friends. Consequently, information propagates over a social
network through homophily [19] and word-of-mouth (WOM) [12]. One example is
social advertising [14], which utilizes users' relationships, interests and published
data to target social advertisements at potential users. Social advertising, as a kind of
recommendation system that shares information between friends, has begun to attract
attention in recent years [1]. Microblogs, also called microposts, allow users to
exchange small elements of content such as short sentences, individual images, or
video links. Microbloggers post about topics ranging from the simple, such as "what
I'm doing right now", to the thematic, such as sports cars. Commercial microblogs
also exist to promote web sites, services and products, and to promote collaboration
within an organization. The study in [21] shows that a significant 88 % of all marketers
indicated that their social media efforts have generated more exposure for their
businesses.
Fig. 1 Social network with simple links (left) and latent topics (right)
relevance of such communication when propagating the target message. The basic
assumption in information propagation is that a target message is more likely to be
forwarded or retweeted if it is interesting to both the sender and the recipient, and
an interested user is more likely to react to a message (e.g., buying the advertised
product). This assumption is consistent with previous studies that social influences
are associated with certain topics [22] and marked tags or labels are useful for social
interest discovery [15].
To illustrate the differences from influence maximization, Fig. 1 shows a social
network with link structure (left) and the network with links representing commu-
nication on certain topics (right), where each color represents a topic and the width
of a link represents the intensity of the topic. Suppose that we want to propagate a
target message on the topic corresponding to the yellow color; then user 3 is more likely
to be the best seed user to start the propagation because the message could reach
two other users, namely user 1 and user 4. However, if this problem is treated as the
traditional influence maximization, user 1 will be selected as the seed user because of
its maximum out-degree, despite the fact that user 1 will not forward the message due
to the lack of out-going communication on this topic. In this example, information
propagation depends not only on the link structure of the network, but also on the
nature of a link in terms of the topics of the information exchanged.
To our knowledge, propagation of content-rich messages in a microblog network
in a topic-aware manner has not been considered previously. The challenge is that
the topics for messages and links are not explicitly expressed in a real life microblog
network where only exchanged messages are observed. Manually labeling the topics
for all messages and links, even for a training set, is unrealistic because expensive
user involvement is required. The key to information propagation is to extract the
hidden topics from the observable published messages in a microblog network and
leverage them for identification of seed users.
222 C. Zhang et al.
1.2 Contributions
2 Related Work
One of the most robust findings in social networks is homophily [19] (i.e., love of the
same), the tendency of individuals to associate and bond with similar others. Driven
by homophily, users tend to share interesting messages from their friends, and such
messages spread from one person to another in the style of a biological epidemic. In
addition to the link structure present in all social networks, a microblog network has
its own characteristics, i.e., exchanges of abundant but short messages and inter-
personal activities such as mentions and retweets. Such messages and activities
convey important information about the users involved and play an important role
in analyzing information propagation.
Latent Tunnel Based Information Propagation in Microblog Networks 223
Probabilistic topic models such as LDA were introduced in [2]. [18] presented the
Author-Recipient-Topic (ART) model to learn distributions specific to author-
recipient pairs. [23] proposed a supervised learning approach to categorize links and
quantify the influence of web pages. Neither work considered information propagation.
The supervised learning approach requires a training data set in the form of a
link-labeled and link-weighted graph. Our work does not require such training data
because it works directly on the microblog messages published by users.
Topic modeling has been used to predict social influence between users. [22]
developed topical affinity propagation to model topic-level social influence based
on information of nodes, which was extended to heterogeneous networks in [16].
These methods assume a given topic distribution for each node and find all topic-
level influence networks G_z(V_z, E_z) for every topic z, where V_z is the subset of
nodes related to topic z and E_z is the set of pairwise weighted influence relations
over V_z. These works did not consider propagation of information, which is the focus
of our work. They assumed that the topic distribution is given for each user, whereas
we assume that the topic distribution is hidden in the messages exchanged between
users (thus, links). These works considered social influence for one topic at a time,
whereas we treat each link as communication involving a topic distribution instead
of a single topic. The works in [20, 25] proposed PageRank-based algorithms to find
influential topics in Twitter and citation networks. Again, these works did not consider
information propagation.
Influence maximization, proposed in [11], aims to identify a set of seed users who
could influence the largest number of other users in a social network. Two popular
influence propagation models are the Independent Cascade Model and the Linear
Threshold Model. These models assume influence probabilities based on simple
heuristics, such as a uniform probability or a probability proportional to the degree
of a node. Moreover, this formulation neither has a target message nor considers the
topics of a link. Most previous works focused on improving the efficiency of greedy
algorithms [3, 7, 13, 17], such as the CELF optimization based on the submodularity
of incremental influence [7, 13].
Table 1 Notation
Symbol | Definition
G(V, E) | A microblog network graph
T | Number of topics
W | Number of words in the dictionary
M | Number of microblogs
N | Number of word occurrences in all microblogs
k | Number of seed users to be selected
m | Index of a message
e | Index of a social link, e ∈ {1, …, |E|}
θ_e | T-dimensional topic distribution for a social link e: {θ_e(1), …, θ_e(T)}, where
Σ_j θ_e(j) = 1
φ_j | W-dimensional word distribution for a topic j: {φ_j(1), …, φ_j(W)}, where
Σ_w φ_j(w) = 1
w | N-dimensional vector of the word occurrences in all microblogs, where
w_i ∈ [1, …, W] and i ∈ [1, …, N]
z | N-dimensional vector of topic indicators for all word occurrences in all
microblogs, where z_i ∈ [1, …, T] indicates the topic of the word occurrence
w_i in w
P_e | Propagation probability for a social link e
Our work is closely related to the work on inferring the influence probability of
a link. [5] used an action log to infer the influence probability of a link, where an
action refers to a pre-determined activity such as joining a group. When such actions
are not explicitly captured in the social network, acquiring the action log requires
assistance from external information sources. Our work does not require such action
logs. The works in [4, 6] used time decay to infer influence probabilities.
Our work can be considered as a new way of estimating propagation probability
by taking into account the topics of the microblog messages readily available in a
microblogging service.
In summary, our work differs from previous works in two major aspects: we model
a social network as a network of content oriented communication, and we propose a
topic-aware estimation for propagation probability for such networks (Table 1).
The first step of our method is to extract the latent topics on social links in a microblog
network. We discuss first the extraction of microblog messages for a link and then
topic modeling for such messages. The outcome is the topic distribution for each link
and the word distribution for each topic. These distributions are used to estimate the
propagation probability of a link for the target message in the next section.
We apply latent Dirichlet allocation (LDA) [2] to the collection of microblog messages.
Two issues must be resolved. The first is that each microblog message is very short
(up to 140 characters) and thus sparse for topic mining. The second issue is that each social link may represent
zero or more messages; dealing with each message individually leads to multiple
topic distributions for each link, which is not only noisy due to the word sparsity
of each message but also unrelated to each other due to the separate topic spaces.
To address both issues, we model the set of messages associated with a link as one
aggregated message by taking the union of these messages, and apply LDA to these
messages plus all broadcast messages. Notice that there is only one topic modeling
for the entire corpus, not one topic modeling per link.
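The aggregation-then-LDA step described above can be sketched as follows; the library choice (scikit-learn) and the toy per-link messages are our assumptions, not the chapter's setup:

```python
# Sketch: aggregate each link's messages into one document, then fit a
# single LDA model over all aggregated documents (one corpus-wide model,
# not one model per link).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical per-link message sets: link (sender, receiver) -> messages.
link_messages = {
    ("u1", "u2"): ["great hockey game tonight", "canucks win again"],
    ("u1", "u3"): ["new phone camera review", "phone battery tips"],
    ("u3", "u4"): ["hockey playoffs schedule", "game seven tonight"],
}
links = list(link_messages)
docs = [" ".join(link_messages[e]) for e in links]  # one document per link

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                       # per-link topic mixtures
theta = theta / theta.sum(axis=1, keepdims=True)   # ensure rows sum to 1
print(theta.shape)  # one T-dimensional distribution per link: (3, 2)
```

Each row of `theta` plays the role of θ_e for the corresponding link, and `lda.components_` (row-normalized) plays the role of the per-topic word distributions φ_j.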
LDA is a generative model that allows sets of observations to be explained by
unobserved variables that explain why some parts of the data are similar. In our case,
observations are words collected into messages and unobserved variables are the
per-message topic distribution and the per-topic word distribution. Each message is
a mixture of a small number of latent topics, and each word's creation is attributable
to one of the message's topics. Supposing that we have T topics, we can write the
probability of the ith word in a given message as

P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)    (1)
P(w) can be observed from the collection of messages, while P(w|z) and P(z) are
hidden and are the target of topic modeling. More formally, let W be the size of
the dictionary (i.e., the number of words) for messages, T be the number of latent
topics, θ_e be the T-dimensional topic distribution for a social link e, and φ_j be the
W-dimensional word distribution for a topic j. The generative process of LDA is as
follows:
1. choose φ_j ∼ Dir(β), where j ∈ [1, …, T]
2. choose θ_e ∼ Dir(α) for each message on the link e
3. for each word w_i that belongs to the link e
   a. choose a topic z_i ∼ Mul(θ_e)
   b. choose a word w_i ∼ Mul(φ_{z_i})
where β is the parameter of the Dirichlet prior on the per-topic word distributions and
α is the parameter of the Dirichlet prior on the per-message topic distributions.
Figure 2 expresses the generative model of topic mining on social links, assuming
that the topic structure, i.e., the topics z, the link-topic distribution θ, and the
topic-word distribution φ, is already known. α and β are hyperparameters specifying
the nature of the priors on θ and φ. However, the problem we face is the
reverse of this generative process: all words in messages w through social links e,
contactor factors C and relation factors R are observed, as indicated by the shaded
nodes in Fig. 2, while the topic structure (i.e., z, θ, φ) is hidden and must be estimated
from the observed variables.
3.3 Inference

Gibbs sampling [8] is widely used to infer the latent variables θ and φ. This
method sequentially samples each variable z_i from its conditional distribution
P(z_i | w, z_{−i}, α, β) given the current values of all other variables and the data,
where z_{−i} refers to the topic assignments of all other words before sampling word
w_i. This conditional distribution is derived as follows:

P(z_i | w, z_{−i}, α, β) = P(z, w | α, β) / P(z_{−i}, w | α, β)    (2)

where

P(z | α) = ∏_{e=1}^{|E|} [ Γ(Tα) ∏_j Γ(n_{e,j} + α) ] / [ Γ(α)^T Γ(n_{e,·} + Tα) ]    (5)
where n_{j,w} is the number of times word w has been assigned to topic j in sampling,
n_{e,j} is the number of times topic j has been assigned to e in sampling, n_{e,·} is the
number of times any topic has been assigned to e in sampling, and n_{j,·} is the number
of times any word has been assigned to topic j in sampling. Substituting these into the
equation for P(z_i | w, z_{−i}, α, β) and repeatedly conducting Gibbs sampling, we finally
obtain the topic distribution θ_e for a social link e and the word distribution φ_j for each
topic j, computed as follows:
θ_e(j) = (n_{e,j} + α) / (n_{e,·} + Tα)    (6)

φ_j(w) = (n_{j,w} + β) / (n_{j,·} + Wβ)    (7)
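Given the Gibbs-sampling count matrices, Eqs. (6) and (7) reduce to two elementwise normalizations. A minimal sketch with hypothetical counts and hyperparameters:

```python
import numpy as np

def estimate_distributions(n_ej, n_jw, alpha, beta):
    """theta[e, j] = (n_ej + alpha) / (n_e. + T*alpha)  -- Eq. (6)
       phi[j, w]   = (n_jw + beta)  / (n_j. + W*beta)   -- Eq. (7)"""
    T = n_ej.shape[1]
    W = n_jw.shape[1]
    theta = (n_ej + alpha) / (n_ej.sum(axis=1, keepdims=True) + T * alpha)
    phi = (n_jw + beta) / (n_jw.sum(axis=1, keepdims=True) + W * beta)
    return theta, phi

# Hypothetical counts: 2 links x 3 topics, and 3 topics x 4 words.
n_ej = np.array([[5, 1, 0], [0, 2, 4]])
n_jw = np.array([[3, 1, 0, 1], [0, 2, 2, 0], [1, 0, 0, 3]])
theta, phi = estimate_distributions(n_ej, n_jw, alpha=0.1, beta=0.01)
print(theta[0])  # each row of theta (and of phi) sums to 1
```

The smoothing terms α and β guarantee that every topic (and every word) keeps a small nonzero probability even when its count is zero.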
The topic distributions for links and for the target message used in the next section
are summarized as follows:
Topic distribution for links: the topic distribution θ_e is learnt by Eq. (6). A high
value of θ_e(j) for a topic j indicates the existence of a tunnel for communication on
topic j through the link e.
Topic distribution for the target message: for the target message m, the topic
distribution θ_m is computed as follows. For each word w_i occurring in m, we
determine the most likely topic for w_i, i.e., the topic j such that φ_j(w_i) is maximal,
and consider this as one vote for the topic j. Let v_{m,j} denote the total number of
votes for the topic j and let θ_{m,j} = v_{m,j} / Σ_{i=1}^{T} v_{m,i}, 1 ≤ j ≤ T. The topic
distribution of m is defined by θ_m = {θ_{m,1}, …, θ_{m,T}}.
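The vote-based construction of the target message's topic distribution can be sketched as follows (the vocabulary and word distributions φ are hypothetical):

```python
def target_topic_distribution(message_words, phi, vocab):
    """Vote-based topic distribution for the target message m: each word
    votes for its most likely topic j = argmax_j phi[j][w], and the votes
    are normalized into theta_m."""
    T = len(phi)
    votes = [0] * T
    for word in message_words:
        w = vocab[word]
        best = max(range(T), key=lambda j: phi[j][w])
        votes[best] += 1
    total = sum(votes)
    return [v / total for v in votes]

# Hypothetical 2-topic, 3-word model.
vocab = {"hockey": 0, "game": 1, "phone": 2}
phi = [[0.6, 0.3, 0.1],   # topic 0: sports-leaning
       [0.1, 0.2, 0.7]]   # topic 1: gadgets-leaning
# "hockey" and "game" vote for topic 0, "phone" for topic 1.
print(target_topic_distribution(["hockey", "game", "phone"], phi, vocab))
```

The result is a T-dimensional distribution θ_m that can be compared directly with any link distribution θ_e.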
A first-cut solution is to ignore all published messages and the target message and
select seed users solely based on the topological structure of the microblog network.
This is exactly the traditional influence maximization problem, for which a general
greedy algorithm exists. This algorithm takes the graph structure of the microblog
network and the number k as input and returns k seed users as output. Algorithm 1
below, GeneralGreedy, is an implementation of this algorithm from [3, 11]. It uses
two internal parameters: the propagation probability P_e for each link e, and the Monte
Carlo random process of propagation starting from a set of users S, MC(S, P). MC(S, P)
returns the estimated number of users reached by those in S. At each iteration, the
algorithm greedily selects the next seed user v such that MC(S ∪ {v}, P) is maximized.

Algorithm 1: GeneralGreedy(G, k)
1 uniformly set the propagation probability P_e for all social links;
2 P = ∪_{e=1}^{|E|} {P_e};
3 initialize S = ∅;
4 for i = 1 to k do
5   select v = arg max_{u ∈ V∖S} MC(S ∪ {u}, P);
6   S = S ∪ {v};
7 end
8 return S
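A runnable sketch of this greedy selection, using a Monte Carlo independent-cascade simulation as MC(S, P); the graph, link probabilities, and run count are illustrative assumptions, not the chapter's experimental setup:

```python
import random

def mc_spread(graph, P, seeds, runs=200, rng=random.Random(0)):
    """Monte Carlo estimate of the number of users reached from seeds
    under an independent-cascade process with per-link probability P."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):
                if v not in active and rng.random() < P[(u, v)]:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / runs

def general_greedy(graph, nodes, P, k):
    """Greedily pick k seed users maximizing the estimated spread."""
    S = set()
    for _ in range(k):
        v = max((u for u in nodes if u not in S),
                key=lambda u: mc_spread(graph, P, S | {u}))
        S.add(v)
    return S

graph = {"u1": ["u2", "u3", "u4"], "u3": ["u4"]}
nodes = ["u1", "u2", "u3", "u4"]
P = {(u, v): 0.5 for u, vs in graph.items() for v in vs}
print(general_greedy(graph, nodes, P, k=1))  # u1 has the largest spread
```

The topic-aware variants below keep exactly this selection loop and only replace the uniform P_e with a message-dependent P_e(m).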
Not surprisingly, the GeneralGreedy algorithm does not perform well for
information propagation because it uses a uniform propagation probability P_e for
all links e, which ignores the topic relevance of a link to the target message. Next,
we present two topic-aware algorithms that take this topic relevance into account to
infer the propagation probability P_e. These algorithms differ in how they quantify
the topic relevance of a link.
Let m denote the target message, e denote a link, and P_e denote the propagation
probability of m through the link e. Recall that θ_e denotes the topic distribution of
e and θ_m denotes the topic distribution of m. To determine P_e for a link e, we need
to determine which topics of e are relevant to m. One way is to cut off insignificant
topics in the topic distribution θ_e by a threshold, but this is not robust because it is
difficult to know the proper threshold, which could vary from link to link. Our first
topic-aware algorithm deals with this issue by classifying the topics for e into tunneled
topics and blocked topics. The former refers to topics that have large probabilities in
θ_e, allowing information flow on those topics, whereas the latter refers to topics with
insufficient probability to allow information flow. The intrinsic motivation for this
classification is the observation that the distribution θ_e usually consists of a small
number of major topics that have much higher probabilities than the other topics.
Topics with high probability form the tunnels for information transfer, while topics
with low probability remain blocked.
To identify these two groups of topics, we apply the 2-means clustering method to
the T data points represented by θ_e, where the jth point represents the probability of
topic j. The result is one cluster of tunneled topics and one cluster of blocked topics,
represented by the indicator vector I_e:
I_e(j) = 1 if j is a tunneled topic, and I_e(j) = 0 if j is a blocked topic    (8)

P_e(m) = Σ_{j=1}^{T} θ_{m,j} θ_e(j) I_e(j)    (9)
In words, P_e(m) is the inner product of the topic distribution of m and the topic
distribution of e, except that only the topics j with I_e(j) = 1 have an effect.
The seed user selection based on the above propagation probability, called filtered
tunnel algorithm and denoted FilteredTunnel, is given in Algorithm 2. It takes a
microblog network G, a positive number k, and the target message m as the input,
and returns a set of k seed users as the output. The algorithm is an adaptation of
the GeneralGreedy algorithm but uses the propagation probability Pe (m) defined
in Eq. (9).
Algorithm 2: FilteredTunnel(G, k, m)
1 foreach social link e do
2   compute P_e(m) as in Eq. (9);
3 end
4 P = ∪_{e=1}^{|E|} {P_e(m)};
5 initialize S = ∅;
6 for i = 1 to k do
7   select v = arg max_{u ∈ V∖S} MC(S ∪ {u}, P);
8   S = S ∪ {v};
9 end
10 return S
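The link-level probability of Eqs. (8) and (9) (and its unfiltered variant, Eq. (10)) can be sketched together: a simple 1-D 2-means split picks the tunneled topics, and the propagation probability is the (optionally masked) inner product. The topic distributions below are hypothetical:

```python
def two_means_tunneled(theta_e, iters=20):
    """Split the T topic probabilities of a link into tunneled (high) and
    blocked (low) clusters via 1-D 2-means; returns the indicator I_e."""
    lo, hi = min(theta_e), max(theta_e)
    for _ in range(iters):
        assign = [1 if abs(p - hi) < abs(p - lo) else 0 for p in theta_e]
        highs = [p for p, a in zip(theta_e, assign) if a]
        lows = [p for p, a in zip(theta_e, assign) if not a]
        if highs:
            hi = sum(highs) / len(highs)
        if lows:
            lo = sum(lows) / len(lows)
    return assign

def propagation_probability(theta_m, theta_e, filtered=True):
    """Eq. (9) when filtered, Eq. (10) otherwise: inner product of the
    message and link topic distributions, masked by tunneled topics."""
    I_e = two_means_tunneled(theta_e) if filtered else [1] * len(theta_e)
    return sum(tm * te * i for tm, te, i in zip(theta_m, theta_e, I_e))

theta_e = [0.50, 0.35, 0.05, 0.05, 0.05]   # two dominant (tunneled) topics
theta_m = [0.70, 0.10, 0.10, 0.05, 0.05]
print(propagation_probability(theta_m, theta_e))         # Eq. (9)
print(propagation_probability(theta_m, theta_e, False))  # Eq. (10)
```

On this example the 2-means step marks the first two topics as tunneled, so the filtered probability keeps only their contribution while the unfiltered one also accumulates the small cross-terms.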
The filtered tunnel algorithm adopts an all-or-nothing strategy for each topic on a
link in order to focus on major topics. Sometimes two users communicate on a broad
range of topics where there is no clear cut between tunneled topics and blocked
topics. In such cases, a target message covering many topics could still be exchanged
by the users. This situation calls for the second approach, the unfiltered tunnel
algorithm, where topics with small probabilities are also considered for topic
relevance. We can model this approach conveniently by making all topics tunneled
topics, that is, I_e(j) = 1 for every topic j in Eq. (9), so that the propagation probability P_e(m) in
Eq. (9) degenerates into the usual inner product of the topic distribution θ_m of the
target message m and the topic distribution θ_e of the link e:

P_e(m) = Σ_{j=1}^{T} θ_{m,j} θ_e(j)    (10)
There are two cases in which the propagation probability P_e(m) is large: either θ_m
and θ_e have high probabilities on a few common topics, or θ_m and θ_e have small
probabilities on many common topics. The former corresponds to communication on
focused topics and the latter to communication on diversified topics.
With P_e(m) defined by Eq. (10), the unfiltered tunnel algorithm remains the same
as Algorithm 2.
5 Experimental Evaluation
The ideal way of evaluating the propagation of a target message is placing the message
to the selected seed users in a live microblogging service and tracing the propagation
of the message. Unfortunately, this kind of evaluation requires full control over the
microblogging service, which is possible only for the owner of a microblogging ser-
vice. Without such full control over a microblogging service, we resort to publicly
available Twitter microblog dataset1 to approximate this evaluation. The dataset has over 9 million microblogs covering domains such as news, music, entertainment, technology, and the web. We performed the following preprocessing: we removed all users who have no social links, together with their broadcast messages, because such users do not contribute to information flow; for the remaining users, we took a random sample of their messages, because topic modeling does not need all the data and running it on the whole collection is too slow; and we removed stop words and URLs from all messages. The final dataset contains 323,481 messages (10 % broadcast, 66 % conversation, and 24 % retweet), 10,892 Twitter users, and 63,454 links corresponding to followee/follower relations.
1 http://user.informatik.uni-goettingen.de/~txu/cuckoo/dataset.html.
232 C. Zhang et al.
In the first experiment, the hit_ratio is the fraction of seed users who published or forwarded the given target message according to the dataset. While hit_ratio does measure the users who propagated the given target message, it does not consider the possibility of propagating any other messages, even if such messages are similar to the target message. This exact-syntax-based measure could be too stringent, because a message is often propagated for its content, not its exact syntax. For example, if a user forwards the message "@B, Does anyone know Canucks' standing", the user will likely also forward the message "@B, Is Canucks in first or second place" if that message is presented instead. But the syntax-based measure does not capture this flexibility.
In the second experiment, we relax the exact-syntax requirement and consider two messages to be equivalent (with respect to propagation) if they are similar in topics. For two messages m1 and m2 with topic distributions θ_{m_i} = {θ_{m_i,1}, ..., θ_{m_i,T}}, i = 1, 2, the topic equivalence of m1 and m2 is defined as

sim(m1, m2) = Σ_{j=1}^{T} θ_{m1,j} θ_{m2,j}   (11)

m1 and m2 are topic equivalent if sim(m1, m2) > ε for some specified threshold ε.
We consider the following three metrics based on topic equivalence. The publish_ratio is the fraction of the messages published by the seed users that are topic equivalent to the target message:

publish_ratio = a / b   (12)

where a is the number of topic-relevant messages published by seed users, and b is the number of all messages published by seed users. A higher publish_ratio means that the seed users are more likely to publish the given target message.
The spread_ratio is the fraction of forwardings (i.e., retweet messages) originated from the seed users that are topic equivalent to the target message:

spread_ratio = c / d   (13)

where c is the number of forwardings of the topic-relevant messages published by seed users, and d is the number of forwardings of all messages published by seed users. A higher spread_ratio means that the target message is more likely to be propagated if it is published by a seed user.
The reach_num is the number of users reached through such forwardings. A larger value in these metrics means that a topic-equivalent message is more likely to be published and propagated by the selected seed users. We randomly picked 100 target messages for this experiment.
We evaluated three algorithms for influence maximization: the topic-blind GeneralGreedy baseline (GG) and the proposed filtered tunnel (FT) and unfiltered tunnel (UT) algorithms.
The first step in FT and UT is to model the topic distribution using conversation and retweet messages on social links, as described in Sect. 3. Table 2 shows 6 of the 50 topics extracted. Each topic has a distribution of keywords (the top five keywords are shown), which explains the topic's latent semantics. For example, topic 19 is about music and online media; topic 32 is about microblog networks; topic 39 is about movies and TV series; topic 43 is about time-related events; topic 46 is about games; and topic 50 is about Apple's products and other web services. We find the learnt topics meaningful and use their distributions to estimate the propagation probabilities.
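Reading off the top keywords of a learnt topic, as in Table 2, amounts to sorting the topic's word distribution. The probabilities below are invented, with words borrowed from the topic-50 vocabulary discussed later:

```python
def top_keywords(topic_word_probs, n=5):
    """Top-n keywords of a topic, i.e. the words with the highest probability
    under the topic's word distribution, as displayed in Table 2."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: -kv[1])
    return [word for word, _ in ranked[:n]]

# Invented word probabilities for a topic-50-like topic.
topic = {"apple": 0.08, "iphone": 0.06, "google": 0.05,
         "web": 0.04, "app": 0.03, "the": 0.01}
```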
Table 3 shows the hit_ratio of GG, FT and UT (averaged over all target messages).
Understandably, hit_ratio is rather low for all algorithms because only the users who
published the exact target message are considered in this metric. Despite this, there
is a notable difference among the three algorithms. GG is very sensitive to the setting
of the propagation probability Pe . In fact, GG always selects the same set of seed
2 http://www.cs.ubc.ca/~goyal/code-release.php.
users for any target message because it considers only the network structure, not
the content of messages. For the small propagation probability Pe = 0.01, GG0.01 tends to select central users in a dense community as seed users; such users usually have a higher degree and thus are more likely to publish the target message. This explains
the higher hit_ratio. As the propagation probability increases, GG tends to select
seed users who bridge different communities because of the increased reachability, but such users are actually less influential because their number of forwardings is very low. In contrast, UT and FT are able to select seed users based on the topics of the
target message. Such users are likely to publish the target message. FT has a better
performance (i.e., a higher hit_ratio) than UT because of its focus on major topics.
See more discussions on this point below.
Figure 3 shows publish_ratio, spread_ratio and reach_num (from left to right) of GG,
FT and UT. The upper row is for k = 10 seed users and the bottom row is for k = 50
seed users. The three colors represent the three settings of the threshold for topic
equivalence in Eq. (11).
FT has significantly higher publish_ratio, spread_ratio, and reach_num than GG.
This improvement comes from a better selection of seed users by considering the
relevance of links to the target message. In particular, for a given target message,
FT considers not only the link connection, but also whether similar messages were
previously propagated through such links. As such, FT tends to select those users
who are likely to publish the target message (i.e., a high publish_ratio) and have a
network of users who are likely to forward such messages (i.e., a high spread_ratio).
Consequently, the target message can reach more users (i.e., a high reach_num). In
other words, the seed users selected by FT are influential in that their friends tend to forward their messages, and so do friends' friends.
For a closer examination, Fig. 4 shows the comparison of FT and GG0.01 at
the individual target message level for the case of k = 10 seed users. For each
target message, there is a point (x, y) where y represents the metric for FT and x
represents the metric for GG0.01. A point above the diagonal line y = x means that
FT outperforms GG0.01 by having a higher publish_ratio, a higher spread_ratio,
and a higher reach_num. For nearly all target messages considered, FT outperforms
GG0.01 through a higher value in all three metrics. This suggests that FT selects
more influential seed users than GG0.01. Another study, which is not shown here,
showed that FT outperforms UT in these metrics. One reason is that UT keeps many
minor topics that are insufficient to trigger publishing or forwarding of the target
message. This study suggests that the focus on major topics in FT is an effective
strategy.
We also studied the actual seed users selected. For discussion purposes, we consider the single-topic target message m1, containing the words of topic 50, and the mixed-topic target message m2, containing all of the words from topics 46 and 50. In general,
the seed users selected by GG are central in dense parts of the network but may not be
influential in the topics of the target message, in terms of the likelihood of published
messages being forwarded by others, whereas the seed users selected by FT are more
influential. The seed users selected by UT tend to be a mixture of those selected by
GG and those selected by FT because UT not only considers topic relevance but also
adds low propagation probability to each link.
Figure 5 shows the topic distribution of the messages published by seed users.
For m 1 (on the left), which has the topic 50, the messages published by the seed users
selected by FT have the highest probability for topic 50, followed by the messages of
the seed users selected by UT, followed by the messages published by the seed users
selected by GG0.01. For m2 (on the right), which is on topics 46 and 50, the messages published by the seed users selected by FT have higher probabilities in both of these topics than those of the seed users selected by UT and GG0.01.
Table 4 further shows the intersection sizes of the seed-user sets when propagating target message m1: the greedy algorithm with different settings shares more seed users across settings than our topic-aware methods do. This demonstrates that FT and UT tend to choose different seed users according to the content of the target message, whereas the greedy algorithm depends solely on the social structure.
Fig. 5 Topic distribution of published messages of seed users for m 1 (left) and m 2 (right). k = 10
Table 5 Keyword occurrences in the published messages of the top users selected by FT and GG

Propagating m1 as target message:
  U_GG: {google = 20, iphone = 1, app = 2, web = 4, apps = 4}, total 31
  U_FT: {google = 84, iphone = 47, app = 51, web = 89, apps = 16}, total 287

Propagating m2 as target message:
  U_GG: {google = 20, iphone = 1, app = 2, web = 4, apps = 4} ∪ {game = 21, team = 7, play = 15, football = 0, fan = 4}, total 78
  U_FT: {google = 84, iphone = 47, app = 51, web = 89, apps = 16} ∪ {game = 8, team = 22, play = 5, football = 1, fan = 2}, total 325
The following is a case study on the detailed statistics of the top user selected by GG0.01, GG0.02, and GG0.05 (user id 14703185, denoted UGG) and by FT (user id 9453872, denoted UFT) for propagating the target messages m1 and m2 from the previous section. Both users have published 3,200 messages; UGG has 173 followers while UFT has 138 followers. We use keyword occurrence (the number of times the keywords occur in the published messages) to measure whether a user is interested in the topic and influential enough to place a target message. As shown in Table 5, when propagating m1, the keyword occurrence of UFT is significantly higher than that of UGG; when propagating m2, although the two users' keyword occurrences for topic 46 are similar, UFT's occurrence for topic 50 is significantly higher than UGG's. Both results demonstrate that UFT is the more appropriate choice.
Next, we randomly pick five followers of each top user to check whether they were influenced and spread the target messages. We examine the propagation of m1. Their representative messages related to topic 50 are listed in Table 6. For UGG, only one user (id 33256817) has ever forwarded UGG's messages related to topic 50, while the other four have not forwarded such messages before (although out of these four users, those with ids 10877652 and 10355192 have forwarded 41 and 21 messages from UGG, respectively). For UFT, all five users have forwarded UFT's messages related to topic 50. This experiment shows that the target messages of the seed users
selected by FT are more likely to be spread than those selected by the general greedy algorithm.
We then conduct a case study on the group performance of all k = 10 seed users selected by the different methods. We evaluate the keyword occurrence averaged over the 10 seed users and report the results in Table 7. The keywords counted are the same as in Table 5. From the results, we find that the two topic-aware methods achieve significantly better performance than the greedy algorithm, further indicating that the seed users selected by FT and UT are more related to the target messages.
To summarize, our study suggests that the topic-aware FT and UT perform better than the traditional topic-blind GG for information propagation: they tend to select the right seed users, as demonstrated by a higher probability of the target message being published (i.e., a higher publish_ratio), a higher probability of its being forwarded (i.e., a higher spread_ratio), and more users being reached (i.e., a higher reach_num). The superiority of FT over UT suggests that taking all topics of messages into account does not necessarily yield better results; in fact, minor topics tend to mislead the selection of seed users. FT addresses this issue by focusing on major topics.
5.7 Runtime
Although our focus is on selecting more relevant seed users, the topic-aware selection
also helps reduce the running time of the selection process. For FT and UT, topic
modeling took about 2 min in our experiments. This step does not depend on the
choice of the target message and was performed only once for all target messages.
Figure 6 shows the running time (in logarithmic scale) for the selection of seed users.
GG is highly sensitive to the choice of the propagation probability Pe because a
larger probability means that GG will explore a larger part of the microblog network,
e.g., 700 min at Pe = 0.05 and more than 1,200 min at Pe = 0.1. This scale is
consistent with the study in [3, 7] in which GG0.1 took 2,439 min on a network
of 15 K nodes and 32 K unique edges. For the topic-aware UT and FT, the running
time is significantly reduced because propagation probability depends on the match
between the topics of a link and the topics of the target message; consequently, only
the links that are highly relevant to the target message are explored.
6 Conclusion
References
1. Bakshy E, Eckles D, Yan R, Rosenn I (2012) Social influence in social advertising: evidence from field experiments. In: Proceedings of the 13th ACM conference on electronic commerce (EC), pp 146–161
2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
3. Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In: KDD, pp 199–208
4. Gomez-Rodriguez M, Leskovec J, Krause A (2010) Inferring networks of diffusion and influence. In: KDD, pp 1019–1028
5. Goyal A, Bonchi F, Lakshmanan LVS (2011) A data-based approach to social influence maximization. Proc VLDB Endow 5(1):73–84
6. Goyal A, Bonchi F, Lakshmanan LVS (2010) Learning influence probabilities in social networks. In: WSDM, pp 241–250
7. Goyal A, Lu W, Lakshmanan LVS (2011) CELF++: optimizing the greedy algorithm for influence maximization in social networks. In: WWW (Companion Volume), pp 47–48
8. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci USA 101:5228–5235
9. Griffiths T, Steyvers M (2006) Probabilistic topic models. In: Latent semantic analysis: a road to meaning. Lawrence Erlbaum, Hillsdale
10. Kang JH, Lerman K, Plangprasopchok A (2010) Analyzing microblogs with affinity propagation. In: 1st workshop on social media analytics (SOMA), pp 67–70
11. Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: KDD, pp 137–146
12. Leskovec J, Adamic LA, Huberman BA (2006) The dynamics of viral marketing. In: EC '06: proceedings of the 7th ACM conference on electronic commerce, pp 228–237
13. Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen JM, Glance NS (2007) Cost-effective outbreak detection in networks. In: KDD, pp 420–429
14. Li Y, Shiu Y (2012) A diffusion mechanism for social advertising over microblogs. Decis Support Syst 54(1):9–22
15. Li X, Guo L, Zhao YE (2008) Tag-based social interest discovery. In: WWW, pp 675–684
16. Liu L, Tang J, Han J, Jiang M, Yang S (2010) Mining topic-level influence in heterogeneous networks. In: CIKM, pp 199–208
17. Mathioudakis M, Bonchi F, Castillo C, Gionis A, Ukkonen A (2011) Sparsification of influence networks. In: KDD, pp 529–537
18. McCallum A, Corrada-Emmanuel A, Wang X (2007) The author-recipient-topic model for topic and role discovery in social networks: experiments with Enron and academic email. J Artif Intell Res 30(1):249–272
19. McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Annu Rev Sociol 27(1):415–444
20. Nallapati R, McFarland D, Manning C (2011) TopicFlow model: unsupervised learning of topic-specific influences of hyperlinked documents. In: AISTATS, pp 543–551
21. Stelzner MA (2011) 2011 social media marketing industry report
22. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: KDD, pp 807–816
23. Tang J, Zhang J, Yu JX, Yang Z, Cai K, Ma R, Zhang L, Su Z (2009) Topic distributions over links on the web. In: ICDM, pp 1010–1015
24. Ugander J, Backstrom L, Marlow C, Kleinberg J (2012) Structural diversity in social contagion. Proc Natl Acad Sci 109(16):5962–5966
25. Weng J, Lim E, Jiang J, He Q (2010) TwitterRank: finding topic-sensitive influential twitterers. In: WSDM, pp 261–270
26. Yang J, Counts S (2010) Predicting the speed, scale, and range of information diffusion in Twitter. In: ICWSM
27. Zhang C, Sun J (2012) Large scale microblog mining using distributed MB-LDA. In: WWW (Companion Volume), pp 1035–1042
28. Zhang C, Sun J, Wang K (2013) Information propagation in microblog networks. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining, pp 190–196
Scaling Influence Maximization with Network
Abstractions
1 Introduction
2 Related Work
3 Method
Fig. 1 The flowchart for our algorithm, Hierarchical Influence Maximization (HIM)
Fig. 2 At each hierarchical level (Hi ) local neighborhoods are created and virtual nodes (black)
are generated. By using an optimization technique the influential nodes (red) are selected. Nodes
that have been selected at least once as an influential node are transferred to the next level of the
hierarchy. At the higher levels, the connection between selected nodes is defined using the shortest
path distance in the original network. The process is repeated until the final set of influential nodes
is smaller than the total advertising budget
The same process is repeated at the next hierarchy to select more influential nodes.
The procedure terminates at the last hierarchy when the number of influential nodes
finally is smaller than the advertising budget.
p_ij = u_ji / Threshold  if i ∈ A_R and j ∈ A_P;  p_ij = 0 otherwise   (1)

where the Threshold parameter is the total number of links that a Product agent can make with Regular agents. The bound on Threshold is a natural consequence of the limited advertising budgets of companies. The u_ji parameter is an indicator marking whether the Product agent is connected to the Regular agent.
At each interaction there is a chance for agents to influence each other and change
their desire vector for purchasing or consuming a product. During these interactions
the Product agents never change their attitude and maintain a fixed desire vector of 1 toward themselves and −1 toward the other advertising companies. The probability that agent i is susceptible to agent j is denoted η_ij and calculated as:
η_ij = e_ji / d_i^in  if i, j ∈ A_R;   η_ij = cte  if i ∈ A_R, j ∈ A_P   (2)
We use a generalized version of the ICM similar to [13, 21]. The dynamics of the model at each iteration k proceed as follows:
1. Agent i initiates the interaction according to a uniform probability distribution over all agents. Then agent i selects another agent among its neighbors with probability p_ij. Note that the desire dynamic can occur with probability (1/N)(p_ij + p_ji), as agent i's attitude can change whether it initiates the interaction or is selected by agent j.
2. Conditioned on the interaction of i and j:
With probability η_ij, agent i changes its desire:

X_i(k + 1) = η_ij M X_i(k) + (1 − η_ij) M X_j(k)
X_j(k + 1) = X_j(k)   (3)

Recall that M is the pre-defined matrix indicating the correlation between the demands of different products.
With probability (1 − η_ij), agent i is not influenced by the other agent:

X_i(k + 1) = X_i(k)
X_j(k + 1) = X_j(k)   (4)
It is worthwhile to note that the above interaction model degrades to the IC model if we set η_ij = 0 and M = I, and restrict the p_ij's to be equal to 1 right after the activation of any node and equal to 0 the rest of the time. Also, since the values of the desire vector range over [−1, 1], the x_ip's in [0, 1] and the x_ip's in [−1, 0) can be quantized to 1 and 0, respectively, to match the IC model's representation of activation and deactivation.
250 M. Maghami and G. Sukthankar
each node. The radius of the neighborhood, denoted by the parameter r, indicates the granularity of analysis. Based on the radius r, we partition the network into subsections E_i^H and update the probability matrices P_i and A_i for each subsection. HIM selects the influential agents in that local network E_i^H using an optimization technique and tags them for future use. The process of node selection is described in detail in Sect. 3.3.2. We then add these influential nodes to the set of influential nodes that have been identified in other neighborhoods in the same hierarchy.
When a local neighborhood is detached from the complete network, there exist
boundary nodes that are connected to nodes outside the neighborhood. These con-
nections that fall outside of the neighborhood can potentially affect the desire vector
of agents within the neighborhood. One possible approach is to ignore these effects
and only consider the nodes inside the partition. In this chapter we account for these
effects by allocating a virtual node to each boundary node. This virtual node is the
representative of all nodes outside the neighborhood that are connected to the boundary node. Figure 3 illustrates this abstraction of the outside-world effect and shows how the model's parameters are calculated between each boundary and virtual node.
The process of selecting influential nodes is repeated at each hierarchy and at each
local neighborhood surrounding node i. Following previous works [12, 13, 21], we model the desire dynamics of all agents as a Markov chain, where the state of the local neighborhood is the matrix of all existing agents' desire vectors at a particular iteration k, and the state transitions are calculated probabilistically from the pair-wise interactions
Fig. 3 The network on the left is an example of a neighborhood around node e; the network on the right is the equivalent network with virtual nodes representing the outside-world effect. Here w can be any interaction parameter, such as the link weight, p, or η. The direction of the interaction with the virtual node is based on the type of links the boundary node has with the nodes outside the neighborhood. The value of the parameter is the average over all similar types of interactions with the outside world
between agents connected in a network. The state of the local network around agent i at the kth iteration is a vector of random variables X_i(k) ∈ R^{N_i^H · P × 1} (created through a concatenation of N_i^H vectors of size P) and expressed as:

X_i(k) = [X_1(k)^T, . . . , X_{N_i^H}(k)^T]^T
We calculate the expected long-term desire of the agents in each local network
around agent i and this calculation results in the following formulation:
Fig. 4 The Q matrix is a block matrix of size N × N, where N is the total number of agents (R + P), and each block has size P × P. Matrices A and B are the non-zero parts of this matrix, representing the interactions among Regular agents and the interactions between Regular agents and Products, respectively
Moreover, X̄_R and X̄_P are vectors representing the expected long-term desire of the Regular agents and the Product agents, respectively, in the limit k → ∞. Note that the vector X̄_P is known, since the Product agents, the advertisers, are the immutable agents who never change their desire. Solving for X̄_R yields the vector of expected long-term desire for all Regular agents, for a given set of influence probabilities on a deterministic social network:

A X̄_R + B X̄_P = 0,  hence  X̄_R = −A^{−1} (B X̄_P)   (7)
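Equation (7) is an ordinary linear solve; a toy pure-Python sketch follows (the 2×2 matrices A and B and the advertiser desire are invented, and a real implementation would use a numerical linear-algebra library):

```python
def solve_long_term_desire(A, B, xP):
    """Solve A xR + B xP = 0 for xR (Eq. (7)), i.e. xR = -A^{-1} B xP, by
    Gaussian elimination with partial pivoting (toy sizes only)."""
    n = len(A)
    rhs = [-sum(B[r][c] * xP[c] for c in range(len(xP))) for r in range(n)]
    M = [row[:] + [rhs[r]] for r, row in enumerate(A)]   # augmented matrix
    for col in range(n):                                 # forward elimination
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                       # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

A = [[2.0, 0.0], [0.0, 4.0]]    # invented Regular-Regular interaction block
B = [[1.0], [2.0]]              # invented Regular-Product interaction block
xP = [1.0]                      # the advertiser's fixed desire
xR = solve_long_term_desire(A, B, xP)
```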
Thus, we can identify the influential nodes in the network and connect the products
to those agents in a way that maximizes the long-term desire of the agents in the social
system. We define the objective function as the maximization of the weighted average
of the expected long-term desire of all the Regular agents in the network toward all
the products as:

max_u Σ_{1 ≤ k ≤ P} Σ_{i ∈ A_R} ω_i X̄_{R,i}(k)   (8)
3.3.3 Convergence
Using the Brouwer fixed-point theorem [18], we prove that each local neighborhood
has a fixed-point, hence solving Eq. (5) at steady state is a valid choice. The theorem
states that:
Theorem 1 Every continuous function from a closed ball of a Euclidean space to
itself has a fixed point.
According to the calculation of Eq. (5), E[X_i(k + 1)] is a continuous function, as it is the sum of two continuous ones. Also, since X_i(k + 1) in Eq. (3) is a bounded function in [−1, 1], its expectation E[X_i(k + 1)] is bounded as well. As a result, we have a bounded, continuous function, which is guaranteed a fixed point
by the Brouwer fixed-point theorem. This allows us to solve our problem with the
proposed optimization algorithm to find the assignment of u ji s in a way to maximize
the long-term expected desire vector of agents toward all the products in the market.
When we proceed from one hierarchy to the next one, the selected nodes which
are propagated to the upper hierarchy are not necessarily adjacent. Therefore, we
need to define the interaction model between them based on their position in the
real network. The UpdateHierarchy function is responsible for building the proper
network connection and interaction model for the next hierarchy based on the selected influential nodes in the current hierarchy. These nodes were propagated to the higher hierarchy by being selected as influential nodes in at least one local neighborhood. It
is possible for a node to be present in multiple partitions and be selected more than
once.
Note that the selected nodes are unlikely to be adjacent nodes in the actual network
E. Therefore we need to find a way to form their connections to construct E H . To
do so, we look at the shortest path between these nodes in network E and use that to
calculate the weight of the edges in E H . In the E H network the weight of the link
between two selected nodes is the product of the weights of the shortest path between
these two nodes in the previous hierarchy. Also the probabilities of interaction and
influence between two influential nodes is set to be the product of the probabilities
along the shortest path between them.
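A sketch of this weight construction, assuming "shortest path" means fewest hops (the adjacency dict and weights are illustrative):

```python
from collections import deque

def hierarchy_weight(graph, w, src, dst):
    """Weight of the E^H link between two selected nodes: the product of edge
    weights along a shortest (fewest-hop) path in the current network."""
    parent = {src: None}
    q = deque([src])
    while q:                                  # BFS for a shortest path
        u = q.popleft()
        if u == dst:
            break
        for v in graph.get(u, []):
            if v not in parent:
                parent[v] = u
                q.append(v)
    if dst not in parent:
        return 0.0                            # unreachable: no link in E^H
    prod, node = 1.0, dst
    while parent[node] is not None:           # multiply weights back to src
        prod *= w[(parent[node], node)]
        node = parent[node]
    return prod

graph = {"a": ["b"], "b": ["c"], "c": []}
w = {("a", "b"): 0.5, ("b", "c"): 0.4}
```

The same path products would be applied to the interaction and influence probabilities between two influential nodes, as described above.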
The best assignment of Product agents to Regular agents is obtained through solving
the following optimization problem:
Here, we are looking for the set of u_ji's which minimizes our cost or, in other words, maximizes the desire value of the agents. Since the u_ji's indicate the existence or absence of a connection between Product and Regular agents, they are binary variables and can be identified using mixed integer programming. To solve our optimization problem,
we used the GNU Linear Programming Kit (GLPK) package, which is designed for
solving large-scale linear programming (LP) and mixed integer programming (MIP)
problems. GLPK is a set of routines written in ANSI C and organized in the form of
a callable library which is free to download from http://www.gnu.org/software/glpk.
4 Evaluation
4.2 Benchmarks
We compare our method against the following benchmarks, including centrality measures commonly used in social network analysis for identifying influential nodes based on network structure [14].
OIM: The Optimized Influence Maximization method finds the influential nodes
globally using our optimization method on the original network.
Degree: Assuming that high-degree nodes are influential nodes in the network,
we calculated the probability of advertising to a Regular agent based on the out-
degree of the agents and linked the Product agents according to a preferential
attachment model. Therefore, nodes with higher degree had an increased chance
of being selected as an advertising target.
Betweenness: This centrality metric measures the number of times a node appears
on the geodesics connecting all the other nodes in the network. Nodes with the
highest value of betweenness had the greatest chance of being selected as an
influential node.
PageRank: On the assumption that the nodes with the greatest PageRank score
have a higher chance of influencing the other nodes, we based the probability of
node selection on its PageRank value.
Random: In this baseline, we simply select the nodes uniformly at random.
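The centrality baselines can be sketched as follows. This simplifies the Degree and PageRank baselines to deterministic top-k selection rather than the probabilistic, preferential-attachment-style selection described above, and the PageRank implementation is a plain power iteration:

```python
import random

def pagerank(graph, d=0.85, iters=50):
    """Plain power-iteration PageRank over an adjacency dict."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = graph.get(u, [])
            if out:
                for v in out:
                    nxt[v] += d * pr[u] / len(out)
            else:                              # dangling node: spread uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr

def pick_seeds(graph, k, method, rng=random.Random(0)):
    """Degree, PageRank, and Random baselines (deterministic top-k variant)."""
    nodes = sorted(set(graph) | {v for vs in graph.values() for v in vs})
    if method == "random":
        return rng.sample(nodes, k)
    score = ({u: len(graph.get(u, [])) for u in nodes} if method == "degree"
             else pagerank(graph))
    return sorted(nodes, key=lambda u: -score[u])[:k]

graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": []}
```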
To evaluate these methods, we started the simulation with the initial desire vector set to 0 for all agents and simulated 60,000 iterations of agent interactions. The entire process of interaction and influence is governed by Eqs. (3) and (4) (Sect. 3.2). At each iteration, we calculated the average of the expected desire values of the agents toward all products. This average is calculated over 100 runs (10 simulations on each of 10 different network structures) for the synthetic dataset and 100 runs on the real-world datasets. Note that the desire vector of each Product agent remains fixed for all products; in our simulation it was set to 1 for the product itself and −0.1 for all other products (e.g., [1, −0.1, −0.1, . . . , −0.1] for product 1).
For the synthetic dataset, we used the same network generation technique described in [21] for generating customer networks. To compare the performance of these methods, the average expected desire value of the agents in a network with 150 agents is shown over time in Fig. 5. We selected 150 agents as a suitable network size for comparing all the algorithms together: with fewer agents, having ten simultaneously marketed products saturates the network, while with a larger number of agents OIM suffers from scalability issues.
Fig. 5 The average of agents' expected desire versus the number of iterations, calculated across all products and over 100 runs (10 runs on each of 10 different networks). The optimization methods have the highest average in comparison to the centrality measurement heuristics. As HIM is a sub-optimal method, it is unsurprising that its performance is worse than the global optimization method, OIM
The seed nodes selected by the optimization methods influenced the agents the most, resulting in the largest number of sales. Although HIM sacrificed some performance in favor of scalability, it clearly outperforms the centrality measurement methods. The locally optimal selection approach of HIM results in slightly lower performance compared to the globally optimal OIM.
Figure 6 shows the final average value of the expected desire of agents in the last iteration for different numbers of Regular agents. Although OIM with global optimization achieves the best performance, it is unable to scale to networks with 300 agents and more.
Fig. 6 The average of the final expected desire vectors for different numbers of Regular agents and
10 Product agents. The optimization-based methods (OIM and HIM) outperform the other methods
in selecting the seed nodes. While OIM is more successful than HIM in selecting the influential
nodes, it is unable to scale up to networks with 300 agents and higher
258 M. Maghami and G. Sukthankar
4.3.2 Run-Time
Table 3 shows a runtime comparison between the two optimization methods, HIM
(proposed) and OIM (original). In small networks the runtime of the global optimization
method is lower than that of the hierarchical method, but as the size of the network
grows, its runtime increases exponentially while the runtime of HIM grows at a slower
rate. The long runtime of OIM on networks larger than 200 nodes makes the
algorithm impractical for finding influential nodes in very large networks.
[Figure 7 heatmap: pairwise Jaccard similarity between the node sets selected by Random, Degree, Betweenness, PageRank, HIM, and OIM. Surviving rows: Random — 1.00, 0.01, 0.01, 0.02, 0.01, 0.01; Degree — 0.01, 1.00, 0.03, 0.10, 0.05, 0.03]
Fig. 7 The average Jaccard similarity measurements between different methods, calculated over
100 runs (10 runs on 10 different networks). Lighter squares denote greater similarity between a
pair of algorithms. Note that HIM's selection of nodes is fairly close to OIM's optimal selection
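The similarity measurements behind Fig. 7 can be reproduced in a few lines (a sketch; the `jaccard` and `similarity_matrix` names are ours, and the seed sets are assumed to be plain collections of node identifiers):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two seed-node sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_matrix(seed_sets):
    """Pairwise Jaccard similarity between the seed sets chosen by each method,
    keyed by (method, method) pairs."""
    names = list(seed_sets)
    return {(m, n): jaccard(seed_sets[m], seed_sets[n])
            for m in names for n in names}
```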
WikiVote The network contains all the Wikipedia voting data from the inception
of Wikipedia until January 2008. Nodes in the network represent Wikipedia users,
and a directed edge from node i to node j indicates that user i voted on user j.
SlashDot Slashdot is a technology-related news website known for its user community.
The website features user-submitted, technology-oriented news. In 2002 Slashdot
introduced the Slashdot Zoo feature, which allows users to tag each other as friends
or foes. This network contains the friend/foe links between Slashdot users, obtained
in February 2009.
Epinions This is a network extracted from the consumer review site Epinions.com.
Nodes are members of the site who have reviewed products. A directed edge from
i to j indicates that j trusts i's reviews (and thus i has influence over j).
In all the experiments on real-world social media, we preprocessed the networks
to eliminate isolated nodes and boundary nodes (nodes with a degree of one).
Table 4a, b summarizes the statistics of these real-world networks before and after
the preprocessing stage, respectively. We used the same experimental parameters
(presented in Sect. 4.1); the only differences are the number of products and the
advertising budget, which are set to 10 and 50, respectively.
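A minimal sketch of this preprocessing step, assuming the network is stored as an adjacency dictionary and that a single pruning pass is intended (the chapter does not say whether pruning is repeated until no degree-one nodes remain):

```python
def prune_network(adj):
    """Drop isolated nodes (degree 0) and boundary nodes (degree 1) from an
    undirected graph given as {node: set(neighbors)}. Single pruning pass."""
    keep = {v for v, nbrs in adj.items() if len(nbrs) >= 2}
    return {v: {u for u in adj[v] if u in keep} for v in keep}
```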
We benchmarked our optimization methods against two state-of-the-art influence
maximization methods, Prefix-excluding Maximum Influence Arborescence (PMIA)
[25] and DegreeDiscount [9], in addition to the centrality measures.
PMIA This heuristic algorithm [25] examines the local neighborhood of each
node to find the influence pattern in each local arborescence in order to estimate the
influence propagation across the network. To our knowledge, the PMIA algorithm
is the best scalable solution to the influence maximization problem under the
Independent Cascade Model.
DegreeDiscount This heuristic algorithm, presented by Chen et al. [9], refines
the degree method by discounting the degree of a node whenever one of its neighbors
has already been selected as an influential node.
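For reference, the DegreeDiscount idea can be sketched as follows, using the discount formula from Chen et al. for the independent cascade model with a uniform propagation probability p (the adjacency-dictionary representation and the function name are ours):

```python
def degree_discount(adj, k, p=0.01):
    """DegreeDiscount heuristic (after Chen et al.): pick k seeds from an
    undirected graph {node: set(neighbors)}, discounting each node's degree
    as its neighbors get selected."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    dd = dict(deg)               # discounted degree of each node
    t = {v: 0 for v in adj}      # number of already-selected neighbors
    seeds = []
    for _ in range(min(k, len(adj))):
        u = max((v for v in adj if v not in seeds), key=lambda v: dd[v])
        seeds.append(u)
        for w in adj[u]:
            if w not in seeds:
                t[w] += 1
                # Discount formula for the uniform-probability IC model.
                dd[w] = deg[w] - 2 * t[w] - (deg[w] - t[w]) * t[w] * p
    return seeds
```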
Although using a hierarchical approach reduces the problem of dealing with huge
interaction matrices, it is still possible for network partitions to be quite large if they
are centered on a high-degree node that is connected to a large portion of the network.
In addition to creating huge interaction matrices, these nodes create star-shaped
subgraphs, which result in an infeasible solution for the optimization process. There
are a couple of solutions for dealing with these very high degree nodes: (1) ignore
them when partitioning the network, assuming that their high connectivity guarantees
that they will appear within the network neighborhoods of other nodes, or (2)
ignore some of the low-degree neighbors of the node. In the following experiments,
we adopted the first approach for dealing with these large partitions. Therefore, in
all networks we only centered partitions on nodes with a degree of less than 100.
Examining the average degree of nodes in all datasets presented in Table 4b shows
that this choice not only prevents huge matrices and star-shaped subgraphs but
still leaves a high percentage of nodes to process. The following results were
generated for the WikiVote and Epinion datasets.
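Option (1) above amounts to a simple degree filter when choosing partition centers; a sketch (the helper names and the neighborhood-based partition are hypothetical, not the chapter's exact procedure):

```python
def partition_centers(adj, max_degree=100):
    """Candidate partition centers: skip hub nodes with degree >= max_degree,
    relying on their high connectivity to place them inside other nodes'
    neighborhoods (option 1 in the text)."""
    return [v for v, nbrs in adj.items() if len(nbrs) < max_degree]

def ego_partition(adj, center):
    """The partition centered on a node: the node plus its neighbors."""
    return {center} | adj[center]
```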
Figure 8 gives the average expected desire value for all the agents over time for
300 K iterations of the simulated market. In this result, the OIM algorithm attains the
highest value, while the HIM algorithm follows it closely, approaching the performance
of the global optimization method. The performance of the DegreeDiscount heuristic,
PMIA, and PageRank are very close to one another, with no significant differences.
While our algorithms outperform the other benchmarks on the WikiVote dataset,
on the Epinion dataset the degree-based algorithms perform better. Figure 9 shows the
Scaling Influence Maximization with Network Abstractions 261
[Figure 8 plot: average desire value (×10^−3, y-axis, 0–3.5) versus iterations/50,000 (x-axis, 1–6) for HIM, PMIA, PageRank, Degree, Degree Discount, and OIM]
Fig. 8 The average of agents' expected desire versus the number of iterations for the WikiVote dataset,
calculated across all products over 100 runs. The dataset was preprocessed by eliminating isolated
and boundary nodes, yielding 2 K nodes, and the simulation was run for 300 K iterations. The
optimization methods have the highest averages compared to the rest of the benchmarks. As the
HIM algorithm is a sub-optimal method, its performance is below that of the global optimization method
[Figure 9 plot: average desire value (y-axis, 0–0.008) versus iterations/500,000 (x-axis, 1–6) for the Epinion dataset]
Fig. 9 The average of agents' expected desire versus the number of iterations for the Epinion dataset,
calculated across all products, over 100 runs. The dataset was preprocessed by eliminating isolated
and boundary nodes, yielding 20 K nodes, and the simulation was run for 300 K iterations.
HIM outperforms PMIA and PageRank, but is beaten by the degree-based algorithms, Degree and
DegreeDiscount. The OIM algorithm could not be run on this dataset, due to the size of the network
Fig. 10 The final expected desire value of the agents at the end of the simulation for the different
methods and datasets. The OIM algorithm could not be run on the Epinion dataset, due to the size
of the network
results for all the benchmarks and the HIM algorithm. Although HIM's performance
is better than that of PMIA and PageRank, it does not beat the degree-based algorithms,
Degree and DegreeDiscount.
Figure 10 summarizes the final expected desire value of the agents for the different
algorithms and datasets. The low value of the desire vector is a consequence
of having a small number of advertisers within huge networks: during influence
propagation, the agents' desire vectors are repeatedly multiplied by small
interaction and influence probabilities.
To understand the poor performance of HIM on the Epinion dataset, we examined the
network structure to see how the networks differ from one another. Table 5 shows
a quantile analysis of the node degree for the pre-processed datasets. Based on this
analysis we see that the WikiVote network is very small compared to the other
two datasets, yet the maximum degree within its lower quartiles is higher than in the
other networks. This indicates that the WikiVote network has a more uniform degree
distribution, in which node degree is not likely to be a highly discriminating feature
of influence propagation potential.
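A quantile analysis of this kind can be computed directly from the degree sequence; a small sketch using the Python standard library (the function name is ours, and since Table 5's exact quantile method is not specified, we use the library default):

```python
from statistics import quantiles

def degree_quartiles(adj):
    """Quartile cut points (Q1, median, Q3) of the node-degree distribution
    of a graph given as {node: set(neighbors)} — the kind of summary
    reported in Table 5."""
    degrees = sorted(len(nbrs) for nbrs in adj.values())
    return quantiles(degrees, n=4)
```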
This can be verified by looking at the degree distributions of the datasets (Figs. 11,
12, and 13). In the Epinion and SlashDot datasets we have a small number of nodes
Fig. 11 The degree histogram of the WikiVote dataset. The x-axis shows degree on a logarithmic
scale, and the curve shows the kernel density estimate. In this dataset the majority of nodes lie
in the middle range, with a degree between 50 and 100
Fig. 12 The degree histogram of the Epinion dataset. The x-axis shows degree on a logarithmic
scale, and the curve shows the kernel density estimate. In this dataset the network has a sparse
structure, with the majority of nodes possessing a degree of less than 10
Fig. 13 The degree histogram of the SlashDot dataset. The x-axis shows degree on a logarithmic
scale, and the curve shows the kernel density estimate. In this dataset, as in the Epinion
dataset, the network has a sparse structure, with the majority of nodes possessing a degree of less
than 10
with very high degrees, while most of the nodes in the network possess a degree of less
than 10. In these networks, a few nodes serve as hubs and are highly connected,
whereas the other nodes have few connections and, in the worst case, aren't even
connected to a high-degree node. Hence our heuristic of not centering the partitions
on high-degree nodes sabotages the performance of HIM's optimization procedure.
The degree-based algorithms, on the other hand, can effectively target these high-degree
nodes. In contrast, in networks such as WikiVote or the synthetic networks, where
the node degree is more uniform, HIM works well, as the nodes in the middle bins
are more numerous and better connected to the entire network. In this case, the
degree-based algorithms perform poorly since degree is not as discriminative.
[Figure 14 plot: average desire value (×10^−3, y-axis, 0–2) versus iterations/1,200,000 (x-axis, 0–14) for PMIA, PageRank, Degree, Degree Discount, and OIM on the reduced Epinion subgraph]
Fig. 14 The average of agents' expected desire versus the number of iterations for the Epinion dataset,
calculated across all products and over 10 different runs, for 300 K iterations. The dataset was
preprocessed by selecting the top 1 % of nodes by degree and building a subgraph based on the
shortest paths between these nodes, rendering the graph small enough to be processed directly with
OIM. OIM outperforms the degree-based methods
In future work, we would like to explore network partitions computed with community
detection algorithms for the first level of the hierarchy. Furthermore, working with
dynamic networks, in which agents can enter and leave the network, would be useful
for practical applications where the pool of customers is constantly changing.
An important potential extension of this work would be to generalize the market
simulation to explicitly model the adversarial effects between competing advertisers
as a Stackelberg competition, in which one advertiser places ads and subsequent
competitors have knowledge of the existing ad placement. In this chapter we assumed that
the probability of interaction and influence between two agents is small, compared
to the size of the network, which results in the agents sticking to a decision for a
reasonable period of time. However, if the network is smaller or the probability of
interaction increases, there can be large fluctuations in the agents' desire vectors.
Applying a parameter to the model that forces the agents to retain their decisions
for a minimum period, regardless of external interactions, would ameliorate this
issue [20]. A more general framework for modeling and simulating customer product
adoption within social networks would be of great practical importance; our model
represents initial steps toward this ambitious goal.
References
10. Chen W, Yuan Y, Zhang L (2010) Scalable influence maximization in social networks under the linear threshold model. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 88–97
11. Hartline J, Mirrokni V, Sundararajan M (2008) Optimal marketing strategies over social networks. In: Proceedings of the international conference on world wide web. ACM, pp 189–198
12. Hung B (2010) Optimization-based selection of influential agents in a rural Afghan social network. Master's thesis, Massachusetts Institute of Technology
13. Hung B, Kolitz S, Ozdaglar A (2011) Optimization-based influencing of village social networks in a counterinsurgency. In: Proceedings of the international conference on social computing, behavioral-cultural modeling and prediction, pp 10–17
14. Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 137–146
15. Kempe D, Kleinberg J, Tardos É (2005) Influential nodes in a diffusion model for social networks. In: Automata, languages and programming, pp 1127–1138
16. Kimura M, Saito K (2006) Tractable models for information diffusion in social networks. In: Knowledge discovery in databases (PKDD), pp 259–271
17. Kimura M, Saito K, Nakano R, Motoda H (2009) Finding influential nodes in a social network from information diffusion data. In: Social computing and behavioral modeling. Springer, New York, pp 1–8
18. Leborgne D (1982) Calcul différentiel et géométrie. Presses universitaires de France
19. Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective outbreak detection in networks. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 420–429
20. Liow L, Cheng S, Lau H (2012) Niche-seeking in influence maximization with adversary. In: Proceedings of the annual international conference on electronic commerce. ACM, pp 107–112
21. Maghami M, Sukthankar G (2010) Identifying influential agents for advertising in multi-agent markets. In: Proceedings of the international conference on autonomous agents and multiagent systems, pp 687–694
22. Maghami M, Sukthankar G (2013) Hierarchical influence maximization for advertising in multi-agent markets. In: Proceedings of the IEEE/ACM international conference on advances in social networks analysis and mining. Niagara Falls, Canada, pp 21–27
23. Pathak N, Banerjee A, Srivastava J (2010) A generalized linear threshold model for multiple cascades. In: International conference on data mining (ICDM), pp 965–970
24. Shakarian P, Paulo D (2012) Large social networks can be targeted for viral marketing with small seed sets. In: Proceedings of the IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 1–8
25. Wang C, Chen W, Wang Y (2012) Scalable influence maximization for independent cascade model in large-scale social networks. Data Min Knowl Discov, pp 1–32
26. Yang W, Dia J, Cheng H, Lin H (2006) Mining social networks for targeted advertising. In: Proceedings of the annual Hawaii international conference on system sciences. IEEE Computer Society
Glossary
Link prediction The problem of link prediction can be formally defined as follows: given a
disjoint node pair (x, y), predict whether the node pair has a relationship or, in the case
of dynamic interactions, will form one in the near future.
Network abstraction A network abstraction is a representation of the network in
which less important nodes are omitted from being explicitly represented. It can
be used to create a downsampled version of the network that is computationally
cheaper to browse.
Perspective community A set of participating actors and the temporal ties they
share for joint activities performed during a given time period.
Possibility theory A mathematical theory for dealing with certain types of uncer-
tainty.
Random graphs A graph is random if its edges are created according to a probability
distribution or by a random process.
Scale-free network A network whose degree distribution follows a power law, at
least asymptotically.
Social network analysis Use of graph network theory together with other methods
and techniques to analyze social networks.
Temporal dynamic model A temporal dynamic model of a social network is a more
realistic representation of the network development process in time, in which
temporal information is expressed.
Index

E
Eidenbenz, Stephan J., 27
Elite grouping, 119, 122, 140, 269
Email networks, 32, 37
Emergency management, 1, 22, 23

J
Japan tsunami, 1, 2, 4, 7, 13, 15, 16, 18, 19, 22, 23
Jiang, Hua, 193

K
Key players, 1
Keyword extraction, 77

L
Latent Dirichlet Allocation, 197, 222, 225
Link prediction, 166–169, 172, 176, 177, 179, 181, 186, 189, 190, 270
Liu, Benyuan, 193

M
Maghami, Mahsa, 243
Marketing, 243–245, 249, 255
Melancon, Guy, 89
Microblog networks, 222, 223
Missaoui, Rokia, 45
Modularity, 91, 123, 146, 152, 159
Multi-agent social simulations, 247

N
Natural language processing, 1, 2
Ndong, Joseph, 45
Network visualization, 27, 35

O
Online dating, 193–199, 201, 204, 210, 215, 216
Optimization, 37, 233, 244, 246, 250, 260, 265
Organizational hierarchies, 27, 28
Organization subdivisions, 27
Overlapping communities, 147, 166, 170, 190
Overlaying networks, 45

P
Perspective community, 49, 60
Possibility theory, 45, 54, 55
Power law model, 27, 28

R
Random graphs, 145, 146, 152, 153, 157, 162
Random walk, 6, 11, 168, 169, 175, 181, 183, 184
Recommendation, 193, 194, 196, 197, 204, 216
Renoust, Benjamin, 89
Ribeiro, Bruno, 193

S
Sarr, Idrissa, 45
Seed users, 219–223, 228, 230–233, 235, 236, 238, 239
Semantic model, 119, 121, 130, 132, 133, 139, 140
Semantic overlaps, 139, 141
Seridi-Bouchelaghem, Hassina, 119
Sims, Benjamin H., 27
Sinitsyn, Nikolai, 27
Social features, 167, 168, 171, 172, 178, 179, 190
Sukthankar, Gita, 165, 243
Sun, Jianling, 219

T
Tabatabaei, Seyed Amin, 71
Temporal dynamic network, 121, 270
Topic mining, 223, 226
Topic modeling, 222–225
Towsley, Don, 193
Trend analysis, 76
Tu, Kun, 193
Twitter, 1, 2, 4, 5, 7, 11, 13, 16–18, 22, 23, 48, 71–75, 85, 220, 223

U
User/actor attributes, 11
User/actor behavior analysis, 47

V
Viaud, Marie-Luce, 89

W
Wallace, William A., 1
Wang, Ke, 219
Wang, Xi, 165
Wang, Xiaodong, 193

X
Xia, Peng, 193

Y
Yulia, Tyshchuk, 1

Z
Zhang, Chenyi, 219

Springer International Publishing Switzerland 2014
R. Missaoui and I. Sarr (eds.), Social Network Analysis Community Detection and Evolution, Lecture Notes in Social Networks, DOI 10.1007/978-3-319-12188-8