(Lecture Notes in Social Networks) Rokia Missaoui, Idrissa Sarr (Eds.) - Social Network Analysis - Community Detection and Evolution-Springer International Publishing (2014) PDF

Lecture Notes in Social Networks
Rokia Missaoui
Idrissa Sarr
Editors
Social Network Analysis - Community Detection and Evolution
Series editors
Reda Alhajj, University of Calgary, Calgary, AB, Canada
Uwe Glässer, Simon Fraser University, Burnaby, BC, Canada
Advisory Board
Charu Aggarwal, IBM T.J. Watson Research Center, Hawthorne, NY, USA
Patricia L. Brantingham, Simon Fraser University, Burnaby, BC, Canada
Thilo Gross, University of Bristol, UK
Jiawei Han, University of Illinois at Urbana-Champaign, IL, USA
Huan Liu, Arizona State University, Tempe, AZ, USA
Raúl Manásevich, University of Chile, Santiago, Chile
Anthony J. Masys, Centre for Security Science, Ottawa, ON, Canada
Carlo Morselli, University of Montreal, QC, Canada
Rafael Wittek, University of Groningen, The Netherlands
Daniel Zeng, The University of Arizona, Tucson, AZ, USA
More information about this series at http://www.springer.com/series/8768
Editors
Rokia Missaoui
Département d'Informatique et d'Ingénierie
Université du Québec en Outaouais
Gatineau, QC
Canada

Idrissa Sarr
Département de Mathématiques et Informatique
Université Cheikh Anta Diop
Dakar
Senegal
ISSN 2190-5428 ISSN 2190-5436 (electronic)

ISBN 978-3-319-12187-1 ISBN 978-3-319-12188-8 (eBook)
DOI 10.1007/978-3-319-12188-8
Library of Congress Control Number: 2014956200
Springer Cham Heidelberg New York Dordrecht London

© Springer International Publishing Switzerland 2014
Chapter 2 was created within the capacity of an US governmental employment. US copyright protection
does not apply.
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media
(www.springer.com)
This book on social network analysis is
dedicated to our respective families who have
been our constant source of inspiration.
They instill in us the drive and the power
to face any challenge with enthusiasm
and good spirit. Without their countless
love and support, this project would
not have been made possible.
Rokia and Idrissa

Foreword
Creatures, including humans, animals, and insects, avoid living in isolation and tend to
form communities or societies. Although Ferdinand Tönnies distinguished between a
community and a society in 1887, we may roughly say that a community is a group of
individuals who agreed or asked to be together in order to achieve a certain task,
socialize, etc. Communities range from static and closed to dynamic and open. Some
communities are persistent, while others are volatile or ad hoc. Examples of communities
include families, friends, neighbors, schoolmates, employees working on a
project, etc. Even birds migrate in communities with specific leadership. Traditionally,
the establishment of communities was location indexed, i.e., it required the
presence of individuals in the same location. However, the development of
communication technology triggered a revolution in the way human communities are
established and dissolved. There is a visible rapid shift from physical to virtual
communities, i.e., from expecting individuals within a community to know and see
each other to accepting the ability of individuals to communicate as sufficient to form
a community. The latter trend allows communities to grow and shrink without real
control. However, not all individuals within a community are equal when it
comes to skills and influence. Thus, analyzing communities to identify and study key
individuals, information propagation, evolution, behavior, structure, etc. is essential
for knowledge discovery leading to informed decision-making. Such an analysis would
have been impossible without the rapid development of information technology and
computing, which allows researchers to build scalable solutions capable of handling
big data. In fact, when the study of social communities started as a
branch of sociology and anthropology, applications and discoveries remained limited,
mainly because researchers concentrated on the study of small communities.
Those communities remained small because of restrictions that were lifted once the
ability to communicate became the only requirement, which raised the need for the study
and analysis of large communities. In other words, earlier studies concentrated on
physical communities, whereas virtual communities now exist, evolve,
and dominate. Realizing the need to handle evolving communities, researchers
from various fields, including computer science, mathematics, statistics, physics, and
many other domains, joined efforts to develop new and more powerful techniques
capable of accomplishing various types of studies related to communities. A number
of new contributions and discoveries are described well in this volume, titled "Social
Network Analysis - Community Detection and Evolution" and edited by two leading
researchers, Prof. Rokia Missaoui and Dr. Idrissa Sarr.
This volume is indeed unique in its coverage and in the background of the elite
community of authors who have written its chapters. Some of the important
topics covered include the study of complex networks, from understanding group
cohesion to group detection and inter-network community evolution, as well as
information propagation without relying entirely on the link structure
of social networks. The key novelty of the latter approach lies in mining the
published messages within a microblog platform and extracting the hidden topics to
identify the seed users. The volume also discusses the notion of consensual communities
and shows that they do not exist within a random graph, which is further
evidence in support of the targeted formation of communities. Online communities
and behavior are also discussed, with emphasis on dating sites, to understand how
user attributes can help predict who will date whom, and hence to inform a recommendation
system for online dating websites. Further, a group of authors discusses the
modeling and visualization of hierarchical structures in large organizational email
networks. The evolution of groups and communities on Twitter is also tackled by
employing a technique that mixes natural language processing and social network
analysis. Another interesting study covers the influence of social media on the
election process, with a case study analyzing tweets related to the Iranian
presidential election. Finally, by combining all these topics related to communities
and evolution, this volume is an attractive source and reference for researchers,
practitioners, and students who want to learn about some interesting recent developments
in the field.
Calgary, August 2014 Reda Alhajj

Preface
Introduction
Most of the contributions in the present book contain recent studies on community
detection and/or evolution and represent extended versions of a selected collection
of articles presented at the 2013 IEEE/ACM International Conference on Advances
in Social Network Analysis and Mining (ASONAM), which took place in Niagara
Falls, Canada, between August 25 and 28, 2013. The topics covered by this book
can be categorized into two groups: community detection and evolution in the first
seven chapters, and two other related topics, namely link prediction and
influence/information propagation or maximization, in the last four chapters.
Community Detection and Evolution
The discovery of cohesive groups, cliques, and communities inside a network is one
of the most studied topics in social network analysis. It has attracted many
researchers in sociology, biology, computer science, physics, criminology, and so
on. Community detection aims at finding clusters as subgraphs within a given
network. A community is then a cluster where many edges link nodes of the same
group and few edges link nodes of different clusters.
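The "many edges inside, few edges between" criterion can be made concrete with a quality score. The sketch below uses Newman-Girvan modularity, which is one common choice and is not named in this preface; the function name `modularity` and the toy graph are illustrative assumptions.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; membership: dict node -> community label.
    """
    m = len(edges)
    intra = defaultdict(int)    # edges with both endpoints in community c
    degree = defaultdict(int)   # summed degree of the nodes in community c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Fraction of internal edges minus the fraction expected at random.
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
alternating = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}

q_good = modularity(edges, by_triangle)
q_bad = modularity(edges, alternating)
```

Grouping the nodes by triangle scores higher than the alternating assignment, which matches the intuition that a good community keeps most of its edges internal.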
A general approach to community detection consists in considering the network
as a static view in which all the nodes and links in the network are kept unchanged
throughout the study. Recent studies focus also on community evolution since most
social networks tend to evolve over time through the addition and deletion of nodes
and links. As a consequence, groups inside a network may expand or shrink and
their members can move from one group to another one over time.
Most of the studies on community evolution use topological properties to
identify the updated parts of the network and characterize the type of changes such
as network shrinking, growing, splitting, and merging. However, recent work has
focused on community evolution/detection by relying entirely on the behavior of
group members in terms of the activities that occur in the network rather than
exclusively considering links and network density.
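As a rough illustration of how topological changes between two snapshots might be labeled, the sketch below matches communities by member overlap and reports split, merge, grow, shrink, or stable events. The matching rule (overlap relative to the smaller community, with a 0.5 threshold) and all names are assumptions made for this example, not a method taken from the chapters; communities are assumed to be non-empty sets.

```python
def overlap(a, b):
    """Shared members relative to the smaller of two non-empty communities."""
    return len(a & b) / min(len(a), len(b))

def classify_changes(old, new, threshold=0.5):
    """Label changes between two snapshots given as lists of node sets.

    Returns (kind, old_indices, new_indices) tuples.
    """
    succ = {i: [j for j, n in enumerate(new) if overlap(o, n) >= threshold]
            for i, o in enumerate(old)}
    pred = {j: [i for i, o in enumerate(old) if overlap(o, n) >= threshold]
            for j, n in enumerate(new)}
    events = []
    for i, js in succ.items():
        if len(js) > 1:
            events.append(("split", (i,), tuple(js)))
        elif len(js) == 1 and len(pred[js[0]]) == 1:
            j = js[0]
            if len(new[j]) > len(old[i]):
                kind = "grow"
            elif len(new[j]) < len(old[i]):
                kind = "shrink"
            else:
                kind = "stable"
            events.append((kind, (i,), (j,)))
    for j, sources in pred.items():
        if len(sources) > 1:
            events.append(("merge", tuple(sources), (j,)))
    return events

# Hypothetical snapshots at two consecutive time periods.
old = [{1, 2, 3, 4}, {5, 6, 7}, {8, 9}, {10, 11, 12, 13}]
new = [{1, 2}, {3, 4}, {5, 6, 7, 8, 9}, {10, 11, 12, 13, 14}]
events = classify_changes(old, new)
```

Here the first community splits in two, the second and third merge, and the last one grows by one member. Behavior-based approaches such as those mentioned above would replace the overlap rule with activity information.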
Another interesting feature of social networks is the cohesiveness of a group and
how it varies over time. In fact, the cohesiveness of a group is a social factor that
assesses how members of a group are close to each other, and may help predict a
possible community splitting or disaggregation. The chapters from "Entanglement in Multiplex
Networks: Understanding Group Cohesion in Homophily Networks" to "The Power of
Consensus: Random Graphs Still Have No Communities" portray trends
in cohesiveness evaluation.
The chapter "The Emergence of Communities and Their Leaders on Twitter
Following an Extreme Event" by Yulia Tyshchuk, Hao Li, Heng Ji, and William
A. Wallace combines natural language processing with social network
analysis to explore Twitter messages in order to identify actionable ones, construct
an actionable network, identify communities with their central actors, and show the
behavior of the community members. The approach has been evaluated on two
important real-life events, namely the 2011 Japan Tsunami and the 2012 Hurricane
Sandy. The results help understand the behavior of communities as a whole and of
individual members of such cohesive groups. Since the two events have different
characteristics, the behavior of the people involved differs from one event to the
other. In particular, it was observed that there was limited participation of
government on Twitter during the 2011 Japan Tsunami compared to an active
involvement during the 2012 Hurricane Sandy. Moreover, the leadership roles were
stronger in the second event than in the first, while the cohesion in virtual communities
on Twitter seems weaker for Hurricane Sandy.
The chapter "Hierarchical and Matrix Structures in a Large Organizational
Email Network: Visualization and Modeling Approaches" by Benjamin H. Sims,
Nikolai Sinitsyn, and Stephan Eidenbenz studies the visualization and modeling
aspects of community detection. The email network of a large scientific
research organization is analyzed in order to visualize and model organizational
hierarchies in complex network structures. To that end, formal organizational
divisions and levels are integrated with network data to gain insight into
the interactions between subdivisions of the organization and other external organizations.
In order to manage the complexity of the large email network, the
Girvan-Newman algorithm for community detection is applied. Then, a power law
model to forecast the degree distribution of organizational email traffic is defined based
on the hierarchies that hold between managers and employees.
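For readers unfamiliar with the Girvan-Newman algorithm, the sketch below shows its core step on a toy graph: repeatedly remove the edge with the highest shortest-path betweenness until the graph falls apart into more components. It is a minimal, unoptimized illustration under the assumption of an undirected, unweighted graph; the function names are hypothetical and it says nothing about how the chapter configures the algorithm for the email network.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    score = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        paths = defaultdict(float)   # number of shortest paths from s
        paths[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:                 # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = defaultdict(float)
        for w in reversed(order):    # push path credit back toward s
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    return score

def components(adj):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result

def girvan_newman_split(edges):
    """Remove highest-betweenness edges until the component count increases."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:                # no edges left to remove
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy graph (hypothetical): two triangles joined by the bridge (2, 3).
communities = girvan_newman_split(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
)
```

The bridge carries every shortest path between the two triangles, so it is removed first and the two triangles emerge as communities. Recomputing betweenness after each removal is what makes the method expensive on large networks.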
The chapter "Overlaying Social Networks of Different Perspectives for Inter-network
Community Evolution" by Idrissa Sarr, Joseph Ndong, and Rokia
Missaoui uses probability and possibility theories as two alternative solutions to
discover perspective (temporary) communities and highlight community evolution.
Starting from snapshots of the network at different time periods, the underlying
social network is analyzed in order to first identify active actors (i.e., actors that
participate in at least a predefined number of activities) during a set of time slots,
and then delimit the perspective communities they form over time. Besides the fact
that the approach tracks the evolution of the network and identifies the perspective
communities, it gives a basic way to identify both active and passive users. The
latter group of users can be seen as churners in customer relationship management
(CRM) applications. Furthermore, mapping perspective communities to an initial
(or important) network adds new links that improve the network accessibility, and
hence the information flow.
The chapter "Study of Influential Trends, Communities, and Websites on the
Post-election Events of Iranian Presidential Election in Twitter" by Seyed Amin
Tabatabaei and Masoud Asadpour analyzes 1,375,510 tweets of Twitter users who
were interested in the Iranian presidential election and its aftermath. The top URLs that
appeared in the tweets indicate that the most influential websites are those related to
social networking and social media. Important keywords used in the tweets
during nine days are extracted, and the most popular websites among two distinct
groups of users (Persian- and English-speaking users) are found. These groups represent
the core part of the network and help communicate the news, events, and
messages to audiences abroad. Peripheral users are identified, as well as a few
subcommunities within the groups. The specification of subcommunities (i.e., the
supporters of political groups) is based on the keywords extracted from the
tweets using a customized version of TF-IDF. Another result shows a strong link
between the posted tweets and the political events that occurred on the same day.
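The preface does not describe how the chapter customizes TF-IDF, so the sketch below shows only the textbook form: a term scores high for a group when it is frequent in that group's tweets and rare across groups. The function name `tf_idf` and the token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict group name -> list of tokens. Returns group -> {term: score}."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))   # count each term once per group
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical token lists, one per user group.
docs = {
    "group_a": ["news", "vote", "rally", "rally"],
    "group_b": ["news", "vote", "recount"],
}
scores = tf_idf(docs)
top_a = max(scores["group_a"], key=scores["group_a"].get)
top_b = max(scores["group_b"], key=scores["group_b"].get)
```

Terms shared by both groups receive a score of zero, so the remaining top-scoring terms are the ones that distinguish one subcommunity from the other.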
The chapter "Entanglement in Multiplex Networks: Understanding Group
Cohesion in Homophily Networks" by Benjamin Renoust, Guy Melançon, and
Marie-Luce Viaud deals with group cohesiveness in complex networks, mainly in
bipartite graphs. The authors use the concept of homophily to assess the similarity
between actors and the homogeneity of the groups they form. The key idea is that attributes
are exploited by investigating how they interact. In other words, the authors focus on
measuring the cohesion of a group through the interactions that take place between
attributes of actors. Hence, actor behavior is used to measure the intensity of
interactions and group cohesiveness, and interactions
between actors are a key element in identifying group structure and cohesiveness.
Instead of projecting a bipartite network onto a single-type network with entities of
the same type, which can lead to a loss of information or hide subtle characteristics of
the original data, the authors propose to study the multiplex networks directly. By
doing so, they demonstrate the feasibility of detecting community structure within
complex networks without the need to compute one-mode projections.
The chapter "An Elite Grouping of Individuals for Expressing a Core Identity
Based on the Temporal Dynamicity or the Semantic Richness" by Billel Hamadache,
Hassina Seridi-Bouchelaghem, and Nadir Farah is related to group detection
and especially to core identification in social networks. The core of a network can
be seen as a central part having a high influence on the communication flows that
involve the other nodes. The work can be seen as another contribution to
existing studies in group detection that adds the semantic and temporal dimensions.
In fact, the temporal dynamic behavior or semantic concepts of social entities are
an additional input to exploit in order to characterize and significantly strengthen a
group structure and highlight its cohesiveness. The key idea of this work is that
actors of a social network are likely to change their interactions over time by adding
or removing relations with others. This has an impact on their social position in the
network and/or their possible affiliation to one or more social groups. The temporal
change is in fact induced by many factors influencing actor behavior. Therefore,
using a semantic dimension such as the connection causality, the positive opinion of
socializing, and relationship kinds may help gauge the shape of groups and their
cohesiveness.
The chapter by Romain Campigotto and Jean-Loup Guillaume, "The Power of
Consensus: Random Graphs Still Have No Communities", defines the notion of consensual
communities and shows that they do not exist within a random graph. The
principle exploited by the authors is that the outcome of multiple runs of a
non-deterministic community detection algorithm is more significant than the
outcome of a single run. The authors define a consensual community as a set of nodes
that are frequently classified in the same community across multiple computations.
In other words, a consensual community is a repeatable outcome (set of
communities) obtained from a set of community detection algorithm computations.
The main reason for using consensual communities rather than classical communities
is that most techniques used to compute communities can
provide more than one solution, depending on the initial configuration
or the order in which nodes are considered. Moreover, consensual
communities can provide a deeper insight into the structure of the network since
they summarize many partitions and encode more information on the structure, such
as overlapping communities. However, when considering random
graphs, the authors show that it is practically impossible to find consensual communities.
The reason is that all pairs of nodes have the same probability of being connected in
random graphs. Furthermore, the authors demonstrate through various community
detection algorithms the existence of a threshold beyond which a trivial consensual
community containing all the nodes is found and below which each node forms its own
consensual community.
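One way to picture a consensual community is to count, for every pair of nodes, the fraction of runs that put them in the same community, keep the pairs whose fraction reaches a chosen cutoff, and take the resulting connected groups. The sketch below does this with a small union-find. The `threshold` parameter, the function name, and the four hand-written partitions are assumptions for illustration and stand in for the outputs of a real non-deterministic algorithm.

```python
from itertools import combinations

def consensual_communities(partitions, threshold):
    """Group nodes that share a community in at least `threshold` of the runs.

    partitions: list of dicts node -> community label, one dict per run.
    """
    nodes = sorted(partitions[0])
    runs = len(partitions)
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in combinations(nodes, 2):
        together = sum(p[u] == p[v] for p in partitions)
        if together / runs >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=min)

# Hypothetical outputs of four runs on the same five-node network.
runs = [
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1},
    {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0},
]
strict = consensual_communities(runs, 0.75)
loose = consensual_communities(runs, 0.5)
```

With the stricter cutoff, node c, which the runs assign inconsistently, ends up alone; with the looser cutoff everything collapses into one trivial community. On a random graph, the chapter's argument is that no intermediate, meaningful grouping is to be expected.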
The remainder of the book covers a few use cases of community structures that
address other issues in social network analysis, namely link prediction and
influence/information propagation and maximization.
Link Prediction
This important topic in social network analysis aims at predicting if two given
nodes have a relationship or will form one in the near future. It is exploited in many
social media applications such as the ones that need an embedded recommender
system to suggest new and relevant ties to the users. As in community detection,
similarity and proximity principles are widely used for link prediction. Moreover,
information about network communities can improve the accuracy of similarity-
based link prediction methods.
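A minimal sketch of similarity-based link prediction follows: each pair of unconnected nodes is scored by the Jaccard similarity of their neighborhoods, and an optional bonus is added when both nodes belong to the same community. The bonus is only one simple, assumed way of injecting community information; the function name and the toy network are hypothetical.

```python
from itertools import combinations

def jaccard_scores(adj, community=None, bonus=0.0):
    """Score every non-adjacent pair by neighborhood overlap.

    adj: dict node -> set of neighbors (undirected graph).
    community: optional dict node -> label; pairs sharing a label get `bonus`.
    """
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue                      # already linked
        union = adj[u] | adj[v]
        score = len(adj[u] & adj[v]) / len(union) if union else 0.0
        if community is not None and community[u] == community[v]:
            score += bonus
        scores[(u, v)] = score
    return scores

# Hypothetical friendship network.
adj = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
scores = jaccard_scores(adj)
best_pair = max(scores, key=scores.get)
```

In this toy network the strongest candidate is the pair (ann, dan), who share two of their three combined neighbors. A recommender would surface the top-ranked pairs as suggested ties.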
The chapter "Link Prediction in Heterogeneous Collaboration Networks" by
Xi Wang and Gita Sukthankar concerns link prediction in heterogeneous collaboration
networks. It studies both supervised and unsupervised link prediction in
networks where nodes may belong to more than one community, giving rise to different
types of collaborations. Links in heterogeneous networks arise for different
reasons, and hence cannot be treated in a homogeneous manner. To take
this into account, a new supervised link prediction framework, called Link
Prediction using Social Features (LPSF), is proposed. It integrates a re-weighting
scheme of the network by exploiting features of nodes extracted from patterns of
salient interactions in the network. It is shown that the proposed re-weighting
method in LPSF better reflects the intrinsic ties between nodes and provides
better prediction accuracy for supervised link prediction methods.
The chapter "Characterization of User Online Dating Behavior and Preference
on a Large Online Dating Site" by Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang,
Xiaodong Wang, Cindy Chen, Benyuan Liu, and Don Towsley studies user
behavior on an online dating website in order to understand how user attributes can
help predict who will date whom. By doing so, the authors aim to provide
guidelines for designing a recommendation system for online dating websites.
The work can thus be seen as a link prediction problem, since a
recommendation is made when two users are likely to date based on their profiles.
An interesting aspect that this chapter points out is that the connections between
individuals in the underlying network are not strongly related to simple and traditional
mechanisms such as preferential attachment or homophily. In particular, user
attributes based on preferential attachment cannot simply be used because user
behavior in choosing attributes at a given date may be largely random.
Moreover, the authors observe that the geographic distance between two users and the
photo count of users play an important role in their dating behavior, and therefore it
is important to differentiate between the effective preferences of users and the
random selection of attributes. The main questions during the validation of the approach
are: (1) How often does a user send and receive messages, and how do these
operations change over time? and (2) What is the correlation or link between the
sender and receiver behavior based on their own profiles?
Influence/Information Propagation and Maximization
Influence propagation is usually modeled using propagation models such as the Linear
Threshold Model and the Independent Cascade Model. These models assume that a
node is influenced based on the opinions of its local network neighborhood. It has
recently been shown that it is simpler and more realistic to model the propagation of
negative influence, which is more contagious, than to model positive influence.
Moreover, relying on community membership to study influence maximization is a
viable alternative that researchers have considered recently, as described in
the last two chapters of this volume.
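To make the Linear Threshold Model concrete, the sketch below activates a node once the summed weight of its already-active in-neighbors reaches the node's threshold, and repeats until nothing changes. In the usual formulation thresholds are drawn at random; here they are fixed so that the example is reproducible. The function name, weights, and thresholds are illustrative assumptions.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Spread activation under the Linear Threshold Model.

    in_weights[v] maps each in-neighbor u of v to the weight of edge u -> v
    (the weights into v are assumed to sum to at most 1).
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, neighbors in in_weights.items():
            if v in active:
                continue
            pressure = sum(w for u, w in neighbors.items() if u in active)
            if pressure >= thresholds[v]:
                active.add(v)
                changed = True
    return active

# Hypothetical four-node network with fixed thresholds.
in_weights = {
    "a": {},
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.4},
}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
reached = linear_threshold(in_weights, thresholds, seeds={"a"})
```

Seeding node a activates b directly and c through the combined pressure of a and b, while d stays inactive. Influence maximization then asks which seed set of a given size yields the largest final active set.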
The chapter "Latent Tunnel Based Information Propagation in Microblog
Networks" by Chenyi Zhang, Jianling Sun, and Ke Wang deals with information
propagation without relying entirely on the link structure of social networks. The
key novelty of the approach is to mine the published messages within a microblog
platform and extract the hidden topics to identify the seed users. The basic
assumption is that a target message is more likely to be forwarded or re-tweeted if it
is interesting to both the sender and the recipient, and that an interested user is more
likely to react to a message. Hence, when a topic catches the attention of two actors
through previous messages, the authors conclude that both actors will probably
react to the messages related to that topic and share a hidden link. They then
identify the seed users that will maximize the propagation, i.e., those
actors whose recipients are likely to forward the messages they publish,
and so on. To reach their goal, the authors unveil the latent topics associated with
social links by relying on a standard topic modeling technique based on Latent
Dirichlet Allocation. The modeling approach highlights the topic distribution for
each link, which explains its role in information flow. The resulting distributions
are used to estimate the propagation probability of a link for the target message.
The chapter by Mahsa Maghami and Gita Sukthankar, "Scaling Influence
Maximization with Network Abstractions", tackles the problem of influence maximization
in social networks with an application in the advertising domain. A solution
is developed to find the influential nodes in a social network as targets of
advertisement based on the network structure, the links among the actors in the
network, and the limited advertising budget. The solution is a hierarchical influence
maximization approach for product marketing that constructs an abstraction hierarchy
to scale and adapt optimization techniques to larger networks. An exact
solution is provided on smaller partitions of the network, and a candidate set of
influential nodes is selected to be propagated upward to an abstract representation
of the original network. The process of abstraction, solution, and propagation is
iteratively executed until the resulting abstract network becomes small enough to
use an exact optimization solution.
To conclude this preface, we would like to thank all authors for their significant
contributions that give a broad spectrum of research work on social network
analysis, mainly in community detection and evolution, link prediction, and
influence propagation. Our warm thanks go also to the reviewers for their careful
evaluation of the submissions and their useful comments and suggestions.
August 2014 Rokia Missaoui
Idrissa Sarr
Contents
The Emergence of Communities and Their Leaders on Twitter Following an Extreme Event
Yulia Tyshchuk, Hao Li, Heng Ji and William A. Wallace
Hierarchical and Matrix Structures in a Large Organizational Email Network: Visualization and Modeling Approaches
Benjamin H. Sims, Nikolai Sinitsyn and Stephan J. Eidenbenz
Overlaying Social Networks of Different Perspectives for Inter-network Community Evolution
Idrissa Sarr, Joseph Ndong and Rokia Missaoui
Study of Influential Trends, Communities, and Websites on the Post-election Events of Iranian Presidential Election in Twitter
Seyed Amin Tabatabaei and Masoud Asadpour
Entanglement in Multiplex Networks: Understanding Group Cohesion in Homophily Networks
Benjamin Renoust, Guy Melançon and Marie-Luce Viaud
An Elite Grouping of Individuals for Expressing a Core Identity Based on the Temporal Dynamicity or the Semantic Richness
Billel Hamadache, Hassina Seridi-Bouchelaghem and Nadir Farah
The Power of Consensus: Random Graphs Still Have No Communities
Romain Campigotto and Jean-Loup Guillaume
Link Prediction in Heterogeneous Collaboration Networks
Xi Wang and Gita Sukthankar
Characterization of User Online Dating Behavior and Preference on a Large Online Dating Site
Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang, Xiaodong Wang, Cindy Chen, Benyuan Liu and Don Towsley
Latent Tunnel Based Information Propagation in Microblog Networks
Chenyi Zhang, Jianling Sun and Ke Wang
Scaling Influence Maximization with Network Abstractions
Mahsa Maghami and Gita Sukthankar
Glossary
Index
Contributors
Masoud Asadpour Social Networks Lab, School of Electrical and Computer
Engineering, University of Tehran, Tehran, Iran
Romain Campigotto Sorbonne Universités, Paris, France; CNRS, Paris, France
Cindy Chen Department of Computer Science, University of Massachusetts
Lowell, Lowell, MA, USA
Stephan J. Eidenbenz Los Alamos National Laboratory, Los Alamos, NM, USA
Nadir Farah Laboratory of Electronic Document Management LabGED, Badji
Mokhtar Annaba University, Annaba, Algeria
Jean-Loup Guillaume Sorbonne Universités, Paris, France; CNRS, Paris, France
Billel Hamadache Laboratory of Electronic Document Management LabGED,
Badji Mokhtar Annaba University, Annaba, Algeria
Heng Ji Computer Science Department, Rensselaer Polytechnic Institute, Troy,
NY, USA
Hua Jiang Product Division, Baihe.com, Beijing, China
Hao Li Computer Science Department, Rensselaer Polytechnic Institute, Troy,
NY, USA
Benyuan Liu Department of Computer Science, University of Massachusetts
Lowell, Lowell, MA, USA
Mahsa Maghami Department of EECS, University of Central Florida, Orlando,
FL, USA
Guy Melançon CNRS UMR 5800 LaBRI, INRIA Bordeaux Sud-Ouest, Campus
Université Bordeaux I, Talence, France
Rokia Missaoui Université du Québec en Outaouais, Québec, Canada
Joseph Ndong Université Cheikh Anta Diop, Fann Dakar, Senegal
Benjamin Renoust CNRS UMR 5800 LaBRI, INRIA Bordeaux Sud-Ouest,
Campus Université Bordeaux I, Talence, France; Institut National de l'Audiovisuel
(INA), Paris, France
Bruno Ribeiro Department of Computer Science, University of Massachusetts
Amherst, Amherst, MA, USA
Idrissa Sarr Université Cheikh Anta Diop, Fann Dakar, Senegal
Hassina Seridi-Bouchelaghem Laboratory of Electronic Document Management
LabGED, Badji Mokhtar Annaba University, Annaba, Algeria
Benjamin H. Sims Los Alamos National Laboratory, Los Alamos, NM, USA
Nikolai Sinitsyn Los Alamos National Laboratory, Los Alamos, NM, USA
Gita Sukthankar Department of EECS, University of Central Florida, Orlando,
FL, USA
Jianling Sun College of Computer Science, Zhejiang University, Hangzhou,
China
Seyed Amin Tabatabaei Social Networks Lab, School of Electrical and Computer
Engineering, University of Tehran, Tehran, Iran
Yulia Tyshchuk Department of Industrial and Systems Engineering, Rensselaer
Polytechnic Institute, Troy, NY, USA
Don Towsley Department of Computer Science, University of Massachusetts
Amherst, Amherst, MA, USA
Kun Tu Department of Computer Science, University of Massachusetts Amherst,
Amherst, MA, USA
Marie-Luce Viaud Institut National de l'Audiovisuel (INA), Paris, France
William A. Wallace Department of Industrial and Systems Engineering, Rensselaer
Polytechnic Institute, Troy, NY, USA
Ke Wang School of Computing Science, Simon Fraser University, Burnaby,
Canada
Xiaodong Wang Product Division, Baihe.com, Beijing, China
Xi Wang Department of EECS, University of Central Florida, Orlando, FL, USA
Peng Xia Department of Computer Science, University of Massachusetts Lowell,
Lowell, MA, USA
Chenyi Zhang College of Computer Science, Zhejiang University, Hangzhou,
China; School of Computing Science, Simon Fraser University, Burnaby, Canada
The Emergence of Communities
and Their Leaders on Twitter Following
an Extreme Event
Yulia Tyshchuk, Hao Li, Heng Ji and William A. Wallace
Abstract Twitter is widely used as a channel for communication and information
dissemination. Government and non-government emergency management organizations
use Twitter to disseminate emergency-relevant information. However, these
organizations have limited ability to evaluate Twitter communication in order to
discover communication patterns, key players, and the messages being propagated
through Twitter about an event. More importantly, little is known about which
individuals or organizations disseminate warning information, provide confirmations
of an event and its associated actions, and urge others to take action. This paper
presents the results of an analysis of two events: the 2011 Japan Tsunami and 2012
Hurricane Sandy. These results provide insight into human behavior, collectively as
part of virtual communities on Twitter and individually as leaders and members of
those communities. Specifically, behavior is evaluated in terms of obtaining and
propagating warning information, seeking and obtaining additional information and
confirmations, and taking the prescribed action. The analysis employs a methodology
that shows how Natural Language Processing (NLP) and Social Network Analysis (SNA)
can be integrated to provide these results. The methodology makes it possible to
extract actionable Twitter messages, construct an actionable network, find actionable
communities and their leaders, and determine the behaviors of the community members
and their leaders. Moreover, the methodology identifies specific roles of the
community leaders. Such roles include dispensing unique/new emergency-relevant
information, providing confirmations to the members of the communities, and urging
them to take the prescribed action. The results show that government agencies had
limited participation on Twitter during the 2011 Japan Tsunami compared with
extensive participation during 2012 Hurricane Sandy. The behavior of Twitter users
during both events was consistent with the issuance of actionable information (i.e.,
warnings). The findings suggest higher cohesion among the virtual community members
during the 2011 Japan Tsunami than during 2012 Hurricane Sandy. However, during both
events members displayed agreement on the required protective action (i.e., if some
members were propagating messages to take action, the other members were taking
action). Additionally, a higher differentiation of leadership roles was demonstrated
during 2012 Hurricane Sandy, with a stronger presence of official sources in
leadership roles.

Y. Tyshchuk (✉) · W.A. Wallace
Department of Industrial and Systems Engineering, Rensselaer Polytechnic Institute,
110 8th Street, Cll 5118, Troy, NY 12180, USA
e-mail: tyshcy@rpi.edu
W.A. Wallace
e-mail: wallaw@rpi.edu
H. Li · H. Ji
Computer Science Department, Rensselaer Polytechnic Institute, 110 8th Street,
Cll 5118, Troy, NY 12180, USA
e-mail: haoli.qc@gmail.com
H. Ji
e-mail: hengjicuny@gmail.com
© Springer International Publishing Switzerland 2014
R. Missaoui and I. Sarr (eds.), Social Network Analysis - Community
Detection and Evolution, Lecture Notes in Social Networks,
DOI 10.1007/978-3-319-12188-8_1
Keywords Social network analysis · Community evolution · Community detection ·
Natural language processing · Emergency management · Twitter
1 Introduction
Twitter is an important channel of information dissemination. It is particularly useful
when current and relevant information is required. The format of Twitter messages
permits people to exchange information about any occurrence. This capability is
very useful during emergencies, events that pose a significant threat to one's
well-being. In our work we focused on one type of emergency: natural disasters. Twitter
messages, interview data, and electronic alerts concerning the 2011 Japan Tsunami
and 2012 Hurricane Sandy provided the data for the research reported in this paper.
During emergencies such as a tsunami or hurricane, when the impact on people
and infrastructure is significant, people engage in information milling: obtaining and
exchanging information and/or confirmation. The process requires rapid access to
the most current information. Twitter has the capability to provide this functionality.
Additionally, Twitter provides people with a way to connect with others affected by
the same emergency, which can provide emotional support [36].
One of the significant challenges in studying Twitter is the sheer volume of data and
the lack of ability to read the data efficiently. In this paper Natural Language
Processing (NLP) techniques were used to extract three types of actionable events from
the 2011 Japan Tsunami and 2012 Hurricane Sandy data sets: receive the warning, seek
information or confirmation, and take prescribed action. NLP techniques were also used
to associate tweets with two attributes: modality and polarity. These attributes
provide further insights into the information being shared on Twitter. Additionally,
first story analysis demonstrated the amount of unique/new emergency-relevant
information that was exchanged among the Twitter users. The analysis was also used to
trace the information initiators.
The paper begins with an evaluation of existing methods. The paper then describes
the novel methodology that was applied, which combines NLP with Social Network
Analysis (SNA) techniques. The paper proceeds to describe the data set used in
applying the methodology. The results are then described in detail in the following
section. The paper concludes with a discussion of contributions and suggestions for
future research.
2 Related Work
2.1 Warning Response Process During Emergencies
During emergencies, affected individuals participate in the warning response process,
which includes obtaining and sharing information, the evidence of which can be
discovered on Twitter during an emergency. In general, the warning response process
for an individual has been segmented into six stages [26]: (1) obtaining/hearing the
warning; (2) understanding the contents of the warning; (3) trusting the warning;
(4) personalizing the warning; (5) seeking information/confirmation; and (6) taking
action.
An individual starts the warning response process by receiving notification of the
emergency and ends the process by taking action, where doing nothing is a valid
action. However, how and when each stage is accomplished may vary across indi-
viduals and emergencies [25]. The first stage of the warning response process for
individuals is to obtain the warning from one or many sources. The second stage of the
warning response process requires assigning a specific meaning to the warning
message, which can vary from individual to individual. This meaning can also differ
from what was intended by the issuing source. The third stage is trusting the warning
message, which is influenced by many factors such as the source of the message, its
contents, and the channel. The fourth stage requires personalization of the warning to
one's situation. This requires an individual to assess her or his willingness to assume
the necessary personal risk. The fifth stage of the warning response process is to seek
additional information or attempt to obtain confirmations about the information already
obtained [26]. This process is often referred to as the warning confirmation process.
The final stage of the warning response process is taking action. People engage in the
action they believe is best for them, which may be at odds with the prescribed action.
Three stages of the warning response process (obtaining/hearing the warning, seeking
information/confirmation, and taking action) can be inferred from communication
between individuals, unlike the other three stages, which are cognitive processes.
2.2 Social Media During Emergencies
Social media has been used by the public as well as by governmental and
non-governmental organizations during emergencies. Examples include the rapid
dissemination of information about one's well-being, as demonstrated by the
researchers in [15]. In Haiti, the U.S. government was able to utilize social media,
such as wikis and workspace-sharing media, as a knowledge-based system [40]. The
researchers in [35] developed a unique annotation, which facilitated the
emergence of digital volunteers. Social media provides a natural environment for
facilitating decentralized coordination for onsite field response teams [34]. During
the 2011 Japan Tsunami, people utilized Twitter for information milling, warning
propagation, providing information about recovery efforts, and emotional support [36].
2.3 Social Network Analysis and Twitter
Social network analysis facilitates the determination of the communication patterns
among users. In [36], the researchers showed that social network analysis is a useful
tool in identifying information sources. It was demonstrated that there are various
techniques rooted in social network analysis to study emergent communities on Twit-
ter [36]. The Twitter communication networks were analyzed to find the structural
phenomena related to directed closure and its role in link formation [32]. In [33],
researchers studied the Twitter hashtag adoption based on the structural properties of
the network. The research showed that Twitter communication networks that drive
the daily interactions among people are sparse and are based on existing friends and
followers [14].
2.4 Open-Domain Event Discovery
Traditional event extraction work focused on supervised learning for pre-defined
event types in formal genres such as newswire [18, 22, 23]. However, these methods
are not appropriate for social media, which covers a wide range of diverse topics
and lacks labeled data. Early work of event discovery exploited the word distribution
differences across instances. For example, Yang et al. [39] detected events by cluster-
ing documents based on the semantic distance between documents, while Kleinberg
et al. [19] used word distributions to discover events by grouping words together.
Some recent work attempted to rapidly and automatically adapt an event extraction
system to new event types. For example, Li et al. [24] automatically acquired verb
clusters from parallel corpora and discovered novel events based on named entity
recognition, semantic role labeling, and active learning.
Unlike formal genres, the social media stream is characterized by short messages
with heavily colloquial speech. To handle such a data stream, Weng and Lee [38]
tackled the event discovery task for Twitter by detecting important word tokens and
clustering them to represent novel events. They analyzed word-specific signals in
the time domain. Signals for individual words were built by applying wavelet analysis
to the frequency-based raw signals of the words, so that important words could be
identified from the corresponding signal auto-correlations.
The researchers in [3] developed a graphical model to extract event records from
Twitter by learning a latent set of records and a record-message alignment
simultaneously. However, their method requires a seed set of example records as the
source of supervision, so it is not appropriate for our use. The researchers in [31]
trained a supervised model to extract event tuples from tweets. However, their approach
is highly restricted to their annotated event types and was not able to capture events
in our domain (e.g., evacuation events).
To conclude, our event extraction approach is most related to the research explored
in [24, 38]. Given some event clusters as seeds, we obtained new relevant keywords
to expand each event keyword cluster and use these clusters to represent events.
In addition, we utilized semantic attributes to declaratively discriminate specific and
affirmed events from others. To the best of our knowledge, this is the first work to
incorporate semantic attributes into novel event discovery in an open domain.
2.5 First Story Detection
The traditional approach for first story detection uses a term vector to represent each
document (e.g., a news article) [1, 2]. Each new document is then compared with
the previous ones, and if its similarity with the closest document is below a threshold,
it is declared to be a new story. However, this approach is not feasible for large data
sets (e.g., tweets) because of its high computational cost. A computationally better
approach for first story detection task utilizes locality-sensitive hashing (LSH) with
a variance reduction strategy [28]. This method can achieve similar performance
while gaining more than an order of magnitude speedup compared with the system
previously described in [2]. Experiments using this method were conducted on large
streaming Twitter data sets and achieved reasonable results. In this paper, the above-
described approach is used for first story detection in tweets. Given a large number
of tweets sorted chronologically, we apply LSH to group similar tweets together and
identify all the tweets that discuss a new piece of information. In addition, we link
later tweets to earlier ones if they discuss a similar piece of information,
in order to generate information clusters.
3 Methodology
3.1 Overview
An overview of the approach taken in this paper is illustrated in Fig. 1. First, data
was collected via the streaming Twitter API during the time of an emergency. Then the
data was processed using a Support Vector Machine (SVM) based on-topic/off-topic
binary classifier to extract tweets related to the emergency. Note that the
on-topic/off-topic classification was conducted on the 2011 Japan Tsunami event only.
The 2012 Hurricane Sandy data set was collected using the hashtags #Sandy and
#Hurricane; therefore, all of its tweets were on-topic. Next, a selected set of search
terms was used to annotate the tweets with actionable events: propagate the warning,
seek information or confirmation, and take prescribed action. To overcome the
unstructured format of the tweet text, an appropriate set of NLP techniques was used.
The annotation was further enriched by assigning two attributes to each tweet: polarity
and modality. This was accomplished via SVM-based event attribute classification.
Subsequently, the first story analysis was conducted using the Locality Sensitive
Hashing algorithm to detect the information clusters as well as the tweets that first
introduced the information on Twitter.
Fig. 1 Overview of the methodology
The timelines were either constructed from data collected in on-site
interviews and publicly available information on the Internet, or based on 24 h
time slices. The timelines were used to construct communication networks for each
time slice. A random walk algorithm was employed to discover communities in the
Twitter communication networks by time slice. SNA was used to identify the leaders
of these communities. The knowledge obtained from NLP about the tweet content
(actions, attributes, first story identification, and story ranking) enabled us to make
inferences about the behaviors of community members and the roles of their leaders.
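The two ways of slicing the timeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the three boundaries are the first three slice start times listed in Table 1, and the midnight start used for the fixed-width slices is an assumption.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def event_slice(timestamp, boundaries):
    """1-based index of the event-driven slice containing timestamp.

    boundaries is a sorted list of slice start times; 0 means the tweet
    precedes the first boundary, and the last slice is open-ended.
    """
    return bisect_right(boundaries, timestamp)

def daily_slice(timestamp, start, hours=24):
    """1-based index of the fixed-width (default 24 h) slice containing timestamp."""
    return (timestamp - start) // timedelta(hours=hours) + 1

# Start times (UTC, 11 March 2011) of the first three slices in Table 1.
tsunami_boundaries = [
    datetime(2011, 3, 11, 5, 46, 28),
    datetime(2011, 3, 11, 5, 55, 2),
    datetime(2011, 3, 11, 6, 41, 22),
]
```

A tweet is then assigned to the network of the slice whose index is returned, so each slice yields its own communication network.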
3.2 NLP Approach
3.2.1 Terminology
We defined the following terminology for a series of NLP approaches.
On-topic/Off-topic Tweets: We defined the tweets that were related to the topic
of our interest as on-topic and the rest as off-topic. In our case study, all tweets
related to the Japan Tsunami and Hurricane Sandy were on-topic. An on-topic tweet
example is as follows: "RT @CBCAlerts: 7.2 magnitude earthquake hits Northern
Japan. Tsunami alert has been issued.#Japan #Quake", while an off-topic tweet
example is as follows: "I have an early wake up, but 2 hour long Skype sessions w/
distant friends are worth the minimal hours of sleep. #buddies #friendsarefamily".
Actionable Events: Events that belong to the following categories: receive the
warning; seek information or confirmation; and take prescribed action. The
categories were selected from the six stages of the warning response process
described in Sect. 2.1.
Event Attributes: Event attributes were used to measure user intention to
participate in an actionable event. Two semantic attributes were adapted from the
Automatic Content Extraction 2005 Evaluation (ACE2005) [21] to describe each
actionable event: (1) modality, where an event was asserted when the author or
speaker made reference to it as though it were a real occurrence; and (2) polarity,
where an event was positive when it was explicitly indicated that the event
occurred.
Actionable Tweets: Tweets that belong to an actionable event (receive the warning,
seek confirmation, and take prescribed action).
First Story Tweets: Tweets that mention a seminal event for the first time, where a
seminal event is a particular event that occurs at a specific time and place, e.g., a
tsunami occurred in Sendai, Japan on March 11th, 2011.
3.2.2 On-Topic Tweet Detection
According to the hashtag definition from Twitter, the hashtag symbol, #, together
with a relevant keyword or a phrase in a tweet is used to categorize tweets and allow
them to be displayed more easily in Twitter Search. Also, popular hashtagged words
are often characterized as trending topics.
Inspired by the hashtag definition, we developed a novel annotation scheme based
on the assumption that tweets with the same hashtag are on the same topic. First,
we extracted hashtags with high frequency¹ that indicate trending topics. Then we
manually annotated each trending hashtag as either an on-topic or an off-topic hashtag.
After annotating the hashtags, we propagated the on-topic/off-topic label of each
hashtag to all tweets containing that hashtag. We trained an on-topic/off-topic tweet
classifier, based on Support Vector Machines (SVMs) [8], using the following features:
(1) unigrams (all unique unigrams of a tweet); (2) userID (the ID of the user who posted
the tweet); (3) replyID (the ID of the user to whom the tweet is replying); and
(4) mentionID (the IDs of users mentioned in the tweet). All hashtags were removed from
the tweets during the training and testing process, so the trained classifier was able
to process generic tweets without any hashtags.
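The annotation scheme and feature groups described above can be sketched as follows. This is an illustrative sketch only: the tweet dictionary layout (`text`, `user`, `reply_to`), the function names, and the rule of taking the first annotated hashtag when a tweet carries several are assumptions, and the SVM training step itself is omitted.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def frequent_hashtags(tweets, min_count=50):
    """Hashtags (lowercased) that appear more than min_count times."""
    counts = Counter(tag.lower() for t in tweets for tag in HASHTAG.findall(t["text"]))
    return {tag for tag, n in counts.items() if n > min_count}

def propagate_labels(tweets, hashtag_labels):
    """Label each tweet with the label of the first annotated hashtag it carries."""
    labeled = []
    for tweet in tweets:
        for tag in HASHTAG.findall(tweet["text"]):
            label = hashtag_labels.get(tag.lower())
            if label is not None:
                labeled.append((tweet, label))
                break
    return labeled

def features(tweet):
    """Unigram, userID, replyID and mentionID features; hashtags are removed first."""
    text = HASHTAG.sub(" ", tweet["text"])
    feats = {"unigram=" + w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)}
    feats.add("userID=" + tweet["user"])
    if tweet.get("reply_to"):
        feats.add("replyID=" + tweet["reply_to"])
    feats.update("mentionID=" + m for m in MENTION.findall(text))
    return feats
```

The resulting feature sets would then be vectorized and passed to an SVM trainer; because hashtags are stripped before feature extraction, the classifier cannot simply memorize the labels it was trained from.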
3.2.3 Actionable Event Extraction
After filtering out off-topic tweets, we developed a bootstrapping framework to predict
actionable events. To expand the keyword seeds, we followed the cross-lingual
event trigger clustering approach described in [24] to discover words with similar
meanings. The algorithm exploited the idea that if two words, w1 and w2, on the
source side of bilingual parallel corpora were aligned to the same word on the target
side with high confidence, they should have similar meanings. For each English keyword
seed, the search was to find other English words that shared the same frequently
aligned Chinese terms, and vice versa. The word alignment information between
each bilingual sentence pair was obtained by running Giza++ [27]. To eliminate
the noise introduced by automatic alignment, we filtered out stop words and those
English-Chinese word alignment pairs with a frequency (in the parallel corpora) less
than a threshold.² Finally, we used each expanded keyword set as keywords to retrieve
actionable events.
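A minimal sketch of the seed expansion step, assuming the word alignments have already been produced (for example by Giza++) and are available as a list of (English word, Chinese word) pairs, one entry per aligned token pair. The function name, the romanized placeholder token `cheli`, and the toy counts are hypothetical; only the frequency threshold of 4 comes from the text.

```python
from collections import Counter, defaultdict

def expand_seeds(seeds, alignments, stop_words, min_freq=4):
    """Add English words that share a frequently aligned Chinese term with a seed."""
    pair_counts = Counter(alignments)
    en_to_zh = defaultdict(set)
    zh_to_en = defaultdict(set)
    for (en, zh), n in pair_counts.items():
        # Drop stop words and alignment pairs seen fewer than min_freq times.
        if n < min_freq or en in stop_words:
            continue
        en_to_zh[en].add(zh)
        zh_to_en[zh].add(en)
    expanded = set(seeds)
    for seed in seeds:
        for zh in en_to_zh.get(seed, ()):
            expanded |= zh_to_en[zh]
    return expanded

# Toy alignment data: three English words aligned to the same Chinese term.
alignments = (
    [("evacuate", "cheli")] * 5
    + [("flee", "cheli")] * 4
    + [("leave", "cheli")] * 2
    + [("the", "cheli")] * 6
)
keywords = expand_seeds({"evacuate"}, alignments, stop_words={"the"})
```

Here "flee" is added because it shares a sufficiently frequent alignment with the seed, while "leave" falls below the threshold and "the" is filtered as a stop word.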
3.2.4 Event Attribute Labeling
In addition to identifying actionable events, we also labeled semantic attributes,
including modality and polarity, for each event. We learned a separate SVM-based
classifier for each attribute from ACE2005 training data.³ The learned classifiers were
applied to predict modality and polarity values for each actionable event. Because
the ACE2005 training data consists of news articles and our target domain is
tweets, we explored the following genre-independent features to bridge the genre
gap between news and tweets: (1) lexical features, including unique words, lower-case
words, lemmatized words, and part-of-speech tags; (2) n-gram features, where
an n-gram ng (n = 1, 2, 3) was selected as an indicative context feature if it met
one of the following two conditions: (i) ng appeared in only one class, with a
frequency higher than a threshold; or (ii) the probability of ng occurring in one
class was higher than a threshold, where both thresholds were optimized on a small
development set of 30 events; and (3) dictionary features, such as expression,
consideration, subjective, intention, condition, and negation.
¹ We treat hashtags that appear more than 50 times as high-frequency ones.
² We set the frequency threshold to 4.
³ http://www.itl.nist.gov/iad/mig/tests/ace/2005/.
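The n-gram selection rule in item (2) can be illustrated with the sketch below. The threshold values are placeholders (the text states only that they were tuned on a development set), the function names are hypothetical, and applying condition (ii) only to n-grams seen in more than one class is an interpretation, since an n-gram seen in a single class trivially has probability 1 for that class.

```python
from collections import Counter, defaultdict

def ngrams(tokens, max_n=3):
    """All n-grams of a token list for n = 1..max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def select_indicative(examples, min_freq=2, min_prob=0.7):
    """Map each indicative n-gram to the class it indicates.

    examples is a list of (tokens, label) pairs.
    """
    counts = defaultdict(Counter)  # n-gram -> Counter of class labels
    for tokens, label in examples:
        for gram in ngrams(tokens):
            counts[gram][label] += 1
    selected = {}
    for gram, by_label in counts.items():
        label, top = by_label.most_common(1)[0]
        if len(by_label) == 1:
            keep = top > min_freq                            # condition (i)
        else:
            keep = top / sum(by_label.values()) > min_prob   # condition (ii)
        if keep:
            selected[gram] = label
    return selected

examples = [
    (["will", "not", "evacuate"], "negative"),
    (["did", "not", "leave", "now"], "negative"),
    (["not", "going"], "negative"),
    (["we", "evacuate", "now"], "positive"),
    (["leaving", "now"], "positive"),
    (["go", "now"], "positive"),
]
indicative = select_indicative(examples)
```

In this toy data, "not" qualifies under condition (i) and "now" under condition (ii), while "evacuate" is split evenly between the classes and is not selected.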
3.2.5 First Story Detection and Event Clustering
The Locality Sensitive Hashing (LSH) method was used to mitigate the curse of
dimensionality and was applied to the FSD problem [28]. LSH was first proposed by
Indyk and Motwani [17]. The underlying idea is that if two documents
are close together, then after a projection operation these two documents will
remain close together. In other words, similar documents have a higher probability
of being mapped into the same bucket, so the collision probability is higher for
documents that are close to each other. Given an LSH setting of k bits and L hash
tables, two documents x and y collide if and only if:

$$h_{ij}(x) = h_{ij}(y), \quad i \in [1 \ldots L], \; j \in [1 \ldots k] \tag{1}$$

and the hash function $h_{ij}(x)$ is defined as:

$$h_{ij}(x) = \operatorname{sgn}\left(u_{ij}^{T} x\right) \tag{2}$$

where $u_{ij}$ are randomly generated vectors with components selected randomly from
a Gaussian distribution, e.g., N(0, 1).
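A small sketch of the sign-of-projection hash in Eq. (2). The function names are hypothetical, and treating a collision in at least one of the L tables as sufficient is an assumption borrowed from common LSH practice rather than something Eq. (1) spells out.

```python
import random

def make_hyperplanes(num_tables, bits, dim, seed=0):
    """Draw L x k random vectors u_ij with N(0, 1) components."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(num_tables)]

def signature(x, table_planes):
    """k-bit signature of x in one table: bit j is 1 when u_j . x >= 0."""
    return tuple(
        1 if sum(u_i * x_i for u_i, x_i in zip(u, x)) >= 0 else 0
        for u in table_planes
    )

def collide(x, y, planes):
    """True when x and y share all k bits in at least one table (assumed rule)."""
    return any(signature(x, t) == signature(y, t) for t in planes)

planes = make_hyperplanes(num_tables=4, bits=3, dim=3)
x = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]     # same direction as x
opposite = [-1.0, -2.0, -3.0]        # opposite direction to x
```

Because only the sign of each projection is kept, vectors pointing in the same direction always collide, while a vector and its negation flip every bit and do not collide under this rule.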
Algorithm 1 shows the pseudocode of the LSH approach for First Story Detection and
event clustering. All the tweets are sorted in chronological order. A novelty score
Score(d) is then assigned to each document d and tested against a threshold
t ∈ [0, 1]:⁴ if Score(d) satisfies the threshold test then d is a first story;
otherwise d is clustered with its most similar document that chronologically appears
before it. To calculate the distance between two documents we adapt the standard
cosine similarity between two vectors:

$$\mathrm{distance}(d, d') = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2}\,\sqrt{\sum_{i=1}^{n} (B_i)^2}} \tag{3}$$

where A and B are the term vectors of d and d′.
⁴ We set t to 0.2 in our experiments.
The advantage of LSH is that it only needs to find the nearest neighbor within the
set of documents that were mapped to the same bucket instead of among all the previous
tweets. Compared with brute-force search, the computation cost of the score function
drops from O(|Dt|) (where |Dt| is the number of tweets with a time stamp before that
of the current tweet) to O(1).
Algorithm 1: LSH-based FSD
foreach document d in corpus do
    add d to LSH;
    S ← set of points that collide with d in LSH;
    dis_min(d) ← 1;
    foreach d′ in S do
        c = distance(d, d′);
        if c < dis_min(d) then
            dis_min(d) ← c;
        end
    end
    score(d) = 1 − dis_min(d);
end
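The loop of Algorithm 1 can be sketched in Python as below. Several choices here are assumptions rather than details confirmed by the text: tweets are tokenized by whitespace, the distance is taken as cosine distance (1 minus cosine similarity), a collision in any one table counts, and a tweet is treated as a first story when its score (similarity to the nearest colliding earlier tweet) is at or below the threshold. The class and function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

class RandomHyperplaneLSH:
    def __init__(self, tables=8, bits=4, seed=0):
        self.rng = random.Random(seed)
        self.tables, self.bits = tables, bits
        self.planes = {}  # (table, bit, term) -> Gaussian component, drawn lazily
        self.buckets = [defaultdict(list) for _ in range(tables)]

    def _component(self, i, j, term):
        key = (i, j, term)
        if key not in self.planes:
            self.planes[key] = self.rng.gauss(0.0, 1.0)
        return self.planes[key]

    def _signature(self, i, vec):
        return tuple(
            sum(self._component(i, j, t) * v for t, v in vec.items()) >= 0
            for j in range(self.bits)
        )

    def add(self, doc_id, vec):
        """Insert vec and return the ids of earlier documents sharing a bucket."""
        collisions = set()
        for i in range(self.tables):
            bucket = self.buckets[i][self._signature(i, vec)]
            collisions.update(bucket)
            bucket.append(doc_id)
        return collisions

def first_story_detection(tweets, threshold=0.2):
    """Return (is_first_story, linked_earlier_tweet) for chronologically sorted tweets."""
    lsh, vectors, results = RandomHyperplaneLSH(), [], []
    for doc_id, text in enumerate(tweets):
        vec = Counter(text.lower().split())
        vectors.append(vec)
        dis_min, nearest = 1.0, None
        for other in lsh.add(doc_id, vec):
            c = cosine_distance(vec, vectors[other])
            if c < dis_min:
                dis_min, nearest = c, other
        score = 1.0 - dis_min
        is_first = score <= threshold  # assumed direction of the threshold test
        results.append((is_first, None if is_first else nearest))
    return results
```

Only documents returned by `add` are compared, which is what keeps the per-tweet cost independent of the number of earlier tweets; linking a non-novel tweet to its nearest earlier tweet yields the information clusters.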
3.3 SNA Methodology
3.3.1 Network Construction
The communication network of Twitter data was constructed using the communication
directional identifiers: @ for directed and mention tweets and RT for re-tweets.
Two relationships were incorporated into the communication network: the
directed/mention and the re-tweet relationships. For the directed/mention relationship
an edge existed if one user directed a tweet to and/or mentioned another user. The user
doing the tweeting was at the head of the edge and the user who was mentioned or to
whom the tweet was directed was at the tail. For the re-tweet relationship an
edge existed if a user re-tweeted another user's tweet. The user doing
the re-tweeting was at the tail of the edge and the user who sent the original message
was at the head. The network was constructed for each of the time
slices of the event timeline previously discussed. This allowed for investigation of
the evolution and the dynamics of the network. The research evaluated actionable
behaviors on Twitter; therefore, only actionable tweets were utilized to construct the
network. The constructed network is referred to as the Twitter communication network
in the following sections.
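The edge rules above can be sketched as a small parser. The regular expressions, the `(tail, head, relation)` tuple layout, and the decision to ignore further mentions inside a re-tweet are simplifying assumptions made for illustration.

```python
import re

RETWEET = re.compile(r"^RT @(\w+)")
MENTION = re.compile(r"@(\w+)")

def tweet_edges(author, text):
    """Return (tail, head, relation) triples for one tweet."""
    rt = RETWEET.match(text)
    if rt:
        # Re-tweet: the re-tweeting user is the tail, the original author the head.
        return [(author, rt.group(1), "retweet")]
    # Directed/mention: the tweeting user is the head, the addressed user the tail.
    return [(target, author, "mention") for target in MENTION.findall(text)]
```

Applying `tweet_edges` to the actionable tweets of one time slice and collecting the triples gives the edge list of that slice's communication network.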
3.3.2 Attribute Setup
The NLP analysis assigned specific attributes to each actionable tweet: modality
and polarity. These attributes, as well as the type of action (i.e., receive the
warning, seek and/or obtain confirmation, and take the prescribed action), were
initially assigned as edge attributes in the Twitter communication network. However,
when the Twitter network was constructed, multiple and self-loop edges were discovered.
Multiple edges represent multiple tweets between two users. Self-loop edges
represent edges from a user to itself. The presence of such edges precluded the use
of community finding algorithms. To address this problem the network was
simplified and the edge attributes were automatically collapsed into node attributes
to preserve all of the extracted information. Each node's attribute was the sum of
all respective tweet attributes sent or received by the user. These attributes helped
define individuals' behaviors. For example, if the user (i.e., node) has the
attributes "take the prescribed action" with positive modality and polarity, the
person is taking the prescribed action. On the other hand, if the user (i.e., node) has
the attributes "take the prescribed action" with negative modality and polarity,
someone other than the person tweeting is not taking the prescribed action. The
NLP attribute assignment defines individual behaviors as well as the collective
behaviors of Twitter users who are part of the same community.
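A minimal sketch of the simplification step, assuming edges arrive as `(tail, head, attributes)` triples with one triple per actionable tweet. The attribute keys and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def simplify(edges):
    """Collapse a multigraph with self-loops into a simple graph.

    Returns the set of distinct (tail, head) edges without self-loops and,
    for every user, a Counter summing the attributes of all tweets the user
    sent or received.
    """
    node_attrs = defaultdict(Counter)
    simple_edges = set()
    for tail, head, attrs in edges:
        labels = [f"{key}={value}" for key, value in sorted(attrs.items())]
        for user in {tail, head}:      # a self-loop is counted once
            node_attrs[user].update(labels)
        if tail != head:
            simple_edges.add((tail, head))
    return simple_edges, node_attrs

edges = [
    ("a", "b", {"action": "take_action", "polarity": "positive"}),
    ("a", "b", {"action": "take_action", "polarity": "positive"}),
    ("c", "c", {"action": "seek_confirmation", "polarity": "positive"}),
]
simple_edges, node_attrs = simplify(edges)
```

The duplicate a-b edges and the self-loop on c disappear from the graph, but their counts survive in `node_attrs`, so no extracted information is lost.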
3.3.3 Community Finding
Currently, most community detection algorithms cannot handle the directedness of
edges [20]. To overcome this issue, networks are
often converted into undirected graphs for the purposes of community detection [11].
When Twitter users communicate with each other and direct their messages to
other users, the evidence of communication (tweets) is displayed in the profiles of
both users. This two-sided visibility allowed us to justify converting the network
from a directed to an undirected graph for community detection purposes. The community
finding approach utilized in the research was a random walk community detection
algorithm. The approach rests on the assumption that there are
only a few edges that leave communities. Therefore, the algorithm performs a number
of random walks on the network and then uses those walks to merge separate
communities in a bottom-up manner [29]. This particular algorithm is most appropriate
for finding communities in large sparse networks, which commonly occur in
Twitter data.
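The assumption that few edges leave a community can be illustrated numerically. The sketch below is not the community detection algorithm of [29]; it only converts a directed edge list to an undirected graph and computes the exact distribution of a short random walk, on a hypothetical toy graph of two triangles joined by a single edge.

```python
from collections import defaultdict

def to_undirected(directed_edges):
    """Drop edge direction and self-loops; return an adjacency map."""
    adj = defaultdict(set)
    for u, v in directed_edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def walk_distribution(adj, start, steps):
    """Exact probability of a uniform random walk being at each node after `steps` steps."""
    prob = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in prob.items():
            share = p / len(adj[node])
            for neighbor in adj[node]:
                nxt[neighbor] += share
        prob = nxt
    return dict(prob)

# Two triangles {0, 1, 2} and {3, 4, 5} connected only by the edge 2-3.
adj = to_undirected([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)])
dist = walk_distribution(adj, start=0, steps=3)
inside = sum(dist.get(node, 0.0) for node in (0, 1, 2))
```

After three steps most of the probability mass is still inside the starting triangle, which is the signal that random-walk methods exploit when merging nodes into communities.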
The social science literature informs the research on the properties of cohesive
groups. It suggests that the people in the same community tend to have similar
and redundant information. Moreover, there is an ease of information transfer in
cohesive groups [7, 30]. In this research, this concept was evaluated in the context
of the Twitter communication network during emergencies. To ascertain whether this
theory of group behavior applies to communications and behaviors on Twitter, the
correlation between community members, based on behaviors derived from the
Twitter users' behavioral attributes, was evaluated. The size of the communities found
in the data enabled us to determine how many people obtained similar information
and shared similar intents. The ten largest communities for each time slice were
evaluated by examining the similarity (correlation) of behavior among the community
members to discover the prevalent behavior.
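One way to quantify the similarity of behavior within a community is the average pairwise Pearson correlation between members' attribute-count vectors. The text does not specify which correlation statistic was used, so this choice, the function names, and the example vectors are assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def mean_pairwise_correlation(profiles):
    """Average correlation over all pairs of members in one community."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs) if pairs else 0.0

# Counts per member of (receive warning, seek confirmation, take action) tweets.
community = {"u1": [3, 0, 1], "u2": [6, 0, 2], "u3": [0, 4, 0]}
cohesion = mean_pairwise_correlation(community)
```

Members u1 and u2 behave alike (correlation 1), while u3, who mostly seeks confirmation, correlates negatively with both and lowers the community average.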
3.3.4 Centrality and Prestige
Once the communities were identified, the task was to find the community leaders.
Each community was taken separately and its leaders were identified as the
most central/prestigious actors. The centrality/prestige measures utilized
in this research were outDegree, inDegree, betweenness, and eigenvector centrality
(power). The outDegree centrality measure is simply the number of messages sent by a
Twitter user to other users in the network. OutDegree is associated with
faster information diffusion, as the information reaches more people. In [36], the
researchers showed that people with high outDegree engage in information propagation.
The inDegree measure represents the number of incoming messages sent to a Twitter user
by other users. Betweenness represents the level of control one user has
over the communication between other users: the betweenness of a node is the number of
shortest paths between any two nodes in the network that pass through
this node [37]. Users with high betweenness values serve as information
gatekeepers [36]. The power measure represents a node's connectedness to other central
nodes [6].
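The four measures can be computed on a small graph as sketched below. Betweenness uses Brandes' accumulation scheme on an unweighted, undirected adjacency map, and plain eigenvector centrality (computed by shifted power iteration) is used as a stand-in for the power measure of [6]; both are illustrative choices, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict, deque

def degrees(messages):
    """outDegree and inDegree from a list of (sender, receiver) messages."""
    return Counter(s for s, _ in messages), Counter(r for _, r in messages)

def betweenness(adj):
    """Shortest-path betweenness on an undirected adjacency map (Brandes' scheme)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], defaultdict(list)
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair was counted twice

def eigenvector_centrality(adj, iterations=100):
    """Power iteration on A + I (the shift avoids oscillation on bipartite graphs)."""
    x = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        x = {v: val / norm for v, val in nxt.items()}
    return x

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
out_deg, in_deg = degrees([("a", "b"), ("a", "c"), ("b", "a")])
btw = betweenness(path)
power = eigenvector_centrality(path)
```

On the three-node path a-b-c, node b lies on the only shortest path between a and c, so it has betweenness 1 and the highest eigenvector centrality.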
Each centrality measure is associated with a different kind of behavior, so users
who score high on each of those measures can represent different types of leadership.
Therefore, three types of leaders are defined: the diffuser, the gatekeeper, and the
information broker. The diffuser is a leader who diffuses information
through the network. This type of leader is associated with the outDegree measure, as
it measures the number of tweets (edges) a node sends out. Another type of leader is the
gatekeeper. A gatekeeper is a node that controls an information flow in the network.
The measures associated with the role of a gatekeeper are betweenness [12, 13] and
power [9]. Two types of gatekeepers emerge when the betweenness and
power measures are combined: the critical gatekeeper and the unique access
gatekeeper [9]. A critical gatekeeper is associated with high betweenness and low power
values, whereas a unique access gatekeeper is tied to low betweenness and high power
values [9]. We defined the final type of leader as the information broker, who has
access to valuable information and brokers it to other nodes in the network upon
request. An information broker is associated with high inDegree and high power
measures. A high power measure suggests access to other central actors and the
information they are able to provide. A high inDegree measure suggests a high frequency
of inquiry from other users in the community. The frequency of inquiry for information
can be inferred from the action attribute "seek and obtain confirmation".
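The mapping from measures to leadership roles can be written as a small rule set. The cut-off values that decide what counts as "high" are hypothetical placeholders, since the text does not state how high and low values were separated.

```python
def leader_roles(measures, cutoffs):
    """Return the leadership roles implied by one user's centrality measures.

    measures and cutoffs are dicts keyed by 'out_degree', 'in_degree',
    'betweenness' and 'power'; a value at or above its cut-off counts as high.
    """
    high = {name: measures[name] >= cutoffs[name] for name in cutoffs}
    roles = []
    if high["out_degree"]:
        roles.append("diffuser")
    if high["betweenness"] and not high["power"]:
        roles.append("critical gatekeeper")
    if high["power"] and not high["betweenness"]:
        roles.append("unique access gatekeeper")
    if high["in_degree"] and high["power"]:
        roles.append("information broker")
    return roles

# Hypothetical cut-offs, e.g. chosen per community from the observed distribution.
cutoffs = {"out_degree": 10, "in_degree": 10, "betweenness": 0.2, "power": 0.5}
```

A user can hold more than one role: high inDegree and power with low betweenness, for instance, yields both "unique access gatekeeper" and "information broker".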
Once the community leaders were identified, their behavior was evaluated based
on the type of actionable tweets they sent out. That behavior was then compared
with the overall behavior of the community members. For example, when a leader of
the community sent out a warning to evacuate, which was accompanied by the action
attribute "propagate the warning" and polarity "true", the expected result was for
the community to follow the lead and send out tweets with the action attributes
"propagate the warning" and/or "take a prescribed action" and polarity "true".
4 Data Description
The methodology presented in this work is generalizable to all emergencies. To
facilitate the understanding of the methodology and its generalizability, two events
were chosen: (1) the 2011 Japan Tsunami and (2) 2012 Hurricane Sandy. The two events
differed in their impact as well as in the duration of that impact. The tsunami
occurred on March 11th, 2011 and impacted the entire Pacific coastline. Over
15,000 people lost their lives due to the tsunami, including one in
Klamath River, CA, USA. It also produced between $12 and $16 million
worth of damage in California [10]. In Hawaii, the governor made a disaster
declaration [5]. Throughout the event the tsunami triggered multiple warnings
issued by the Tsunami Warning Centers and evacuation orders issued by the local
emergency management organizations. The event spanned 24 h. 2012
Hurricane Sandy formed on October 22nd, 2012 and dissipated on October 31st,
2012. The event affected 24 states along the eastern seaboard and prompted
disaster declarations in eleven states along the U.S. East Coast and New England.
Hurricane Sandy caused a significant impact, with at least 286 people dead and
$65 billion worth of damage in the U.S. alone [4].
Two types of data were collected for both events: qualitative and quantitative
data. For the 2011 Japan Tsunami the qualitative data was collected via semi-structured
interviews with members of the emergency community who were involved during the
event: members of Tsunami Warning Centers, emergency managers at Hawaii Civil
Defense and Del Norte County Emergency Management Services, and members of
local broadcast media. After Action Reports were collected during the interviews,
which allowed the construction of the detailed timeline of the event summarized in
Table 1. Additional information, obtained by searching publicly available
sources, further enriched the knowledge about the event and the details about
human behavior during the event. For 2012 Hurricane Sandy the qualitative data was
obtained via semi-structured interviews with New York State Department of Homeland
Security and Emergency Services Public Information Officers. Additional data
was made available via public resources provided by state governments and the Federal
Emergency Management Agency. The summarized version of the timeline for 2012
Hurricane Sandy is described in Table 2 [16].
Table 1 2011 Japan Tsunami timeline

Time slice  Time (UTC)             Events
1           5:46:28AM-5:55:02AM    PTWC registers an earthquake of magnitude 7.9, 231 mi. from
                                   Tokyo, Japan, and issues first bulletin (tsunami watch for HI)
2           5:55:02AM-6:41:22AM    PTWC issues second bulletins (international & HI); EOCs
                                   activated in HI
3           6:41:22AM-7:31:00AM    PTWC issues third bulletin; tsunami warning is issued in HI
4           7:31:00AM-9:01:00AM    Evacuation is ordered in HI, boat evacuations in HI and AK
5           9:01:00AM-12:30:00AM   Evacuation travel is completed in HI, U of HI is closed; CA
                                   issues evacuation orders; tsunami arrives in King Cove, AK
6           12:30:00AM-13:36:00AM  Tsunami arrives in HI: Hanalei, Kahului, Hilo
7           13:46:00AM-17:31:00AM  Tsunami warning is downgraded to advisory in HI; all ports and
                                   evacuation zone are closed in HI; tsunami arrives in
                                   Crescent City, CA
8           17:31:00AM-21:26:00AM  All clear is issued in HI
9           21:26:00AM-6:36:00AM   Final all clear is issued by PTWC

Table 2 2012 Hurricane Sandy timeline

Date        Events
October 22  Tropical Storm Sandy officially formed
October 23  Possible Tropical Storm Watch for Florida Keys
October 24  Tropical Storm Watch for east coast of Florida
October 25  Federal Emergency Management Agency (FEMA) elevates the enhanced watch for
            Washington D.C. FEMA deploys Incident Management Assistance Teams to CT, DE,
            NY, NJ, MA, NH, PA, and VT. Tropical Storm Watch was issued for NC and SC. The
            state and federal response coordination efforts continued
October 26  NY, MD, D.C., PA, NC declared a state of emergency
October 27  FEMA activated the National Response Coordination Center. Non-government
            organizations (i.e. Red Cross) began their coordination
October 28  Emergency declarations signed for CT, D.C., MD, MA, NJ, and NY. The USGS issued
            landslide alerts for several areas. New York City closed public transportation
            in preparation for the event
October 29  Pre-disaster declarations signed for DE, RI, and PA. Hurricane Sandy downgraded to
            post-tropical storm and made landfall in southern NJ
October 30  Major disasters declared for CT, NJ, and NY. Coordinated search, rescue, and
            recovery efforts began
October 31  Continued coordinated search, rescue, and recovery efforts

The quantitative data for both events consisted of Twitter data. For the 2011 Japan
Tsunami the data was obtained from the Information Sciences Institute through
collaborative work, and for 2012 Hurricane Sandy the data was collected in-house.
Twitter data was collected via the streaming Twitter API. The data included all of
the tweets sent or received during the time of the events. In addition to the tweet
messages, it also included user names, time stamps, and directed communication
identifiers such as @ for directed messages and RT for re-tweets. The data was
stored locally and can be accessed upon request.
5 Results
5.1 Natural Language Processing
For the 2011 Japan Tsunami data set, we were able to annotate 800 hashtags in a very
short time period (1.5 h) and thereby gathered a large number of human-annotated
tweets (311,735). Of the 800 hashtags, 37 were annotated as on-topic and the rest as
off-topic, which yielded 26,554 on-topic tweets and 285,181 off-topic tweets. To
balance the training and testing data, we randomly sampled the same number of
off-topic tweets as on-topic tweets to conduct the experiments. Of the resulting
53,108 tweets, 42,486 were randomly selected for training, and the remaining 10,622
were used for a blind test. The accuracy of on-topic classification for the 2011
Japan Tsunami was 81.93 %. The accuracy results for both datasets, 2011 Japan Tsunami
and 2012 Hurricane Sandy, for polarity and modality were 96.8 % and 78.4 %, respectively.
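The balancing and splitting step described above can be sketched as follows. This is
a minimal illustration and not the authors' code: the function name
balance_and_split, the fixed random seed, and the placeholder tweet strings are
assumptions; only the class sizes and the training-set size come from the text.

```python
import random

def balance_and_split(on_topic, off_topic, n_train, seed=0):
    """Undersample the off-topic class to the on-topic size, then split.

    Returns (train, test) lists of (text, label) pairs. The seed and the
    shuffle-then-cut strategy are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Draw as many off-topic tweets as there are on-topic tweets.
    sampled_off = rng.sample(off_topic, len(on_topic))
    labeled = [(t, 1) for t in on_topic] + [(t, 0) for t in sampled_off]
    rng.shuffle(labeled)
    return labeled[:n_train], labeled[n_train:]

# Placeholder tweets with the class sizes reported in the text.
on_topic = [f"on-{i}" for i in range(26_554)]
off_topic = [f"off-{i}" for i in range(285_181)]
train, test = balance_and_split(on_topic, off_topic, n_train=42_486)
print(len(train), len(test))  # 42486 10622
```

The balanced set holds 2 x 26,554 = 53,108 tweets, so a training set of 42,486 leaves
10,622 for the blind test, roughly an 80/20 split.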
The actionable tweets were aggregated per time period to evaluate the results and
compare the analyzed data and Twitter user behavior with the timeline of the events.
Table 3 presents the results for the 2011 Japan Tsunami. There is a spike in the
volume of tweets during time slice 4. This is natural, as that is when most of the
tsunami warnings were issued and evacuations were ordered along the affected
coastline. Moreover, it is evident that the "receive the warning" tweets are
prevalent in the earlier time slices and then gradually drop off as the event
concludes. This is a natural progression and corresponds to the event timeline. The
"take prescribed action" tweets peak in time slices five, six, and seven, after the
evacuation orders have been issued. Finally, the confirmation tweets increase in the
later time slices, after the warnings and evacuation orders were issued.
Additionally, during the later time slices people were confirming the well-being of
their friends and relatives affected by the event.
Table 3 2011 Japan Tsunami attributes per time slice

Time slice  Warn   Confirm  Action  -/+ asserted  -/+ polarity
2           58     None     None    24/34         None/58
3           328    2        4       202/132       4/330
4           6,984  588      484     4,592/3,464   481/7,575
5           2,043  360      224     1,566/1,061   231/2,396
6           1,021  312      204     828/709       182/1,355
7           1,589  519      274     1,299/1,083   230/2,152
8           1,093  529      122     849/895       163/1,581
9           2,026  1,498    216     1,743/1,997   470/3,270
Table 4 2012 Hurricane Sandy attributes per time slice

Day     Warn    Confirm  Action  -/+ asserted    -/+ polarity
Oct 25  2,009   1,792    283     2,220/1,864     447/3,637
Oct 26  12,731  8,856    2,686   15,181/9,092    2,925/21,348
Oct 27  16,167  10,761   5,759   20,743/11,944   4,689/27,998
Oct 28  47,873  50,390   37,215  83,989/51,489   18,527/116,951
Oct 29  80,721  71,092   35,992  105,720/82,085  21,504/166,301
Oct 30  70,027  60,482   25,952  89,872/66,589   20,412/136,049
Oct 31  26,360  30,002   9,935   41,343/24,954   8,191/58,106
Similar results can be seen for 2012 Hurricane Sandy in Table 4. The volume of
"receive the warning" tweets rises leading up to, and peaks on, the day of the
landfall in southern New Jersey (October 29). The volumes of "seek and obtain
confirmation" and "take the prescribed action" tweets also rise in the days leading
up to the landfall: confirmation tweets peak on October 29th, and action tweets peak
on October 28th, the day prior to the landfall, and stay close to that level on
October 29th. The warnings issued by the government emergency organizations for the
northeastern states required the impacted population to take action on October 29th.
The peaks occurring on Twitter around October 29th for "seek and obtain confirmation"
and "take the prescribed action" show that users on Twitter followed the patterns of
the evolution of the event. The analysis shows that the evolution of behaviors
extracted from the NLP action assignments to the tweets corresponds to a warning
response process cycle and to the overall evolution of both events.
5.2 Twitter Network Communities
First, the community results for the 2011 Japan Tsunami are evaluated. Table 5 shows
the results produced by the random walk algorithm. Note that time slice (TS) one was
omitted from the results because no communities were discovered during that time
slice. The range in the table represents the size range of the communities, i.e.,
for time slice 2 the size of the smallest community was 2 and the size of the
largest community was 11. A higher percentage of communities of size larger than
four (Percentage of >4 com.) occurs during time slices two, three, and four. This
result is expected, as the users are exchanging recently issued warning information
and confirming the prescribed action.
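The columns reported for each time slice in Table 5 can be derived from a list of
community sizes, as the sketch below shows. The function name summarize_communities
and the example sizes are hypothetical; the sizes are merely chosen to be consistent
with the time slice 2 row (ten communities, sizes between 2 and 11, one community
larger than four).

```python
def summarize_communities(sizes, threshold=4):
    """Summarize community sizes using the same columns as Table 5."""
    large = sum(1 for s in sizes if s > threshold)
    return {
        "n_communities": len(sizes),
        "size_range": (min(sizes), max(sizes)),
        "n_large": large,
        "pct_large": round(100 * large / len(sizes)),
    }

# Hypothetical sizes consistent with the time slice 2 row of Table 5.
sizes_ts2 = [2, 2, 2, 3, 2, 4, 3, 2, 11, 2]
print(summarize_communities(sizes_ts2))
```

The percentage column is simply the number of communities larger than the threshold
divided by the total number of communities; for example, 10 of 62 communities in
time slice 3 gives about 16 %.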
Table 5 2011 Japan Tsunami communities results

TS  # of com.  Range    # of com. (>4)  Percentage of >4 com. (%)
2   10         {2:11}   1               10
3   62         {2:41}   10              16
4   1,324      {2:248}  126             10
5   705        {2:110}  39              6
6   538        {2:51}   19              4
7   729        {2:51}   33              5
8   525        {2:51}   25              5
9   878        {2:61}   33              4

When the communities and their members were examined more closely, a significant
correlation was found in the behaviors of community members. Over all time slices,
every community had 80 % or more of its members exhibiting exactly the same
behavior, i.e., the same actionable event, modality, and polarity. For those
communities where there was a difference among the members' behaviors, the
difference was in actionable events, and not in modality or polarity. The members
usually split into two groups within the community, based on the actionable event: a
warning group, those who received and propagated the warning, and a take-action
group, those who expressed intent to take the prescribed action. The finding
suggests that people of a community tend to exhibit similar behaviors. It is
important for all members of the community to share similar polarity for their
behavior. For example, if the leader sends out a message urging people to evacuate
(action "propagate the warning" and polarity "positive"), the expected result for
the rest of the community is to respond with either the action "propagate the
warning" or "take prescribed action" with the same polarity. When the polarity was
evaluated among the members of the communities, only 5 % or less of all communities
exhibited a difference in polarity among their members. Additionally, tweets with
the confirmation actionable event rarely occurred in the large communities and were
more typical of communities of size <4. Moreover, when the communities were traced
from time slice to time slice, little overlap was discovered between their members.
This suggests that the communities formed on Twitter serve a purpose in each time
slice, such as propagating the warning, obtaining information or confirmation, or
exhibiting an intent to take the prescribed action. Once the action is completed
there is no longer a need to participate on Twitter.
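The two agreement measures discussed above (the share of members with exactly the
same behavior, and whether polarity is uniform within a community) can be computed
as in the following sketch. The function names, the tuple encoding of a behavior as
(action, modality, polarity), and the five-member example community are assumptions
made for illustration.

```python
from collections import Counter

def behavior_agreement(member_behaviors):
    """Fraction of members sharing the most common behavior triple.

    Each behavior is an (action, modality, polarity) tuple.
    """
    counts = Counter(member_behaviors)
    _, top = counts.most_common(1)[0]
    return top / len(member_behaviors)

def polarity_consistent(member_behaviors):
    """True when every member of the community has the same polarity."""
    return len({polarity for _, _, polarity in member_behaviors}) == 1

# Hypothetical community of five users: four propagate the warning and one
# intends to act, all with asserted modality and positive polarity.
community = [("warn", "asserted", "+")] * 4 + [("act", "asserted", "+")]
print(behavior_agreement(community))   # 0.8
print(polarity_consistent(community))  # True
```

In this example the community just meets the 80 % agreement level, and its members
differ only in the actionable event, which mirrors the pattern reported for the
tsunami communities.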
The 2012 Hurricane Sandy event spanned nine days, from its formation on October 22nd
to its dissipation on October 31st. This timespan allows for higher participation in
information exchange on Twitter. Table 6 shows the results produced by the random
walk algorithm for each of the seven days of collected data (October 25 to
October 31).
Table 6 2012 Hurricane Sandy communities results

Day  # of com.  Range       # of com. (>4)  Percentage of >4 com. (%)
1    842        {1:167}     79              9
2    62         {1:912}     419             11
3    1,324      {1:1,481}   546             11
4    705        {1:11,289}  2,412           14
5    538        {1:7,428}   3,293           12
6    729        {1:6,040}   2,531           11
7    525        {1:2,440}   1,208           11

Unlike the 2011 Japan Tsunami, the anticipated impact of 2012 Hurricane Sandy varied
and spanned the entire east coast of the United States. The vast area of impact
required different actions to be taken by the impacted population. The prescribed
actions differed even within areas as small as individual cities. New York City, for
example, was divided into three possible zones of impact, but the evacuation order
was issued only for Zone A. The diversity of the prescribed actions resulted in
diversity of behaviors among the members of the same Twitter communities, as some
members were required to evacuate and others were not. Members of the same community
were receiving and propagating warnings as well as confirming whether action was
required for their local area. The members of the same community differed in their
action attributes; however, the polarity for each action was the same among the
members of the same community. This finding is consistent with the 2011 Japan
Tsunami finding on polarity. The first story results suggested that each community
exchanged on average 14 % unique new information. The larger communities possessed
the least amount of unique new information. The information in such communities was
issued by selected individual members and then diffused to the rest of the members
of the communities. A new finding, in contrast to the 2011 Japan Tsunami findings,
is that tweets with the confirmation actionable event were no longer specific to
communities of size less than four. This finding can be explained by the fact that
people had more time to seek confirmations prior to the hurricane impact.
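The notion of "unique new information" can be illustrated with a toy first story
detector. The sketch below is an assumption-laden simplification and not the method
used in the chapter: it compares each tweet with all earlier tweets by brute-force
cosine similarity over word counts, whereas streaming approaches such as [28] rely on
locality-sensitive hashing. The function names, the 0.5 threshold, and the example
tweets are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def first_story_share(tweets, threshold=0.5):
    """Share of tweets whose nearest earlier tweet is below the threshold."""
    seen, first_stories = [], 0
    for text in tweets:
        vec = Counter(text.lower().split())
        nearest = max((cosine(vec, old) for old in seen), default=0.0)
        if nearest < threshold:
            first_stories += 1  # nothing similar was seen before
        seen.append(vec)
    return first_stories / len(tweets)

# Hypothetical tweets exchanged within one community, in time order.
tweets = [
    "evacuate zone a now",
    "RT evacuate zone a now",
    "subway service suspended tonight",
    "RT evacuate zone a now",
]
print(first_story_share(tweets))  # 0.5
```

Here two of the four tweets introduce new content and the other two are re-tweets,
so half of the exchanged information counts as unique.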
5.3 Community Leaders
First, the leaders of the communities discovered in the 2011 Japan Tsunami data were
evaluated. Specifically, only the communities of size larger than four were
examined. It was discovered that the roles of diffuser and gatekeeper were assumed
by the same nodes. Additionally, it was confirmed that the action "seek information
or confirmation" is a characteristic of communities of size smaller than four;
therefore, the information broker role was taken by a selected set of users in those
communities. As shown in Tables 7 and 8, the ten largest communities for time slice
four, when the critical warning information was issued, were selected for analysis,
and the diffuser and gatekeeper roles were combined and defined as community leaders.
Table 7 Time slice four community results

Community  Community size  Action   Modality      Polarity
2          32              Receive  Non-asserted  Positive
4          114             Receive  Asserted      Positive

Table 8 Time slice four community leadership results

Leaders          Action   Modality      Polarity  Tense
abc7             Receive  Non-asserted  Positive  Past
BreakingNews     Receive  Non-asserted  Positive  Present
fema             Receive  Non-asserted  Positive  Present
infoBMKG         Receive  Asserted      Positive  Past
BBCBreaking      Receive  Non-asserted  Positive  Past
CNN              Receive  Asserted      Positive  Past
BBCWorld         Receive  Non-asserted  Positive  Past
DamnItsTrue      Receive  Asserted      Positive  Present
thejakartaglobe  Receive  Non-asserted  Positive  Past
cnbrk            Receive  Non-asserted  Positive  Past

The community leaders were members of the traditional media and primarily focused on
diffusing the information (action attribute "propagate the warning"), and the other
community members followed the leaders by either taking the prescribed action or
propagating the warning. When the leaders were issuing information to evacuate
(actionable event "propagate the warning" and polarity "true"), the rest of the
community followed with one of two actions, "propagate the warning" or "take the
prescribed action", with the same polarity. While there was little overlap between
the communities across the timeline, a significant finding was the presence of the
leaders in all time slices. Whereas the members of communities participated in the
communication only during a particular time slice, the leaders continued their
participation throughout the event. This evidence suggests that Twitter users were
gravitating towards the leaders who were sources of information and at the same time
in control of the information, i.e., diffusers and gatekeepers.
Next, the leaders of the communities were evaluated for the 2012 Hurricane Sandy
event. Only the communities of size larger than four were examined. Two days were
selected to demonstrate the results: October 28th, the day prior to the landfall in
southern New Jersey, and October 29th, the day of the landfall. The finding that a
single leader serves as both diffuser and gatekeeper is consistent for both the 2011
Japan Tsunami and the 2012 Hurricane Sandy events. In contrast to the 2011 Japan
Tsunami, the broker type leader, i.e., the leader who was high in InDegree value and
high in confirmation actionable tweets, was now present in the communities of size
larger than four. This type of leader provided confirmations to other members of
communities on Twitter. The leaders that emerged for the top ten communities on the
day prior to the landfall in southern New Jersey and on the day of the landfall are
listed in Tables 9 and 10.
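The broker criterion mentioned above (high InDegree combined with many confirmation
tweets) can be sketched as a simple ranking. This is an illustration only: the
function name find_broker, the edge convention, the tie-breaking rule, and the
example community are assumptions, and the chapter does not state how the two
criteria were combined.

```python
from collections import Counter

def find_broker(edges, confirm_counts):
    """Pick the user with the highest in-degree, breaking ties by the
    number of confirmation tweets (an illustrative ranking rule)."""
    in_degree = Counter(target for _, target in edges)
    return max(in_degree, key=lambda u: (in_degree[u], confirm_counts.get(u, 0)))

# Hypothetical community: an edge (a, b) means user a addressed user b.
edges = [("u1", "hub"), ("u2", "hub"), ("u3", "hub"), ("u1", "u2"), ("hub", "u3")]
confirm_counts = {"hub": 12, "u2": 1, "u3": 0}
print(find_broker(edges, confirm_counts))  # hub
```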
Table 9 Leadership results: Day prior to the 2012 Hurricane Sandy landfall in
southern New Jersey

Community ID  Diffuser/gatekeeper  Broker
1             MikeBloomberg        twchurricane
2             NHCAtlantic          13News
9             HuriicaneSandy       HurriicaneSandy
16            ASPCA                AMoDELSLIFE
33            rickygervais         rickygervais
36            JamesYammouni        yumyumyumniall
37            googlemaps           googlemaps
38            BBCBreaking          BBCBreaking
39            KagroX               KagroX
40            jimmyfallon          jimmyfallon

Table 10 Leadership results: Day of the 2012 Hurricane Sandy landfall in southern
New Jersey

Community ID  Diffuser/gatekeeper  Broker
2             NHCAtlantic          WSJweather
19            fema                 BarackObama
29            DMVFollowers         DMVFollowers
34            livestream           mbarilla
42            nytimes              BuzzFeed
91            rickygervais         rickygervais
115           MikeBloomberg        MikeBloomberg
147           ASPCA                LindaFB
163           CP24                 DopeHNIC
226           TheIlluminati        GDominico

As previously discussed, the behaviors of the members of the communities varied due
to the variability of the warnings; however, the peaks and valleys in the
distributions of the aggregated actions of the community members followed the peaks
and valleys of the distribution of the leaders' actions. This finding is
demonstrated in Tables 11 and 12 for the day prior to the landfall in southern New
Jersey and in Tables 13 and 14 for the day of the landfall.
Table 11 Community results: Day prior to the 2012 Hurricane Sandy landfall in
southern New Jersey

Community  -/+ Rec    -/+ Seek  -/+ Act
1          365/1,366  102/687   641/2,100
2          22/692     7/27      3/68
9          2/7        0/250     0/1
16         16/140     8/84      13/3,211
33         2/29       20/384    2/8
36         5/65       0/4       0/0
37         0/7        0/6       3/118
38         0/2        0/4       0/903
39         2/7        6/25      107/19
40         2/37       1/7       0/5

Table 12 Leadership results: Day prior to the 2012 Hurricane Sandy landfall in
southern New Jersey

Leader          -/+ Rec   -/+ Seek  -/+ Act
MikeBloomberg   5/554     4/11      813/1,525
NHCAtlantic     22/1,161  4/4       1/1
HuriicaneSandy  0/0       0/1,081   0/0
ASPCA           3/75      0/0       11/3,064
rickygervais    0/0       7/4,709   0/0
JamesYammouni   4/918     0/0       0/1
googlemaps      0/5       0/0       0/1,080
BBCBreaking     0/0       0/2       0/896
KagroX          14/0      2/8       611/2
jimmyfallon     2/516     1/5       0/3

Table 13 Community results: Day of the 2012 Hurricane Sandy landfall in southern
New Jersey

Community  -/+ Rec     -/+ Seek   -/+ Act
2          726/14,586  332/2,208  1,424/11,234
19         91/1,431    50/433     229/6,151
29         12/1,719    14/233     2/3,309
34         9/1,238     8/10       4/24
42         124/5,503   12/202     24/388
91         15/104      43/7,461   1/28
115        13/788      12/131     154/1,320
147        28/272      1/20       366/1,799
163        34/2,194    1/15       0/18
226        14/1,605    6/7        0/1

Table 14 Leadership results: Day of the 2012 Hurricane Sandy landfall in southern
New Jersey

Leader         -/+ Rec   -/+ Seek  -/+ Act
NHCAtlantic    10/1,037  1/0       0/0
fema           0/149     2/4       2/2,734
DMVFollowers   0/12      0/53      0/1,696
livestream     0/1,470   0/0       0/0
nytimes        6/1,382   0/19      0/151
rickygervais   2/4       2/3,762   0/2
MikeBloomberg  1/247     0/8       29/503
ASPCA          1/112     0/0       184/838
CP24           10/1,088  2/0       0/3
TheIlluminati  3/802     1/0       0/0

The tables show Rec for "receive the warning", Seek for "seek confirmation or
information", Act for "take the prescribed action", (+) for positive polarity, and
(-) for negative polarity. These results suggest that community members followed the
actions of their respective leaders. First story analysis was used to evaluate the
role of the leaders in the communities and to assess the uniqueness of the
information they shared throughout the event. The number of first stories was
aggregated per leader to identify the percentage of unique information shared by
each leader. The analysis suggests that in the days leading up to the landfall in
southern New Jersey the leaders of the communities were sharing unique information
with their respective communities. During the landfall and the day after the
landfall, the information shared by the leaders was no longer unique and consisted
of previously transmitted information. Moreover, the most unique information was
shared by official sources such as MikeBloomberg and NYCMayorsOffice. This finding
suggests that Twitter users who were part of the communities led by the official
sources obtained first-hand information more quickly than the rest of the users on
Twitter.
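The claim that community members followed their leaders can be checked against the
reported counts for the day prior to the landfall. The sketch below uses the
positive-polarity counts from Tables 11 and 12 and pairs each community with its
diffuser/gatekeeper from Table 9. Comparing only the dominant action and only the
positive counts is a simplifying assumption made here, not the authors' procedure;
the variable and function names are likewise illustrative.

```python
ACTIONS = ("Rec", "Seek", "Act")

# Positive-polarity counts (Rec, Seek, Act) copied from Tables 11 and 12;
# each community ID is paired with its diffuser/gatekeeper from Table 9.
community_counts = {
    1: (1366, 687, 2100), 2: (692, 27, 68), 9: (7, 250, 1),
    16: (140, 84, 3211), 33: (29, 384, 8), 36: (65, 4, 0),
    37: (7, 6, 118), 38: (2, 4, 903), 39: (7, 25, 19), 40: (37, 7, 5),
}
leader_counts = {
    1: (554, 11, 1525), 2: (1161, 4, 1), 9: (0, 1081, 0),
    16: (75, 0, 3064), 33: (0, 4709, 0), 36: (918, 0, 1),
    37: (5, 0, 1080), 38: (0, 2, 896), 39: (0, 8, 2), 40: (516, 5, 3),
}

def dominant(counts):
    """Name of the action with the largest count."""
    return ACTIONS[counts.index(max(counts))]

matches = sum(
    dominant(community_counts[c]) == dominant(leader_counts[c])
    for c in community_counts
)
print(f"{matches} of {len(community_counts)} communities share their leader's dominant action")
```

For these ten communities the dominant positive-polarity action of the community
coincides with that of its leader in every case.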
6 Conclusion and Future Research
Two different events were evaluated. The events differ in impact area, time span,
and magnitude of impact. The 2011 Japan Tsunami event spanned just one day with very
limited time to respond, whereas 2012 Hurricane Sandy spanned nine days with much
more time to prepare and respond. During the 2011 Japan Tsunami the governmental
emergency management organizations made limited use of Twitter. However, the
traditional media outlets utilized Twitter extensively to disseminate warnings. In
contrast, during 2012 Hurricane Sandy local as well as state and federal
governmental emergency management organizations made extensive use of social media,
providing the vast majority of unique information to the Twitter users.
To overcome the lack of knowledge about which individuals or organizations
disseminate warning information, provide confirmations of an event and associated
actions, and urge others to take action, a methodology that combines natural
language processing (NLP) and social network analysis (SNA) was successfully applied
to two data sets collected from Twitter during the 2011 Japan Tsunami and 2012
Hurricane Sandy. The methodology employed was as follows: (1) assign actionable
events to each on-topic tweet using NLP; (2) construct a communication network of
tweets associated with actionable events; (3) use the network to discover
communities with SNA; (4) extract the leaders of the communities and identify their
roles with SNA; and (5) evaluate the behavior of the community members and their
leaders using NLP.
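Step (2) of the methodology can be illustrated with a small sketch that turns tweets
into weighted directed edges based on the @ identifiers they contain. The regular
expression, the edge direction (from the sender to the addressed user), the function
name build_edges, and the example tweets and user names are all assumptions made for
illustration rather than details taken from the chapter.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def build_edges(tweets):
    """Count directed edges (sender, addressed user) from @-mentions.

    `tweets` is an iterable of (sender, text) pairs. Retweets ("RT @user")
    are treated like any other mention in this sketch.
    """
    edges = Counter()
    for sender, text in tweets:
        for target in MENTION.findall(text):
            if target != sender:  # ignore self-mentions
                edges[(sender, target)] += 1
    return edges

# Hypothetical tweets; user names are placeholders.
tweets = [
    ("alice", "RT @agency: Tsunami warning issued for HI"),
    ("bob", "@alice are you evacuating?"),
    ("alice", "@bob yes, heading inland"),
    ("carol", "RT @agency: Tsunami warning issued for HI"),
]
for (src, dst), weight in sorted(build_edges(tweets).items()):
    print(src, "->", dst, weight)
```

The resulting edge list is the kind of input a community detection algorithm, such
as the random walk method of [29], would operate on in step (3).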
The analysis demonstrated that the behavior of the Twitter users was consistent with
the issuance of actionable information based on warnings. It was also discovered
that members of the same community demonstrate similar behaviors when faced with
very limited time to respond and diverse behaviors when faced with a longer time to
respond. Additionally, the diversity of the levels of impact and prescribed actions
also facilitated diverse behaviors among the members of the same communities during
2012 Hurricane Sandy. During the 2011 Japan Tsunami the leaders of the communities
were typically the traditional media, who were propagating the warnings and urging
the other community members to take the prescribed action. However, during 2012
Hurricane Sandy the leaders of the communities ranged from celebrities and
specialized organizations (e.g. various weather reporting agencies) to local, state,
and federal emergency management organizations. Moreover, it was discovered that the
leaders maintained their role throughout the entire event, while the rest of the
community members were present only during a selected time period. The communities
formed around the information sources, i.e., the leaders. The leaders of the
communities during 2012 Hurricane Sandy were able to introduce unique information
into the communities; moreover, it was the local official organizations who
introduced the majority of the unique information. The uniqueness of the information
shared by the leaders peaked prior to the hurricane landfall in southern New Jersey
and declined during and the day after the event.
The key contribution of the research is the insight it provides into human behavior
on Twitter during two major extreme events. The paper showed how extreme events with
different characteristics can prompt different human behavior on Twitter. The
research explored collective human behavior and demonstrated that events that allow
more time to respond and impact larger territories can result in weaker cohesion in
virtual communities on Twitter. The research also revealed stronger adoption of
Twitter by official emergency response organizations during 2012 Hurricane Sandy, a
year and a half after the 2011 Japan Tsunami. The official sources are not only
adopting the new technology offered by Twitter, but are also becoming leading
information sources on Twitter, as is evident from the leadership and first story
detection analyses for 2012 Hurricane Sandy. In future research, the authors will
attempt to include additional event attributes, i.e., location, to better understand
the impact of emergencies on communities. In addition, this will allow us to study
the co-evolution of the behavior of the community and its leaders and the structure
of the network throughout an emergency. It will also provide the means to
investigate the flow of actionable information and its distortion over time.
Acknowledgments This material is based upon work sponsored by the Army Research Lab under
Cooperative Agreement No. W911NF-09-2-0053 (NS-CTA), U.S. NSF under the grant
number CMMI V 1162409, U.S. NSF CAREER Award under Grant IIS-0953149, U.S. DARPA
Award No. FA8750-13-2-0041 in the Deep Exploration and Filtering of Text (DEFT) Program,
IBM Faculty award and RPI faculty start-up grant. The views and conclusions contained in this
document are those of the authors and should not be interpreted as representing the official policies,
either expressed or implied, of the Army Research Laboratory, DARPA, the National Science
Foundation or the U.S. Government.
References
1. Allan J, Lavrenko V, Jin H (2000) First story detection in TDT is hard. In: CIKM, pp 374-381
2. Allan J, Lavrenko V, Malin D, Swan R (2000) Detections, bounds, and timelines: UMass and
TDT-3. In: Proceedings of topic detection and tracking workshop, pp 167-174
3. Benson E, Haghighi A, Barzilay R (2011) Event discovery in social media feeds. In: ACL,
pp 389-398
4. Billion-dollar weather/climate disasters. In: National climatic data center and national oceanic
and atmospheric administration, 12 January 2014
5. Blair C (2011) Update: Hawaii Tsunami damage in tens of millions of dollars. In: Honolulu
civil beat. 14 March 2011
6. Bonacich P (1987) Power and centrality: a family of measures. Am J Sociol 92:1170-1182
7. Burt R, Lin N, Cook K (2011) Structural holes versus network closure as social capital. In:
Social capital: theory and research. Aldine Transaction
8. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2(3):27
9. Conway D (2009) Social network analysis in R
10. Ewing L (2011) The Tohoku tsunami of March 11, 2011: a preliminary report on effects to the
California coast and planning implications. In: California coastal commission report. Natural
Resources Agency, San Francisco
11. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3):75-174
12. Freeman LC (1979) Centrality in social networks conceptual clarification. Soc Netw
1(3):215-239
13. Freeman LC (1980) The gatekeeper, pair-dependency and structural centrality. Qual Quant
14(4):585-592
14. Huberman BA, Romero DM, Wu F (2009) Social networks that matter: Twitter under the
microscope. First Monday 14(1):8
15. Hughes A, Palen L (2009) Twitter adoption and use in mass convergence and emergency events.
In: Proceedings of the 6th international conference on information systems for crisis response
and management (ISCRAM), Gothenburg, Sweden
16. Hurricane Sandy: timeline. In: Federal emergency management agency. 12 January 2014
17. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of
dimensionality. In: STOC, pp 604-613
18. Ji H, Grishman R (2008) Refining event extraction through cross-document inference. In: ACL,
pp 254-262
19. Kleinberg JM (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov
7(4):373-397
20. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant
communities in networks. PLoS ONE 6(4):e18961
21. LDC: ACE (automatic content extraction) English annotation guidelines for events (2005).
http://projects.ldc.upenn.edu/ace/docs/english-events-guidelines_v5.4.3.pdf
22. Li H, Ji H, Deng H, Han J (2011) Exploiting background information networks to enhance
bilingual event extraction through topic modeling. In: Proceedings of international conference
on advances in information mining and management
23. Li Q, Ji H, Huang L (2013) Joint event extraction via structured prediction with global features.
In: Proceedings of the 51st annual meeting of the association for computational linguistics.
Association for Computational Linguistics, Sofia, Bulgaria, pp 73-82
24. Li H, Li X, Ji H, Marton Y (2010) Domain-independent novel event discovery and
semi-automatic event annotation. In: PACLIC, pp 233-242
25. Lindell M, Perry R (2012) The protective action decision model: theoretical modifications and
additional evidence. In: Risk analysis, vol 32(4), pp 616-632
26. Mileti D, Sorensen J (1990) Communication of emergency public warnings: a social science
perspective and state-of-the-art assessment. In: State-of-the-art assessment. Report prepared
for federal emergency management agency, Oak Ridge National Laboratory, Oak Ridge
27. Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput
Linguist 29(1):19-51
28. Petrovic S, Osborne M, Lavrenko V (2010) Streaming first story detection with application to
Twitter. In: HLT-NAACL, pp 181-189
29. Pons P, Latapy M (2006) Computing communities in large networks using random walks.
J Graph Algorithms Appl 10(2):191-218
30. Reagans R, McEvily B (2003) Network structure and knowledge transfer: the effects of
cohesion and range. In: Administrative science quarterly, vol 48(2), pp 240-267
31. Ritter A, Mausam, Etzioni O, Clark S (2012) Open domain event extraction from Twitter. In:
KDD, pp 1104-1112
32. Romero DM, Kleinberg JM (2010) The directed closure process in hybrid social-information
networks, with an analysis of link formation on Twitter. In: ICWSM
33. Romero DM, Meeder B, Kleinberg JM (2011) Differences in the mechanics of information
diffusion across topics: idioms, political hashtags, and complex contagion on Twitter. In: WWW,
pp 695-704
34. Sarcevic A, Palen L, White J, Starbird K, Bagdouri M, Anderson KM (2012) "Beacons of hope"
in decentralized coordination: learning from on-the-ground medical Twitterers during the 2010
Haiti earthquake. In: CSCW, pp 47-56
35. Starbird K, Palen L (2011) "Voluntweeters": self-organizing by digital volunteers in times of
crisis. In: CHI, pp 1071-1080
36. Tyshchuk Y, Wallace WA (2012) Actionable information during extreme events: case study:
warnings and 2011 Tohoku earthquake. In: SocialCom/PASSAT, pp 338-347
37. Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge
University Press, Cambridge
38. Weng J, Lee BS (2011) Event detection in Twitter. In: ICWSM
39. Yang Y, Pierce T, Carbonell JG (1998) A study of retrospective and on-line event detection.
In: SIGIR, pp 28-36
40. Yates D, Paquette S (2011) Emergency knowledge management and social media technologies:
a case study of the 2010 Haitian earthquake. Int J Inf Manag 31(1):6-13
Hierarchical and Matrix Structures
in a Large Organizational Email Network:
Visualization and Modeling Approaches
Benjamin H. Sims, Nikolai Sinitsyn and Stephan J. Eidenbenz
Abstract This paper presents findings from a study of the email network of a large
scientific research organization, focusing on methods for visualizing and modeling
organizational hierarchies within large, complex network datasets. In the first part
of the paper, we find that visualization and interpretation of complex organizational
network data is facilitated by integration of network data with information on formal
organizational divisions and levels. By aggregating and visualizing email traffic
between organizational units at various levels, we derive several insights into how
large subdivisions of the organization interact with each other and with outside
organizations. Our analysis shows that line and program management interactions in
this organization systematically deviate from the idealized pattern of interaction
prescribed by matrix management. In the second part of the paper, we propose a power
law model for predicting degree distribution of organizational email traffic based on
hierarchical relationships between managers and employees. This model considers
the influence of global email announcements sent from managers to all employees
under their supervision, and the role support staff play in generating email traffic,
acting as agents for managers. We also analyze patterns in email traffic volume over
the course of a work week.
Keywords Network visualization · Complex networks · Community detection ·
Power law model · Organizational hierarchies
This chapter was created within the capacity of an US governmental employment. US copyright
protection does not apply.
B.H. Sims (corresponding author) · N. Sinitsyn · S.J. Eidenbenz
Los Alamos National Laboratory, Los Alamos, NM 87545, USA
e-mail: bsims@lanl.gov
N. Sinitsyn
e-mail: nsinitsyn@lanl.gov
S.J. Eidenbenz
e-mail: eidenben@lanl.gov
© Springer International Publishing Switzerland (outside the USA) 2014
DOI 10.1007/978-3-319-12188-8_2
1 Introduction
In this paper, we present results of our analyses of large organizational email datasets
derived from the email traffic records of Los Alamos National Laboratory (LANL).1
Analyzing such large email datasets from complex organizations poses a number
of challenges. First, considerable work is required to parse large quantities of raw
data from network logs and convert it into a format suitable for network analysis and
visualization. Second, a great deal of care is required to analyze and visualize network
data in a way that makes sense of complex formal organizational structures: in
our case, 456 organizational units that are connected through diverse organizational
hierarchies and management chains. Finally, it can be difficult to sort out the effects
of email traffic generated by mass announcements and communications along man-
agement chains from the more chaotic, less hierarchical traffic generated by everyday
interactions among colleagues.
This paper addresses these complexities in two ways. First, we demonstrate
methods for understanding large-scale structural relationships between organiza-
tional units by using carefully thought-out visualization strategies and basic graph
statistics. Second, we propose a power law model for predicting the degree distri-
bution of email traffic for nodes of large degree that engage in mass emails along
hierarchical lines of communication. This likely characterizes a significant portion of
email traffic from managers (and their agents) to employees under their supervision.
This model goes beyond existing models of node connectivity in organizations by
considering the influence of specific email usage practices of managers.
Our motivation for this analysis is primarily sociological, with a focus on
understanding structural relationships among formal organizational divisions and
along defined management chains within a particular organization. Email network
analysis enables us to draw conclusions about the respective roles of different ele-
ments in the organizational hierarchy, beyond what is specified in organizational
charts and management plans. This offers insight into the functioning of the organi-
zation, and could have practical implications for management and communications.
Further, it provides a case study that can be compared to other organizational studies,
and demonstrates a general set of methods that can be employed to gain organiza-
tional insight from email data.
Footnote 1: This contribution is an extended version of [32].
2 Analysis and Visualization of Organizational Structure
The study of social networks in organizations has a long history, going back at least
as far as the Hawthorne studies of the 1920s, in which anthropological observations
of worker interactions at Western Electric's Hawthorne Works were represented as
networks [11, 20]. The convention of representing social connections as graphs, with
circles or other shapes representing individuals and lines representing relationships
between them, emerged in these very early stages of social network research [10].
Initially, these graphs were hand drawn, and typically laid out to qualitatively repre-
sent patterns the researchers found important. With the rise of computational social
network analysis in the 1970s and 1980s, it became possible to lay out graphs algo-
rithmically. Spring-based algorithms facilitated gaining visual insights into interac-
tional patterns in a more systematic way. Today, sophisticated graph drawing tools
like Cytoscape and Gephi provide network researchers with access to a wide range
of layout algorithms and drawing styles [16, 17]. Despite the rise of sophisticated
mathematical constructs for analyzing social network graphs, visual representations
still remain important, particularly in anthropological and sociological studies.
Studies of the network structure of organizations have drawn attention to the key
roles of structural holes and brokers. A structural hole is a relationship of nonredundancy
between two nodes in a network [4]: in other words, a structural hole exists
between two individuals if their connection would create a unique link between parts
of the network that are currently separated. Structural holes are very common in
most large organizations. When such a link is made, as long as it remains unique,
the individuals at both ends are able to function as brokers between the two parts of
the organization, a position that confers many benefits in terms of power and access
to information.
A great deal of research has focused on the respective roles of strong and weak
ties in the creation and transfer of knowledge in organizations. Weak ties are those
that are exercised rarely and often connect individuals to others who are at some
organizational or geographic remove. Strong ties are characterized by more frequent
interaction, more positive feelings, and exchanging of services. Weak ties have been
shown to be important in knowledge search, since they often provide access to novel
information, a key element in innovation [13, 14]. However, scientific and technical
knowledge have several features that are difficult to convey through weak ties. First,
these forms of knowledge often have a large tacit component [7, 29]. Tacit knowledge
is knowledge that has not been, and perhaps cannot be, formally expressed, and is
central to expert judgment. Because of this, it can only be effectively transferred
from one individual to another through prolonged, direct interaction. The transfer
of tacit knowledge between organizational units is facilitated by the existence of
multiple direct, strong ties. Scientific and technical concepts are also complex, and
thus require greater information bandwidth and/or more time to communicate, both
of which are facilitated by strong ties [14, 22]. In general, then, weak ties provide
access to new knowledge, which is key to developing innovative ideas, while strong
ties enable transfer and sharing of knowledge at a deeper level, which is necessary for
research collaboration and for the elaboration and implementation of new ideas [28].
The sociological literature on organizational gatekeepers suggests that some
individuals who occupy broker roles can play a critical role in knowledge transfer
within and between organizations. A study by Allen and Cohen [1] identified a key
tension between organization-based and discipline-based coding schemes in research
and development laboratories. Coding schemes are ways of perceiving and organiz-
ing the world that vary from one community to another. Organizations need access to
outside coding schemes to bring in new information and ideas, while internal coding
schemes facilitate close working relationships between colleagues. In the laboratory
they studied, Allen and Cohen found that the key mechanism for managing this ten-
sion was to place a limited number of individuals in informal gatekeeper roles. These
gatekeepers had more ties to technical disciplinary communities and colleagues out-
side the laboratory, and more familiarity with the research literature. Being in this
gatekeeper position relative to the outside world also made them preferred sources
of information and advice within the organization. Tortoriello et al., in a more recent
study [33], note that the tight relationships and shared knowledge that individual organizational
units need to function effectively inhibit their ability to interact effectively
with other organizational units. Having a limited number of people in gatekeeper roles
is a mechanism that enables groups to maintain a cohesive identity while preserving
access to important knowledge and information from elsewhere in the organization.
The rise of electronic mail as a central communication mechanism in organizations,
along with extensive archiving of email communications, has created a body of data
that can be used to analyze organizational interactions at very large scales. Auto-
matically collected email data has significant advantages for capturing interactions
among organizational units: although email does not capture all relevant interactions,
it provides comprehensive coverage across the entire organization without the over-
head involved in large-scale survey-based studies. Studies have shown that email
communication patterns generally reflect the underlying social network structure of
an organization [34].
The Enron corpus, released by regulators as part of an investigation into the
company's bankruptcy, is one of the few email datasets of significant
scope that are publicly available to researchers. As such, it has played a key role in the development
of email analysis techniques [5, 8]. However, the Enron corpus is quite small (half a
million messages between 158 individuals) compared to the total email volume of a
large organization. Unfortunately, larger email corpora (like the one analyzed here)
are often not considered publicly releasable, and are accessible only to researchers
internal to the organization in question. For example, [19] describes a very large
network of email communications among Microsoft employees. A key feature
of many of these email studies, which we build upon here, is that they track both
individual-level communications and communications across formal divisions of the
organization. Aggregating relationships based on formal organizational structures
offers an important level of insight, which can be particularly useful for managers
and analysts interested in interactions among business units, capabilities, or functions
rather than individuals.
Fig. 1 a Schematic representation of a typical organizational chart for a fully matrixed organization.
Each employee reports to one line and one program manager, and line and program managers
independently report to upper management. b The idealized communication pattern that results from a.
Dotted line indicates less frequent communication. c The actual communication pattern at LANL, revealed
through analysis of email data. (UM = upper management, PM = program/project management,
LM = line management, E = employee.)
2.1 Structural Relationships Between Elements of the Organization
Our analysis of structural relationships within LANL focuses on two broad,
cross-cutting distinctions: program versus line organizations, and technical research
and development functions versus operations functions (safety, physical plant, etc.).
LANL is a hybrid matrix management organization. In a fully matrixed organi-
zation, each employee has two managers: a line manager and a program or project
manager (Fig. 1a). The employee is assigned to a line management unit based on their
skill set and capabilities. For example, a computer scientist might be assigned to a
Computational Modeling group, or an engineer to a Structural Engineering group.
Line management plays little or no role in guiding the day-to-day work of employ-
ees, however. Instead, the employee is assigned to work on one or more projects,
each of which is supervised by a program or project manager. A project is generally
directed toward a specific product or deliverable, such as design of a particular model
of aircraft or completion of a particular research task. The day-to-day work of the
employee toward these particular goals is directed by the program or project man-
ager. Both line and program managers usually report, through some management
chain, to upper level general managers. The idealized communication pattern that
results is one in which program and line managers communicate primarily vertically,
interacting with both upper management and employees (Fig. 1b). In order to keep
things running smoothly, however, program and line managers must also periodically
communicate laterally, to ensure a good fit between capabilities and projects.
The matrix management model became popular in the aerospace industry with
the rise of program management in the 1950s, and was in part influenced by the
organizational structure of the Manhattan Project [3], in which Los Alamos played
a major role. At LANL today, line and program organizations play less distinct
roles. The base-level line units that house most employees are called groups, which
may be built around programs or capabilities. In our analysis, we draw a distinction
between groups and higher-level line management organizations, which aren't
directly involved in technical or operations work. Program organizations play a vari-
ety of coordinating roles among groups, management, and outside organizations,
and sometimes conduct technical or operations work as well. Despite this flexible
definition, our analysis reveals that technical program organizations occupy a very
well-defined structural space within the organization as a whole.
Our analysis of email traffic between organizational units at LANL is based on a
complete email record for a 25-day period in 2011. This time period was selected primarily
based on practical considerations of data availability; it is possible that other
time periods would yield somewhat different results [21, 35]. In order to locate indi-
viduals within organizational structures, we used organizational telephone directory
data to associate email addresses with low-level organizational units, and informa-
tion from organization charts to generate mappings of these units to higher-level
ones. We included only those email addresses that corresponded to an individual in
the LANL employee directory, thereby excluding mailing lists and external corre-
spondents. The resulting dataset comprises approximately 3 million emails between
12,000 addresses. This is a relatively large organizational communication network
compared to others described in the literature. For example, one of the data sets ana-
lyzed in [24] is an email network for a scientific research organization that appears
comparable to ours. This network consists of approximately 3 million total emails col-
lected over 18 months, but covers only 1,200 internal organizational email addresses.
There are a few examples of analysis of much larger email networks: [21] uses a data
set covering 43,000 addresses at a university over one year, while [19] is based on
emails among over 100,000 employees of a multinational corporation over a period
of 5 months.
Fig. 2 Email traffic between organizational units at LANL, using a force-vector layout. Node
size represents betweenness centrality. Edge color is a mix of the colors of the connected nodes.
Although individual edges are difficult to discern at this scale, the overall color field reflects the
type of units that are most connected in a given region
Figure 2 shows email traffic between organizational units, laid out using a
force-vector algorithm. By aggregating email traffic this way, we in effect apply
a block model in which groupings are pre-specified by formal organizational position.
We chose not to take a generalized block modeling approach [9] because our
primary goal is to understand how pre-defined organizational units interact. Organizations
are colored according to the technical/operational and line/program classification
described above, and their sizes represent betweenness centrality. There are
some visible patterns in this layout. First, a number of operations groups have the
highest betweenness centrality, reflecting their role as key intermediaries or brokers
in the network. Ranking the nodes by betweenness centrality confirms this: 17 of the
top 20 nodes are operations organizations. The central position of these organizations
probably reflects the fact that they provide services to most of the other organizational
units at the laboratory. In addition, operations units and technical units occupy distinct
portions of the graph; this indicates that there is generally more interaction within
these categories than between them. The highly central operations groups appear
to play a bridging role between the two categories. Administration units appear to
be somewhat more closely associated with technical units than operations units,
although this is difficult to state with certainty.
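The aggregation step described above, in which person-level email traffic is collapsed into traffic between pre-specified organizational units, can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' pipeline: the `(sender, recipient)` pair format, the `unit_of` directory mapping, and all names in the toy data are hypothetical.

```python
from collections import Counter

def aggregate_by_unit(emails, unit_of):
    """Collapse person-to-person emails into unit-to-unit edge weights.

    emails  -- iterable of (sender, recipient) address pairs (assumed format)
    unit_of -- dict mapping an address to its organizational unit
    Addresses missing from unit_of (for example mailing lists or external
    correspondents) are skipped, mirroring the directory filter in the text.
    """
    weights = Counter()
    for sender, recipient in emails:
        if sender in unit_of and recipient in unit_of:
            weights[(unit_of[sender], unit_of[recipient])] += 1
    return weights

# Toy data, invented purely for illustration.
unit_of = {"ann": "Ops-1", "bob": "Ops-1", "cat": "Tech-A", "dan": "Tech-B"}
emails = [("ann", "cat"), ("bob", "cat"), ("cat", "dan"),
          ("dan", "cat"), ("ann", "bob"), ("ann", "some-list")]
unit_edges = aggregate_by_unit(emails, unit_of)
```

The resulting directed, weighted edge list is the kind of input on which unit-level centrality measures and layouts could then be computed. Traffic within a unit is kept here as a self-loop; whether to keep or drop it is a modeling choice the text does not specify.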
Fig. 3 Email traffic between organization types at LANL. Node diameter represents total degree
(i.e. total number of incoming and outgoing emails) of the node; edge width represents email volume
in the direction indicated
Some of the ambiguities in interpretation can be clarified by grouping all units in
a given category into a single node, resulting in the 7-node graph shown in Fig. 3.
This view, which uses a simple circular layout, reveals that there is a large amount of
email traffic (in both directions) on the technical side of the organization along the
path Administration-Management-Program-Group, and relatively little traffic
between these entities along any other path. The operations side of the organization
does not display this pattern, indicating that relationships between groups, programs,
and management are more fluid there. The strength of the ties between technical pro-
gram organizations and both technical groups and technical management, in the
absence of a strong direct tie between technical groups and technical management,
suggests that technical program organizations serve as a broker between these ele-
ments of the organization. This contrasts with the role program organizations play in
a true matrix organization, where they represent an independent chain of command
from line management. The structure of this relationship at LANL is depicted in
Fig. 1c.
Figure 3 also indicates that operations organizations have lower overall volumes
of incoming and outgoing email than technical organizations, even though there are
similar numbers of employees in each category [18]. There could be a number of rea-
sons for this. Operational knowledge may be less complex and more readily codified
than technical knowledge, reducing the need for strong interactional ties. Alterna-
tively, the nature of operational work, which can take place in the field and involve
significant manual labor and use of machinery, may inhibit email communication.
Some workers may not have constant access to email during working hours, and
communication needs may be more localized and readily satisfied by direct personal
interaction. Additional research would be required to fully explore these possibilities.
Another way of understanding the roles different types of organizational units play
is in terms of their relationships with outside entities. Figure 4 plots the number of
emails each type of organization sends and receives to/from commercial versus non-
commercial domains. This indicates that all types of operational units communicate
significantly more with commercial entities, which is probably driven by relation-
ships with suppliers and contractors. Technical groups, technical management, and
administration communicate about equally with commercial and non-commercial
domains. The outlier here is technical programs, which communicate more with
external addresses than any other type of organizational unit, and are much more
highly connected to non-commercial domains.
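The comparison of commercial and non-commercial correspondents can be illustrated with a short sketch. The two top-level-domain sets follow the lists given in the Fig. 4 caption (the caption's "etc." is not expanded); the function names, the input format, and the toy addresses are assumptions made for illustration only.

```python
from collections import Counter

# TLD lists taken from the Fig. 4 caption; anything else is labeled "other".
COMMERCIAL = {"com", "net", "info"}
NON_COMMERCIAL = {"gov", "edu", "mil"}

def domain_class(address):
    """Classify an email address by its top-level domain."""
    tld = address.rsplit(".", 1)[-1].lower()
    if tld in COMMERCIAL:
        return "commercial"
    if tld in NON_COMMERCIAL:
        return "non-commercial"
    return "other"

def external_traffic(emails, unit_type_of):
    """Count emails exchanged between each internal unit type and each
    external domain class. Purely internal or purely external emails are
    ignored. unit_type_of maps internal addresses to an organization type."""
    counts = Counter()
    for sender, recipient in emails:
        if sender in unit_type_of and recipient not in unit_type_of:
            counts[(unit_type_of[sender], domain_class(recipient))] += 1
        elif recipient in unit_type_of and sender not in unit_type_of:
            counts[(unit_type_of[recipient], domain_class(sender))] += 1
    return counts

# Hypothetical toy data.
unit_type_of = {"alice": "technical program", "bob": "operations group"}
emails = [("alice", "x@univ.edu"), ("y@agency.gov", "alice"),
          ("bob", "s@vendor.com"), ("alice", "bob")]
traffic = external_traffic(emails, unit_type_of)
```

Sent and received emails are pooled into one count per pair, matching the "to/from" totals plotted in Fig. 4.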
These findings suggest that program organizations at LANL occupy the gatekeeper
position described in [1, 33]: they serve as brokers between organizational levels,
as well as a key link between the laboratory and the outside world, particularly
non-commercial entities like academic institutions and other government agencies.
Their position between upper management and technical work organizations may
reflect their role in translating between management coding schemes and those of
technical domain experts, while their position between LANL and external entities
suggests a broader role in translating between internal and external coding schemes.
There are a number of possible applications of this kind of analysis. Studies
have shown that individuals, including managers, are not always accurate in their
perceptions of the structure of informal networks in their organizations, beyond the
individuals with whom they regularly interact [23]. Quantitative network analysis
and visualization can therefore provide significant, data-driven insights that are not
ordinarily available to managers and other employees in organizations. The findings
presented here show that program organizations at LANL have shifted from their
original role as one axis of a management matrix scheme to a role as organizational
gatekeepers. In an organization undergoing this kind of shift, some managers or
workers may not be completely aware of the nature of the change. In that case,
this kind of analysis can provide insights into how to effectively interact with and
make use of program organizations. For example, the manager of an administration
unit could hypothetically fill a structural hole by developing direct contacts with
key program units, in order to gain more insight into the organization's external
relationships. Alternatively, in some organizations, a shift in the nature of program
management might pose problems: for example, if management expects program
managers to play an active role in matrix management, their role as gatekeepers
might conflict with organizational needs. In such a case, analysis and visualization
of network relationships between organizational levels could provide a basis for
accurate organizational assessment and realignment.
Fig. 4 Total emails to/from commercial (.com, .net, .info) versus non-commercial
(.gov, .edu, .mil, etc.) domains, by organization type
2.2 Structural Relationships Within Organizational Units
Fig. 5 Email network for 2 week period in smaller group. Size of a node is proportional to logarithm
of its betweenness centrality. Nodes with different colors correspond to different communities that
were identified by application of the Girvan-Newman algorithm to the group's email network
[12, 15]. Link widths are proportional to the logarithm of the number of emails exchanged along
these links. The network was visualized by assigning repulsion forces among nodes and spring
constants proportional to the link weights, and then finding an equilibrium state
We conducted a small exploratory study to demonstrate use of email network analysis
to visualize relations among members of an organizational unit. Figures 5 and 6 show
email networks that were obtained from email exchange records among the members
of two LANL groups over a period of two weeks. We intentionally chose groups
that do similar work (theoretical research). In the smaller group in Fig. 5, the two
nodes with highest betweenness centrality are group managers, and the third is technical
support staff. Thus, the group has a relatively unified hierarchical structure
with management and support staff at the center. In the larger group, managers were
still among the most central nodes, but many other nodes had similar betweenness
centrality (Fig. 6). These include administrative assistants, seminar organizers, and
several project leaders. This indicates a flatter, less centralized organizational struc-
ture. In order to explore group structure, we applied the Girvan-Newman community
detection algorithm to each graph [12]. For the first group, this algorithm identified
four communities, the significance of which is not clear to us; for the second group, it
revealed two main communities that correspond to two previous groups that merged
to form the current group. These interpretations could be expanded by use of alterna-
tive centrality measures and comparison of various community detection methods.
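To make the community detection step concrete, the following is a minimal sketch of one Girvan-Newman split: edges with the highest shortest-path betweenness are removed until the graph falls into more components than it started with. It is a simplification under stated assumptions (an undirected, unweighted graph without self-loops, and a single split rather than the full hierarchy of splits), and it is not the implementation used for Figs. 5 and 6. All function names are illustrative.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes-style shortest-path edge betweenness for an undirected,
    unweighted graph given as {node: set(neighbors)}. Each unordered node
    pair is visited from both ends, so scores are twice the usual value;
    the ranking of edges is unaffected."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the component count rises."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break  # no edges left to remove
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy graph: two triangles joined by the single bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
communities = girvan_newman_split(adj)
```

On the toy graph the bridge carries every shortest path between the two triangles, so it is removed first and the triangles emerge as the two communities. Betweenness is recomputed after each removal, as the algorithm requires.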
Fig. 6 Email network for 2 week period in larger group
3 Node Connectivity Distribution as a Function of Organizational Hierarchy
Several network types, including biological metabolic networks [31], the World Wide
Web, and actor networks [30], are conjectured to have power law distributions of
node connectivity. In the case of metabolic networks, the interpretation of scale free
behavior is complicated by the lack of complete knowledge and relatively small sizes
(about $10^3$ nodes) of such networks, while the mechanisms of self-similarity in many large
social networks are still the subject of debate. However, organizational hierarchy has
been shown to generate degree distributions for contacts between individuals that
follow power laws [2].
Managers prefer to use email to communicate with subordinates in many different
communication contexts [25]. We propose that, in addition to the general effects of
organizational hierarchy, particular email communication practices of managers may
provide an underlying mechanism that generates power law distributions in node con-
nectivity of organizational email networks. To explore this possibility, we develop a
scale-free behavioral model that considers the effects of mass email announcements
sent by managers to subordinates. In this model, the self-similarity of the connec-
tivity distribution of the email network is a consequence of the static self-similarity
of the management structure, rather than resulting from a dynamic process, such
as preferential attachment [26] or optimization strategies [27]. More specifically,
self-similarity is due to the ability of a manager to continuously and directly com-
municate only with a relatively small number of people, while communications with
other employees have to be conveyed in the form of broad announcements.
Suppose that the top manager in an organization sends emails to all employees
from time to time. This manager must correspond to the node in the email network
that has the highest connectivity $N$. Suppose that the top manager also talks directly (in
person) to $l$ managers who are only one step lower in the director's hierarchy (let's
call them 1st level managers). Each of those 1st level managers, presumably, controls
their own subdivision in the organization. Assuming roughly equal spans of managerial
control, we can expect that, typically, one 1st level manager sends emails
to $N/l$ people. In reality, each manager also has a support team, such as assistants,
administrators, technicians, etc., who also may send announcements to the whole
subdivision.
Let us introduce a coefficient $a$ which says how many support team employees are
involved in sending global email announcements in the division on the same scale
as their manager. We can then conclude that at the 1st level from the top there are $al$
persons who send emails to $N/l$ employees at a lower level.
Each 1st level manager controls $l$ 2nd level ones and we can iterate our arguments,
leading to the conclusion that there should be $(al)^2$ managers on the 2nd level who
should be connected to $N/l^2$ people in their corresponding subdivisions. Continuing
these arguments to the lower levels of the hierarchy, we find that, at a given level
$x$, there should be $(al)^x$ managers (or their proxies) who write email announcements
to $N/l^x$ people in their subdivision.
Consider a plot that shows the number of nodes $n$ versus the weight of those
nodes, i.e. their outdegree $w$. Considering previous arguments, we find that the weight
$w = N/l^x$ should correspond to $n = (al)^x$ nodes. Eliminating the variable $x$, we
find
$$\log(n) = \frac{\log(al)}{\log(l)} \left( \log(N) - \log(w) \right), \qquad (1)$$
where log is the natural logarithm.

Equation (1) shows that the distribution of connectivity, $n(w)$, in a hierarchical
organizational email network should generally be a power law with exponent
$\log(al)/\log(l) > 1$. Obviously, at some level $x$, this hierarchy should terminate around the
point at which $(al)^x = N/l^x$, because the number of managers should not normally
exceed the number of employees. Hence the power law (1) is expected to hold only
for nodes with heavy weights, e.g. $w > 50$, i.e. for nodes that send announcement-like
one-to-many communications, and at lower $w$ this model predicts a transition to some
different pattern of degree distribution. At this level, it is likely that non-hierarchical
communication patterns begin to dominate in any case.
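The relation between the level-by-level counts and Eq. (1) can be checked numerically with a small sketch. The parameter values below are arbitrary illustrations rather than estimates for any organization, and the function names are introduced here only for the example.

```python
import math

def hierarchy_levels(N, l, a, max_level):
    """Return (w, n) for each level x = 0..max_level of the model:
    w = N / l**x recipients per sender, n = (a*l)**x senders."""
    return [(N / l**x, (a * l)**x) for x in range(max_level + 1)]

def predicted_log_n(w, N, l, a):
    """Right-hand side of Eq. (1), using natural logarithms."""
    return math.log(a * l) / math.log(l) * (math.log(N) - math.log(w))

# Illustrative values only (not fitted to any data).
N, l, a = 4096, 4, 2
levels = hierarchy_levels(N, l, a, max_level=2)
exponent = math.log(a * l) / math.log(l)  # power-law exponent of n(w)
residuals = [math.log(n) - predicted_log_n(w, N, l, a) for w, n in levels]
```

For these values each level multiplies the number of senders by $al = 8$ while dividing their audience by $l = 4$, so the exponent is $\log 8 / \log 4 = 1.5$ and every level lies exactly on the line given by Eq. (1). The level range stops at 2 because at level 3 the number of senders would already exceed the number of recipients per sender, which is the termination condition discussed above.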
In order to compare this model to actual network data, we analyzed the statistics
of node connectivity in email records at LANL during a two-week time interval
(Fig. 7). We removed nodes not in the domain lanl.gov and cleaned the database of
various automatically generated messages, such as bouncing emails that do not find
their target domain. In this case, however, we kept domains that did not correspond to
specific employees, in order to preserve emails from mailing lists that managers may
use to communicate with employees. Our remaining network consists of $N \approx 32{,}000$
nodes, which is still about three times the number of employees at LANL. This is
partially attributed to the fact that we included addresses not tied to individuals, and
also the fact that a significant fraction of employees have more than one email address
for various practical reasons.
Fig. 7 Log-log plot of the distribution of the number of nodes $n$ having the number
of out-going links $w$
Fig. 8 Zoom of Fig. 7 for $w > 40$. Red line is a linear fit corresponding to
$\log(n) \approx 14.0 - 2.47 \log(w)$
Numerical analysis, in principle, should allow us to obtain information about
parameters $l$, $x$ and $a$, from which one can make some very coarse-grained conclusions
about the structure of the organization. Such an analysis should, of course,
always be applied with a certain degree of skepticism due to potential issues with
data quality, the simplicity of the model, and logarithmic dependence of the power
law on some of these parameters [6]. We found that our data for $w > 40$ could be
well fitted by $\log(n) \approx 14.0 - 2.47 \log(w)$ (Fig. 8). If, e.g., we assume $l = 4$, then
$a \approx 7$, i.e. each manager typically has the support of $a - 1 = 6$ people, who help
her post various announcements to her domain of control. The power law should
terminate at the level of hierarchy $x$ given by $(al)^x = N/l^x$, which corresponds to
$x \approx 3$, i.e. the email network data suggest that there are typically $x = 3$ managers
of different ranks between the working employee and the top manager of the organization.
The typical number of email domains to which the lowest rank manager
sends announcements is $w_{\min} \approx N/l^x \approx 48$. This should also be the degree of the
nodes at which the power law (1) should no longer be justified. Indeed, we find the
breakdown of the power law (1) at $w < 40$. This estimate also predicts that a typical
working employee receives emails from $(x + 1)a = 28$ managers or their support
teams.
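The step from a fitted slope to the support coefficient $a$ can be sketched as follows. The least-squares routine and the synthetic data points are assumptions for illustration (the text does not say how the fit in Fig. 8 was obtained); only the slope $-2.47$, the intercept $14.0$, and the choices $l = 4$, $x = 3$, $a = 7$ come from the text.

```python
import math

def fit_loglog(ws, ns):
    """Ordinary least-squares fit of log(n) = c + s*log(w); returns (c, s)."""
    xs = [math.log(w) for w in ws]
    ys = [math.log(n) for n in ns]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s = sxy / sxx
    return my - s * mx, s

def support_coefficient(slope, l):
    """Solve -slope = log(a*l)/log(l) for a, which gives a = l**(-slope - 1)."""
    return l ** (-slope - 1)

# Synthetic points lying exactly on the fitted line quoted in the text.
ws = [50, 100, 200, 400]
ns = [math.exp(14.0) * w ** -2.47 for w in ws]
intercept, slope = fit_loglog(ws, ns)

a_est = support_coefficient(-2.47, 4)
senders_per_employee = (3 + 1) * 7  # (x + 1) * a with x = 3, a = 7 as in the text
```

With $l = 4$ the unrounded value of `a_est` comes out somewhat above 7, which the text quotes as $a \approx 7$. Because $a$ depends exponentially on the slope, small changes in the fit shift it noticeably, which is one reason for the skepticism urged above.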
Fig. 9 The frequency of non-manager nodes receiving emails from a given number of different
managers during the considered time interval. Managers are defined as nodes sending emails to
more than 45 different addresses
Comparing these results to the actual structure of the organization
is very difficult due to the lack of empirical data on many of the model parameters.
However, email data does enable us to independently test the above prediction of
28 managers (or their surrogates) sending emails to the lowest rank employees.
For purposes of analysis, we define managers as individuals sending emails to more
than 45 different addresses during the time interval represented by our data (i.e.
belonging to the power law tail of the distribution). We then produced a histogram
of the distribution of the number of emails sent to each non-manager by managers.
(In reality this corresponds to the number of emails non-managers receive from
managers as well as their surrogates.) Figure 9 shows that this distribution does indeed
peak near the mean value 26, which closely agrees with the model prediction. This
result validates our choice for l, which we set to four in previous calculations, and
shows that the model is generally consistent with our email data. One can also
see from Fig. 9 that email network characteristics, such as the number of emails
employees receive from managers, are described by a distribution rather than a single
number. Our model cannot predict the structure of such distributions. Rather, it is
useful as a relatively simple model that can recognize hierarchical features that may
be typical for email networks of large organizations. Future validation efforts could
involve collecting additional data to measure the actual values of parameters l, x and
a for LANL and other organizations, as well as characterizing patterns of mass email
usage in more detail.
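The test behind Fig. 9 can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' code: the function name `manager_histogram`, the `(sender, recipient)` input format, and the toy data are all hypothetical. It labels as managers the senders who reach more than `threshold` distinct addresses, then counts, for each non-manager, how many different managers wrote to them.

```python
from collections import Counter, defaultdict

def manager_histogram(emails, threshold=45):
    """emails: iterable of (sender, recipient) pairs.

    Returns a Counter mapping 'number of distinct managers' to the number
    of non-manager recipients who heard from that many managers.
    Non-managers who received nothing from any manager are not counted.
    """
    emails = list(emails)  # the data is traversed twice
    recipients_of = defaultdict(set)
    for sender, recipient in emails:
        recipients_of[sender].add(recipient)
    managers = {s for s, rs in recipients_of.items() if len(rs) > threshold}

    managers_seen_by = defaultdict(set)
    for sender, recipient in emails:
        if sender in managers and recipient not in managers:
            managers_seen_by[recipient].add(sender)
    return Counter(len(ms) for ms in managers_seen_by.values())

# Toy data with a lowered threshold so that m1 and m2 qualify as managers.
toy = [("m1", "a"), ("m1", "b"), ("m1", "c"),
       ("m2", "a"), ("m2", "b"), ("m2", "d"),
       ("a", "b")]
hist = manager_histogram(toy, threshold=2)
```

In the toy data, recipients `a` and `b` hear from two managers each and `c` and `d` from one each; with real data the peak of this histogram is what would be compared with the model prediction.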
Fig. 10 The number of emails sent per minute (top) and number of addresses sending email per
minute (bottom) over a one week time interval
4 Email Traffic in Real Time
Figure 10 shows total email traffic and number of addresses sending email over one
week with a one minute resolution. Working days have a bi-modal distribution with
heaviest activity at the beginning and end of the day. The lower level of activity
on Friday is related to an alternative work schedule that most LANL employees
follow. This schedule enables employees to take every other Friday off in exchange
for working longer hours Monday–Thursday. As a consequence, only slightly more
than 50 % of the workforce is at work on a given Friday. This is directly reflected in
the amount of email traffic on Fridays.
5 Conclusion
Visualizing and modeling email traffic in complex organizations remains a challenging
problem. Visualizing email data in terms of formal organizational units reduces com-
plexity and provides results that are more intelligible to organization members and
analysts interested in understanding organizational structure at a macro level. For
predicting the degree distribution of high-degree nodes in an organization, we find
that it is useful to take into account both organizational hierarchy and email-specific
behavior (in particular, the use of mass emails within line management chains). These
findings suggest that considering information about formal organizational structures
alongside email network data can provide significant new insights into the function-
ing of large, complex organizations.
References
1. Allen TJ, Cohen SI (1969) Information flow in research and development laboratories. Adm
Sci Q 14(1):12–19
2. Barabasi A-L, Ravasz E, Vicsek T (2001) Deterministic scale-free networks. Phys A 299:
559–564
3. Bugos GE (1993) Programming the American aerospace industry, 1954–1964: the business
structures of technical transactions. Bus Econ Hist 22:210–222
4. Burt RS (1992) Structural holes: the social structure of competition. Harvard University Press,
Cambridge
5. Chapanond A, Krishnamoorthy MS, Yener B (2005) Graph theoretic and spectral analysis of
Enron email data. Comput Math Organ Theory 11:265–281
6. Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM
Rev 51:661–703
7. Collins HM (1985) Changing order: replication and induction in scientific practice. Sage,
London
8. Diesner J, Frantz TL, Carley KM (2005) Communication networks from the Enron email corpus
"It's always about the people. Enron is no different". Comput Math Organ Theory 11:201–228
9. Doreian P, Batagelj V, Ferligoj A (2005) Generalized blockmodeling. Cambridge University
Press, Cambridge
10. Freeman LC (2009) Methods of social network visualization. In: Meyers RA (ed) Encyclopedia
of complexity and systems science. Springer, Berlin, pp 2981–2998
11. Gillespie R (1991) Manufacturing knowledge: a history of the Hawthorne experiments. Cambridge University Press, Cambridge
12. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc
Natl Acad Sci 99:7821–7826
13. Granovetter MS (1973) The strength of weak ties. Am J Sociol 78(6):1360–1380
14. Hansen MT (1999) The search-transfer problem: the role of weak ties in sharing knowledge
across organization subunits. Adm Sci Q 44(1):82–111
15. Hansen DL, Shneiderman B, Smith MA (2011) Analyzing social media networks with NodeXL:
insights from a connected world. Elsevier, Burlington
16. http://gephi.github.io/
17. http://www.cytoscape.org/
18. http://www.lanl.gov/about/facts-figures/talent.php
19. Karagiannis T, Vojnovic M (2008) Email information flow in large-scale enterprises. http://
research.microsoft.com/pubs/70586/tr-2008-76.pdf
20. Kilduff M, Tsai W (2003) Social networks and organizations. Sage, London
21. Kossinets G, Watts DJ (2006) Empirical analysis of an evolving social network. Science 311:
88–90
22. Krackhardt D (1992) The strength of strong ties: the importance of Philos in organizations. In:
Nohria N, Eccles RG (eds) Networks and organizations: structure, form, and action. Harvard
Business School Press, Boston
23. Krackhardt D, Hanson JR (1993) Informal networks: the company behind the chart. Harvard
Bus Rev 71(4):104–111
24. Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking
diameters. ACM Trans Knowl Discov Data 1(2)
25. Markus ML (1994) Electronic mail as the medium of managerial choice. Organ Sci 5:502–527
26. Mitzenmacher M (2004) A brief history of generative models for power-law and lognormal
distributions. Internet Math 1:226–251
27. Papadopoulos F, Kitsak M, Serrano MA, Boguna M, Krioukov D (2012) Popularity versus
similarity in growing networks. Nature 489:537–540
28. Phelps C, Heidl R, Wadhwa A (2012) Knowledge, networks, and knowledge networks: a review
and research agenda. J Manag 38:1115–1166
29. Polanyi M (1966) The tacit dimension. Doubleday, Garden City
30. Ravasz E, Barabasi A-L (2003) Hierarchical organization in complex networks. Phys Rev E
67:026112
31. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi A-L (2002) Hierarchical organization
of modularity in metabolic networks. Science 297:1551–1555
32. Sims BH, Sinitsyn N, Eidenbenz SJ (2013) Visualization and modeling of structural features
of a large organizational email network. In: Proceedings of the 2013 IEEE/ACM international
conference on advances in social networks analysis and mining. ACM, New York, pp 787–791
33. Tortoriello M, Reagans R, McEvily B (2012) Bridging the knowledge gap: the influence of
strong ties, network cohesion, and network range on the transfer of knowledge between organizational units. Organ Sci 4:1024–1039
34. Wuchty S, Uzzi B (2011) Human communication dynamics in digital footsteps: a study of the
agreement between self-reported ties and email networks. PLoS ONE 6(11):e26972
35. Zeini S, Göhnert T, Hoppe U, Krempel L (2012) The impact of measurement time on subgroup
detection in online communities. In: Proceedings of the 2012 IEEE/ACM international conference on advances in social networks analysis and mining. IEEE, Los Alamitos, pp 389–394
Overlaying Social Networks of Different
Perspectives for Inter-network Community
Evolution
Idrissa Sarr, Joseph Ndong and Rokia Missaoui
Abstract In many real-life social networks, a group of individuals may be involved
in multiple kinds of activities such as professional, leisure and friendship ones. Even
though individuals may belong to a social network with a very precise type of links
such as professional ties in LinkedIn, the interactions that may happen in other social
networks such as Facebook are not reflected in the original network. We believe that
overlaying networks with various types of links helps discover interesting patterns.
The objective of this paper is then to overlay two or more social networks with
different kinds of social activities in order to unveil homogeneous groups that could
not appear in a single social network. To that end, we propose a community detection approach based on possibility theory, which identifies time-based perspective
communities for each kind of social activity that occurs within a sequence of time
windows. Furthermore, different perspectives are layered to detect communities that
may belong to several networks in a given time period. Discovered communities in
a given network for a time period can be perceived as views or perspectives in one
or many networks.
Keywords User behavior analysis · Perspective community · Community evolution · Possibility theory · Active/passive social actors
I. Sarr (✉) · J. Ndong
Université Cheikh Anta Diop, Avenue Cheikh Anta Diop, BP 5005, Fann Dakar, Senegal
e-mail: idrissa.sarr@ucad.edu.sn
J. Ndong
e-mail: joseph.ndong@ucad.edu.sn
R. Missaoui
Université du Québec en Outaouais, Québec, Canada
e-mail: rokia.missaoui@uqo.ca

DOI 10.1007/978-3-319-12188-8_3
46 I. Sarr et al.
1 Introduction
The interactions or relationships between individuals of a social network can be
used to group actors into homogeneous communities with similar contact patterns or
interests. Many studies tackle the problem of identifying communities in a network
[6, 7, 15, 22, 26, 28]. Some of these studies allow for overlapping communities
while others consider only disjoint communities.
To analyze network properties, a common approach is to treat the
network as a static view in which all the links of the final network are already present
throughout the study. However, this is a strong simplifying assumption that is only
reasonable for a network which is built at once and does not evolve frequently over
time. If the network changes over time, it is worthwhile to take into account
the fact that ties may be temporary and that some network features can change from
one time period to another.
Therefore, it is important to analyze network evolution and consider a set of time
windows in order to assess how the network changes over time and consequently
discover changes in ties between nodes.
1.1 Motivations
Recent studies propose to analyze dynamic networks [1, 8, 17] to detect community
evolution. Most of these studies use topological properties to identify the updated
parts of the network and characterize the type of changes such as network shrinking,
growing, splitting, and merging [3].
There are many studies about community detection [6, 21]. A well-known
approach for community detection is described in [7] and is based on the intuition
that groups within a network may be detected through natural divisions among
the vertices, without having to set the number of groups or put restrictions on their
size.
Many other approaches have been developed for tracking the evolution of social
communities over time [1, 16, 23, 27]. To that end, they use several static views of
the network at different time slots. For each view, one may use an existing community
detection algorithm [6] to depict the community topology. Therefore, between two
time points, changes may occur such as a network growth or partition. Most of the
new community detection approaches are devised on an underlying event framework
that defines a specific behavior of a community like birth, growth, and merging in
network evolution [1].
More recent approaches study different issues for heterogeneous information
networks [25] which contain more than one type of links or nodes. Each type of link
indicates a specific relationship between actors. A simple example is a network that
describes two types of nodes: Researcher and Publication and two categories of links:
collaboration between researchers and authorship between researchers and publica-
tions. Indeed, the authors in [25] report different studies on mining and analyzing
such networks and tackle many challenging issues such as dynamic network/group
detection, behavior analysis of an actor over time based on the network content or
the actions of other actors [11], relationship prediction, node ranking combined with
clustering (or classification), and similarity search (e.g., look for researchers who
have similar profiles).
In [11], the authors rely on social bookmarking to analyze communities over time.
The approach assumes that aggregating the non coordinated tagging actions of a
large and non homogeneous group of actors can be exploited for enhanced knowledge
discovery and sharing. Therefore, based on the tags and the actors who choose them,
they provide a framework for community-based organization of web resources.
To summarize, tracking community evolution makes it possible to foresee the overall
trend of a group and to anticipate the positive or negative effects it may lead to. For
example, detecting the growth of a botnet at an early stage may help foresee criminal
or suspicious attacks. The approach proposed in the present work is closely related to
recent approaches that monitor evolving networks, since it relies entirely on
actor behavior with respect to the activities that occur in a single network or even in
many networks. Moreover, contrary to most studies, we set the relevance of
social activities using possibility theory, which helps find communities more
accurately.
1.2 Contributions
In this paper, we do not focus directly on detecting community evolution, as
is often the case in the literature; instead, we aim to track temporary communities,
which are built based on temporary ties created between a set of actors during a
time slot. Basically, we assume that actors may have temporary links (e.g., during a
set of activities) that might disappear afterwards. Such links are mined in order to
extract dominant features of the network like temporary communities that we call
perspective communities. Moreover, we use temporary links to identify active and/or
passive actors.
Our approach relies on the methods described in [18, 24] where the authors
identify a social network from collected temporary data. Moreover, most of the
solutions proposed to detect communities generally use statistical inference meth-
ods based on the probability theory which achieves relatively good performance.
In most of the cases, modeling processes are built to get results with high proba-
bilities (90 %). In this work, we try to go beyond such techniques and our main
contributions can be summarized as follows:
• A method to track changes within a social network by identifying temporary links
established between actors during activities in a given set of time slots. The temporary
links are obtained using probability and mined afterwards in order to extract
dominant features of the network such as perspective communities.
• A relationship prediction method based on possibility distributions to overlay a
set of networks in order to unveil hidden communities. Our approach is based
on a very simple principle relating probability and possibility, which may be stated
informally as: what is probable should be possible. Using possibility
rather than probability theory helps cope with the incompleteness and
uncertainty of the data from which prediction is conducted.
Consequently, the approach can detect more precise temporary
links as well as perspective communities that highlight the dynamic changes in
one or many networks over time.
The rest of this paper is structured as follows: Sect. 2 gives basic concepts and
definitions about social networks. Sections 3–5 present a mechanism to detect the
network evolution over time and mainly how we figure out active nodes and virtual
communities. Section 6 covers the approach validation while Sect. 7 summarizes our
contribution and presents future work.
2 Basic Concepts and Definitions

We consider a social network S as a graph $G = (V, E)$ where vertices in V are
actors such as individuals or organizations, and links/edges in E are interactions or
ties between actors (e.g., friendship, collaboration). In this paper, we assume that all
the links between actors are symmetric and unweighted. We plan to extend the
present work to weighted and directed graphs.
2.1 Activity
An activity is a social or professional event or task conducted by users. It could
be a meeting, conference, festival, concert, post, image publication, tweet/re-tweet,
etc. Inside a community or a whole network, activities are numbered and tags are
associated with them. For example, a tag in the Twitter micro-blogging platform may
be sport, high technology, culture, movie, etc.
Furthermore, actors may or may not be involved in a given activity. Formally, the
behavior of an actor k with respect to an activity $a_i$ is represented as:

$$b_k(a_i) = \begin{cases} 1 & \text{if actor } k \text{ attends activity } a_i \\ 0 & \text{otherwise.} \end{cases}$$

To track activities over time, we consider that they happen in a given time window
$\omega_j = [T_j, T_j + \Delta]$, where $\Delta$ is the window length. For each window, we capture a snapshot of activities which may
be of different types. To illustrate our approach, we consider a collaboration network
of researchers. Basically, the network is drawn based on co-authorship patterns, and
we track the co-participation of actors in activities such as meetings, conferences or
social events. Moreover, we assume that ten activities happen within a single time
window. Table 1 depicts the matrix that shows the participation of researchers in a
set of activities. One may see that Researcher 1 takes part in activities $a_2$, $a_3$, $a_5$, $a_8$,
and $a_{10}$ since $b_1(a_i)$ is equal to 1 for these activities.
Table 1 Participation of actors to activities

Actor   a1  a2  a3  a4  a5  a6  a7  a8  a9  a10
1       0   1   1   0   1   0   0   1   0   1
2       0   1   0   1   1   1   0   1   0   1
3       0   1   1   0   1   1   0   1   0   1
4       1   0   0   1   0   0   1   0   1   0
5       1   1   0   1   0   0   1   0   1   0
6       1   0   1   1   0   0   1   0   1   0
7       1   1   1   0   1   1   0   1   0   1
8       1   0   0   1   0   0   1   1   1   0
9       1   1   1   0   1   1   0   1   0   1
2.2 Perspective Community
Actors participating in activities may have joint interactions. For example, actors may
come to interact or collaborate during a meeting or a conference. Such interactions
are considered as temporary since they are established during a time period and may
be broken later on. With this in mind, we define a perspective community as a set
of participating actors and the temporal ties they share for joint activities performed
during a given time period.
3 Tracking Node Behavior Over Time
The goal of this section is to describe how we track the behavior of network nodes over
time in order to identify active and passive actors. The advantage of such identification
is also discussed.
3.1 Identification of Active and Passive Actors
An actor in a network is considered as active during a time window $\omega_j$ if he attends
all or most of the activities that happen in $\omega_j$. Formally, if $A_j = \{a_1, a_2, \ldots, a_n\}$ is
the set of activities that happen within the $\omega_j$ interval, an actor k is active within $\omega_j$
if the following inequality holds:

$$\frac{|A_j| - \sum_{i=1}^{n} b_k(a_i)}{|A_j|} \le r \qquad (1)$$

where r is a user-defined laziness ratio, i.e. the allowed percentage of activities to
which an actor may fail to react. When the left-hand side of Eq. (1) is 0, i.e., $|A_j| = \sum_{i=1}^{n} b_k(a_i)$, actor k is
called a ubiquitous actor since he attends all activities. This statement makes sense
provided each time window refers to at least one activity. A passive actor is then
one for which the above ratio is greater than r.
An illustrative example is given in Table 1. When r is set to 40 %, Eq. (1) shows
that Researchers 2, 3, 7 and 9 are the active nodes (Researchers 2 and 3 miss exactly
40 % of the activities and so sit on the threshold). However, if r is equal to 50 %, all
researchers are active actors except Researcher 4, who takes part in fewer than 50 % of
the activities.
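The laziness ratio of Eq. (1) can be checked directly against Table 1. The sketch below is only an illustration: the names `TABLE_1`, `laziness_ratio` and `active_actors` are hypothetical, and exact fractions are used so that actors sitting exactly on the threshold are not affected by floating-point rounding. It assumes the non-strict reading of Eq. (1), under which an actor whose ratio equals r still counts as active.

```python
from fractions import Fraction

# Participation rows b_k(a_1..a_10) copied from Table 1.
TABLE_1 = {
    1: [0, 1, 1, 0, 1, 0, 0, 1, 0, 1],
    2: [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    3: [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    4: [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    5: [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    6: [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    7: [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    8: [1, 0, 0, 1, 0, 0, 1, 1, 1, 0],
    9: [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
}

def laziness_ratio(row):
    """Left-hand side of Eq. (1): share of activities the actor missed."""
    return Fraction(len(row) - sum(row), len(row))

def active_actors(table, r):
    """Actors whose laziness ratio does not exceed r (non-strict inequality)."""
    return sorted(k for k, row in table.items() if laziness_ratio(row) <= r)

active_40 = active_actors(TABLE_1, Fraction(40, 100))
active_50 = active_actors(TABLE_1, Fraction(50, 100))
```

Researchers 7 and 9 miss 30 % of the activities and Researchers 2 and 3 miss exactly 40 %, so the outcome at r = 40 % depends on whether the boundary is included; at r = 50 % only Researcher 4 (60 % missed) is excluded.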
Given the whole set of time windows $\Omega = \{\omega_1, \ldots, \omega_q\}$, an actor is active in that
period if Eq. (1) is true for at least the proportion R of the number of windows in $\Omega$.
Hence, if m is the number of time windows within which an actor k is active, k is
permanently active when $m / |\Omega| \ge R$. The set of active nodes in $\Omega$ is named AN.
3.2 Algorithm
Algorithm 1: Active Actor Discovery

Input : C: a set of actors/nodes in the network,
        Ω = {ω_1, ..., ω_j, ..., ω_q}: a set of windows with their corresponding sets of
        activities A_j
        b(I, J): participation matrix of actors to activities for each A_j
        r and R: two thresholds
Output: AN: the set of active nodes of C in Ω
 1 begin
 2   AN = {} /* initialize the set of active actors */
 3   foreach k ∈ C do
 4     m = 0
 5     foreach ω_j ∈ Ω do
 6       partNumber = 0
 7       foreach a_i ∈ A_j do
 8         partNumber = partNumber + b(k, a_i)
 9       end
10       if (|A_j| − partNumber) / |A_j| ≤ r then
11         m = m + 1
12       end
13     end
14     if m / |Ω| ≥ R then
15       add(AN, k)
16     end
17   end
18   return AN
19 end
Given a network, Algorithm 1 computes the set of active actors for a set of
time windows. The input covers the nodes of the network, the set $\Omega$ as well as the
matrices related to the participation of actors in activities in the windows within
$\Omega$. For each actor and each time window, the algorithm computes the laziness ratio
(Lines 6–10). After processing all time windows in $\Omega$, it checks whether the proportion
of windows in which a given actor is active reaches at least the threshold R (Line 14). Function
add(AN, k) adds the node k to the set of active actors AN.
For a fixed set of windows and activities, the complexity of the algorithm is
proportional to the cardinality of the set C of nodes in the network; in general it is
$O(|C| \sum_{j} |A_j|)$.
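Algorithm 1 translates almost line by line into executable form. The following is a minimal sketch under stated assumptions rather than a reference implementation: the function name `active_actor_discovery` and the input layout (one dictionary per window, mapping each actor to its list of 0/1 participation flags) are choices made here for illustration, and every window is assumed to contain at least one activity.

```python
from fractions import Fraction

def active_actor_discovery(actors, windows, r, R):
    """Return the set of actors that are active in at least a share R of windows.

    windows: list of dicts; windows[j][k] is the list of flags b(k, a_i)
             for the activities a_i of window j.
    r, R:    thresholds in [0, 1]; Fractions avoid rounding at the boundary.
    """
    active = set()
    for k in actors:
        m = 0  # number of windows in which k is active
        for participation in windows:
            flags = participation[k]
            part_number = sum(flags)
            # Integer form of (|A_j| - part_number) / |A_j| <= r
            if len(flags) - part_number <= r * len(flags):
                m += 1
        # Integer form of m / |windows| >= R
        if m >= R * len(windows):
            active.add(k)
    return active

# Hypothetical data: three actors observed over two windows.
windows = [
    {"u": [1, 1, 0], "v": [0, 0, 1], "w": [1, 1, 1]},
    {"u": [1, 0],    "v": [0, 0],    "w": [1, 1]},
]
half = active_actor_discovery("uvw", windows, r=Fraction(1, 3), R=Fraction(1, 2))
always = active_actor_discovery("uvw", windows, r=Fraction(1, 3), R=Fraction(1))
```

Both comparisons are written without division, which keeps the test exact and sidesteps a division by zero should a window carry no activity. The nested loops make the cost proportional to the number of actors times the total number of activities.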
3.3 Applications
The identification of active and passive nodes in a network has positive effects in
many real-life applications. In the following we provide two possible uses of
our approach:
• Churn detection. After a sequence of time windows, our method identifies
inactive nodes that may be considered as churners [10]. Churn detection is fruitful for most service-based companies like telecommunication, banking and social
network services that may see their profitability decreased with the loss of customers. This is also useful to predict employee attrition based on the decrease of
employees' participation in social or professional activities within an organization. Therefore, predicting or detecting customer or employee attrition at an early
stage gives more flexibility to companies to apply appropriate incentives to keep
customers or employees in their business.
• Targeted marketing/advertising. Detecting active actors can be applied to identify
and/or rank actors who may react positively to a social or professional invitation
or a product/service advertisement. In such a framework, only active actors will be
targeted since they exhibit an important participation in past activities and could
be future attendees or customers of the promoted event, product or service.
4 Tracking Community Evolution Using Probability Theory
We recall that our goal is to build a method to identify perspective communities
in a social network. This can be achieved by means of probability distributions as
described in the following section. Only active actors contribute to the composition
of perspective communities. We recall that active actors are those who participate
in activities that occur within several time slots. Once active actors are identified,
links between them are added using the same principle as in [2, 12, 13, 20],
where a tie is added between two nodes based on their co-occurrences on several
Web pages or documents. In our case, we add a link between two actors based on
their participation in common activities.
4.1 Estimating Node Relationship Using Probabilities
Most of the online systems created in recent years like Facebook and MySpace offer a
rich set of activities and facilities for extensive interactions [4]. These systems record
both activities and interactions, thereby enabling the construction of a social network
after a single sequence of activities. However, our goal is not to find perspective communities after each activity but after a set of activities that happen within a collection
of time windows. The main reason is that two nodes may interact during activities in
a selected window and never again in subsequent windows. Hence, using only
data from one window of activities is not enough to estimate the intensity of the link
between two nodes. Therefore, we consider a universe $\Omega = \{\omega_1, \ldots, \omega_j, \ldots, \omega_q\}$
of time windows. For each couple of actors (k, l), we consider the parameter vector
$p_{k,l} = (p^1_{k,l}, p^2_{k,l}, \ldots, p^q_{k,l})$ that characterizes the probability distribution of a random variable X (the relation or link between actors) on the set $\Omega$. The parameter
$p^j_{k,l}$ is the probability that actor k is linked to actor l during the n activities found in
window $\omega_j$:

$$p^j_{k,l} = \frac{\sum_{i=1}^{n} M^j_{k,l}(a_i)}{\min\left(\sum_{i=1}^{n} b_k(a_i),\ \sum_{i=1}^{n} b_l(a_i)\right)} \qquad (2)$$

where $M^j_{k,l}(a_i)$ is the Meeting function that indicates if both k and l have attended
activity $a_i$ in window $\omega_j$ and have interacted. It corresponds formally to:

$$M^j_{k,l}(a_i) = \begin{cases} 1 & \text{if } b_k(a_i) = b_l(a_i) = 1 \\ 0 & \text{otherwise.} \end{cases}$$

In other words, $p^j_{k,l}$ is the overlap coefficient while $\sum_{i=1}^{n} M^j_{k,l}(a_i)$ corresponds to
the matching coefficient. We use the overlap coefficient because it is shown in [20]
that it is better suited to social network analysis than the matching and Jaccard
coefficients.
The intensity of the relation between two nodes can be set as their total
co-occurrences in the whole set of windows. This value is represented by the
parameter vector $p_{k,l}$. A heuristic method consists in applying a threshold vector
$c = (c_1, c_2, \ldots, c_q)$ to $p_{k,l}$ to decide if a link can be added between k and l after
observing activities in the set of windows. In fact, a link is added between k and l if
$p^i_{k,l} \ge c_i$ for every time window $\omega_i \in \Omega$.
Finally, the perspective communities based on a set of activities are identified as
follows:
• run Algorithm 1 to compute the set AN of all active actors;
• for each couple of actors k and l in AN, add a link between k and l whenever the
computed value $p^i_{k,l}$ is at least equal to the user-defined threshold $c_i$ for all the
time windows.
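The overlap coefficient of Eq. (2) and the per-window threshold test can be sketched as follows. The function names `overlap` and `linked` are hypothetical, the rows are taken from Table 1 purely as sample input, and returning 0 when one actor attended nothing is an assumption made here to avoid a division by zero; the text does not specify that case.

```python
from fractions import Fraction

def overlap(b_k, b_l):
    """Eq. (2) for one window: shared activities over the smaller attendance."""
    shared = sum(1 for x, y in zip(b_k, b_l) if x == 1 and y == 1)
    smaller = min(sum(b_k), sum(b_l))
    if smaller == 0:
        return Fraction(0)  # assumption: no attendance, no evidence of a link
    return Fraction(shared, smaller)

def linked(rows_k, rows_l, thresholds):
    """True if the overlap reaches the threshold in every window.

    rows_k[j], rows_l[j]: participation rows of the two actors in window j.
    thresholds[j]:        threshold applied to window j.
    """
    return all(overlap(bk, bl) >= c
               for bk, bl, c in zip(rows_k, rows_l, thresholds))

# Rows of Researchers 1, 2 and 5 from Table 1 (a single window).
r1 = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
r2 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 1]
r5 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
p_12 = overlap(r1, r2)   # 4 shared activities out of min(5, 6)
p_25 = overlap(r2, r5)   # 2 shared activities out of min(6, 5)
```

Requiring the threshold to hold in every window is a conjunctive rule: a single window with little shared activity is enough to withhold the link, which is consistent with the concern that one window alone says little about the strength of a tie.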
Fig. 1 Initial network
4.2 Example
To illustrate our approach we consider the collaboration network of researchers
described in Sect. 2. Basically, we assume that the network is drawn based on
co-authorship patterns. The initial network is depicted in Fig. 1.
For the sake of clarity, we assume that $\Omega$ contains only the time window $\omega_j$.
As a consequence, the threshold vector c is reduced to the single value $c_1$. With
this in mind, we set two distinct values of the threshold c: 40 and 60 %, and we
draw the resulting networks. Figures 2 and 3 depict the perspective networks when
c = (40 %) and c = (60 %) respectively. With the value 40 %, the perspective
community is dense since new links are added even when two actors share a low
number of activities. That is the reason we have more links in Fig. 2 than in Fig. 1
which represents the initial network. Furthermore, with a low value of c, a real
closeness of two actors is not guaranteed. However, when c = (60 %), links are
added only between actors whose overlap coefficient is at least 60 %, i.e. who share
at least 60 % of the activities attended by the less active of the two. This leads to
more cohesive groups that share a common behavior. Figure 3 highlights two distinct
groups formed based on the intensity of temporary links established between actors.
Moreover, one may observe in Table 1 that nodes in the group {4, 5, 6, 8} shown in
Fig. 3 have a lower participation rate than those in the group {1, 2, 3, 7, 9}. If such
a behavior is observed (or reinforced) over subsequent time windows (or over a
long period of time), attrition of the corresponding group may be expected. We
recall that perspective communities depict only temporary interactions (e.g., who
co-participates with whom), and are different from more stable communities in the
initial network (e.g., co-authorship network). However, when mapped over the initial
network, perspective communities give additional insight about new cohesive groups
that arise from activity participation.
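The two groups discussed in this example can be recomputed from Table 1. The sketch below is illustrative only: the names `ATTENDED`, `perspective_links` and `components` are hypothetical, all nine researchers are kept (no filtering by Algorithm 1 is applied, which is an assumption made here), and the groups are taken to be the connected components of the thresholded network.

```python
from fractions import Fraction
from itertools import combinations

# Activity indices attended by each researcher, read off Table 1.
ATTENDED = {
    1: {2, 3, 5, 8, 10},
    2: {2, 4, 5, 6, 8, 10},
    3: {2, 3, 5, 6, 8, 10},
    4: {1, 4, 7, 9},
    5: {1, 2, 4, 7, 9},
    6: {1, 3, 4, 7, 9},
    7: {1, 2, 3, 5, 6, 8, 10},
    8: {1, 4, 7, 8, 9},
    9: {1, 2, 3, 5, 6, 8, 10},
}

def perspective_links(attended, c):
    """Pairs whose overlap coefficient is at least the threshold c."""
    links = set()
    for k, l in combinations(sorted(attended), 2):
        shared = len(attended[k] & attended[l])
        smaller = min(len(attended[k]), len(attended[l]))
        if Fraction(shared, smaller) >= c:
            links.add((k, l))
    return links

def components(nodes, links):
    """Connected components of the undirected graph, via depth-first search."""
    neighbours = {n: set() for n in nodes}
    for k, l in links:
        neighbours[k].add(l)
        neighbours[l].add(k)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

links_40 = perspective_links(ATTENDED, Fraction(40, 100))
links_60 = perspective_links(ATTENDED, Fraction(60, 100))
groups_60 = components(ATTENDED, links_60)
groups_40 = components(ATTENDED, links_40)
```

At 60 % the network splits into the groups {1, 2, 3, 7, 9} and {4, 5, 6, 8}; lowering the threshold to 40 % adds links between the two groups and merges them into a single component.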
4.3 Discussion
The approach presented above relies heavily on thresholds. However, it
is not easy to find the right threshold values for each case. Generally,
a heuristic method is used to compute such values. Even though efficient methods
can be devised to set threshold values and identify perspective communities based
entirely on probabilities, we believe that it might be more useful and effective to
reinforce such techniques by appropriate considerations. Therefore, we propose to
combine both possibility and probability theory to improve the accuracy of perspective communities built from the activity data. The main reason is that a possibility
measure can be viewed as an upper bound on a probability measure.
Fig. 2 Perspective with c = (40 %)
Fig. 3 Perspective with c = (60 %)
5 Tracking Community Evolution Using Possibility Theory
5.1 Why Possibility Theory?
The modeling and management of uncertainty is one of the main issues in the
design process of complex decision systems. Due to the diversity of information
sources, uncertainty can take one of the following forms: randomness, incomplete-
ness, and inconsistency. In our framework, different kinds of uncertainty can be
found with respect to: (i) the quality of the selected activities, (ii) the selection
of the appropriate number of time windows, and (iii) the choice of the underlying
distribution of identified random variables such as links between nodes.
It is important to note that both possibility and probability theories can be used to
represent uncertainty [5]. However, they do not capture the same aspects of uncer-
tainty. In fact, the basic feature of probabilistic representations of uncertainty is
additivity. Uniform probability distributions may be used to model randomness on
finite sets. They are adapted for expressing total ignorance in belief modeling. As
a consequence, probability theory offers a quantitative model for randomness and
inconsistency while possibility theory offers a qualitative model of incompleteness.
In Sect. 4 where we propose a method for capturing community evolution using
probability theory, many important questions can be raised: (i) What is the most
appropriate number of time windows to consider during the process? (ii) How can
we quantify properly the possibility of having links between actors for each time
window? and finally (iii) How can we assign a relevance degree to each time window
when the importance of the underlying events is taken into account?
In the following, we consider these questions and rely on possibility theory to
identify perspective communities in a more accurate manner. For a thorough view
of possibility theory, we refer the reader to [5, 29, 30]. Initiated by Zadeh [29],
possibility theory is based on a principle which involves the supremum operation.
The supremum (sup) is the least upper bound of a subset S of a totally or partially
ordered set T. According to Dubois et al. [5], a possibility measure $\Pi$ on a set X is
characterized by a possibility distribution $\pi : X \to [0, 1]$, and is defined by:

$$\forall A \subseteq X, \quad \Pi(A) = \sup\{\pi(x),\ x \in A\}. \qquad (3)$$

The key concept of a possibility distribution is the preference ordering it establishes
on X. Basically, $\pi$ indicates what one knows about the value of a given variable Y,
and $\pi(x) > \pi(x')$ states that $Y = x$ is more plausible than $Y = x'$. If $\pi(x) = 0$,
then x is an impossible value of the variable Y while $\pi(x) = 1$ means that x is one
of the most plausible values of Y.
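On a finite set the supremum in Eq. (3) is simply a maximum, which makes the definition easy to experiment with. The sketch below uses hypothetical names (`possibility`, `dominates`) and made-up distributions over three values; it also checks the informal principle that what is probable should be possible, i.e. that the possibility measure bounds the probability of every event from above.

```python
from itertools import chain, combinations

def possibility(pi, event):
    """Pi(A) = max of pi(x) over x in A; Pi of the empty event is 0."""
    return max((pi[x] for x in event), default=0.0)

def dominates(pi, p, tol=1e-12):
    """True if P(A) <= Pi(A) for every non-empty event A of the finite set."""
    xs = list(pi)
    events = chain.from_iterable(
        combinations(xs, size) for size in range(1, len(xs) + 1))
    return all(sum(p[x] for x in e) <= possibility(pi, e) + tol for e in events)

# Hypothetical possibility and probability distributions (not from the text).
pi = {"low": 0.2, "medium": 1.0, "high": 0.6}
p = {"low": 0.1, "medium": 0.6, "high": 0.3}

poss_low_or_high = possibility(pi, {"low", "high"})
consistent = dominates(pi, p)
```

Because the measure of a union is the maximum of the measures of its parts, a possibility measure is not additive, in contrast with a probability measure. Enumerating all events is exponential in the size of the set and is only meant for small illustrations.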
Since possibility and probability theories aim at representing different kinds of
uncertainty, it is often desirable to move from one framework to the other, for
example to integrate heterogeneous data. A related complex problem is to build
a possibility distribution from empirical data. In this work we assume that the
discrete data associated with time windows are generated from an unknown probability
distribution. The possibility degrees are inferred afterwards from
the probability distribution. More details about the probability-possibility mapping
are given in the appendices.
5.2 Perspective Community Detection Using Possibility Theory

In this section, we show how to use properly a set of possibility distribution measures
to find perspective communities within the network.
5.2.1 Setting the Relevance of Activities
As we mentioned earlier, the importance of activities in a time window can have an
impact on the participation of actors in them and on their interaction with others. The
main benefit of introducing relevance is to prune time windows with less important
activities. Therefore, characterizing the importance of a class of activities by a
measure of possibility helps set a threshold with better accuracy and draw more
effective links between nodes. To that end, the activities of a time window are considered
random variables whose probability distribution is unknown a priori.
Let $n_k$ denote the number of activities of a time window $\omega_k$. Thus, the random
vector $n = (n_1, \dots, n_K)$ can be considered as following a multinomial distribution
with parameter $p = (p_1, p_2, \dots, p_K)$. A confidence region for p at level $1 - \alpha$
can be computed using simultaneous confidence intervals as described in [19]. Such a
confidence region can be perceived as a set of probability distributions.
We propose to characterize the probabilities $p = (p_1, p_2, \dots, p_K)$ of generating
the different activities by simultaneous confidence intervals with a given confidence
level $1 - \alpha$. Here, $p_k$ represents the probability of generating the activities within $\omega_k$.
From this imprecise specification, a procedure for constructing a possibility
distribution is described, ensuring that the resulting possibility distribution will dominate
the true probability distribution in at least $100(1 - \alpha)$ % of the cases.
We use a rigorous step-by-step procedure described in Appendix B to compute the
possibility distribution from p. This procedure gives a vector of possibility
distributions for the set of all activities, which we represent by $\pi^{ts} = (\pi_1, \pi_2, \dots, \pi_K)$. This
vector is used to decide whether a set of activities within a time window is relevant
or not. We refer the reader to Appendix B for further details on the transformation
process.
5.2.2 Finding Temporary Links
To predict a link, we need to find a threshold and apply it to the vector of possibility
distributions for each pair of actors as done in Sect. 4 with probability distributions.
Thereafter, we consider that two actors may be linked if they both interact during
several time windows. The number of time windows within which two actors interact
is named the major participation (MP). Hence, if the major participation of two actors
exceeds a threshold, then a link will be added between them.
To reach our goal, we define a new measure $\pi_{norm}$ that helps decide whether a link
can be added between two actors. $\pi_{norm}$ corresponds to the minimum of the possibility
distribution vector $\pi^{ts}$. The choice of the minimum value is straightforward and is
based on the worst case, i.e., the time window with the lowest number of interactions.
In other words, the worst case is the window with the least important activities and
the lowest degree of possibility. However, even though this worst window should
be the least possible one, some actors might keep a slight interest in participating
in its underlying activities. This window can then be used to establish the worst
scenario, where only very few actors are linked to others.
$$\pi_{norm} = \min(\pi^{ts}). \qquad (4)$$
The co-participation of two actors in the majority of time windows should at least
be greater than this minimum to decide whether a link may be drawn between them.
Thus, we can define the major participation of two actors as the percentage of their
co-participation over all windows of activities as follows:
$$MP(k, l) = \frac{\left|[(\pi_i \geq \pi_{norm}) == 1]\right|}{\left|[(\pi_i \geq \pi_{norm})]\right|}. \qquad (5)$$
In this formula, we recall that $\pi_i$ is the possibility distribution vector containing the
different values of the intensity of the relation for a given pair (k, l) of actors in
all windows. The numerator counts the windows in which the comparison holds and the
denominator counts all windows. We consider that a link exists between two actors if and
only if their major participation reaches a given user-defined threshold $\theta$ that reflects
the term "majority of time windows".
Finally, actor k is linked to actor l if the following condition holds:
$$MP(k, l) \geq \theta. \qquad (6)$$
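The decision rule of Eqs. (4) to (6) can be sketched in a few lines. The function names are illustrative assumptions. The possibility vector is the one reported for Actors 2 and 3 in Table 4, the value 0.2775 is the minimum of the possibility vector in Table 3, and 0.7 is the threshold used in Sect. 6.3.

```python
def major_participation(possibilities, pi_norm):
    """Share of time windows whose possibility degree reaches pi_norm (Eq. 5)."""
    flags = [1 if p >= pi_norm else 0 for p in possibilities]
    return sum(flags) / len(flags)


def is_linked(possibilities, pi_norm, theta):
    """Link rule of Eq. (6): MP must reach the user-defined threshold theta."""
    return major_participation(possibilities, pi_norm) >= theta


# Possibility degrees of the pair (2, 3) over the ten windows (Table 4).
pi_23 = [0.71, 0.42, 0.51, 0.24, 0.61, 0.15, 0.32, 0.07, 0.85, 1.00]
pi_norm = 0.2775  # Eq. (4): minimum of the relevance vector (Table 3)

print(major_participation(pi_23, pi_norm))   # 0.7
print(is_linked(pi_23, pi_norm, theta=0.7))  # True
```

Seven of the ten degrees reach 0.2775, so MP(2, 3) = 7/10 and the pair is linked at a 70 % threshold.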
5.3 Algorithm
Algorithm 2 builds a set of perspective communities in two steps. The first step
computes the probability distributions while the second one identifies links between
actors after inferring possibility distributions.
The first step of the algorithm (Lines 4-8) finds probability distributions for each
pair of actors. To this end, we compute for each time window the co-participation
rate of the pair of actors k and l (Line 7). Afterwards, we add the result to a
multi-dimensional array by using Function addProb($p_{k,l}$, $p^j_{k,l}$) (Line 8). Once this task
is completed, we compute for each vector of probabilities the related possibility
distribution (Line 14). If the possibility degree for k and l reaches the threshold
$\pi_{norm}$ for a given time window, then we increment the number of times these actors
co-participate in activities (Line 19). Finally, the last part of the algorithm checks
whether the proportion of windows in which two nodes co-participate reaches a
threshold. If so, a link is added between them (Lines 25-26).
6 Model Evaluation
In this section we aim to validate our approach, mainly the possibility theory solution
in order to assess the accuracy of the perspective community identification.
Algorithm 2: Perspective Community Discovery

Input : C, a community; $\theta$, a threshold
        $\Omega = \{\omega_1, \dots, \omega_q\}$, a set of time windows
        AN, the set of active nodes of C in $\Omega$
Output : P, a perspective community
1 begin
2   /* FIRST STEP: Estimation of the intensity of nodes relation */
3   $p_{k,l}$ = [] /* Probability distributions vector of all co-participations */
4   foreach $\omega_j \in \Omega$ do
5     foreach $(k, l) \in AN$ do
6       foreach $a_i \in A_j$ do
7         $p^j_{k,l} = \frac{\sum_{i=1}^{n} M_{a_i}(k,l)}{\min\left(\sum_{i=1}^{n} b_k(a_i),\ \sum_{i=1}^{n} b_l(a_i)\right)}$
8         addProb($p_{k,l}$, $p^j_{k,l}$) /* Add $p^j_{k,l}$ to the vector $p_{k,l}$ */
9       end
10    end
11  end
12  /* SECOND STEP: identifying links between nodes */
13  foreach $(k, l) \in AN$ do
14    /* Possibility distribution for (k,l) pair. */
      $\pi_{k,l}$ = FFPD($p_{k,l}$) /* where FFPD corresponds to Eq. (10) */
15    majorPart = [$\pi_{k,l} \geq \pi_{norm}$]
16    nMajor = 0
17    foreach j = 1 : length(majorPart) do
18      if majorPart(j) == 1 then
19        nMajor = nMajor + 1
20      end
21    end
22    /* Major Participation */
23    MP(k, l) = nMajor/size(majorPart)
24    /* Apply threshold to decide to put a link */
25    if $MP(k, l) \geq \theta$ then
26      addLink(V, k, l)
27    end
28  end
29  return V
30 end
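The two steps of Algorithm 2 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: each time window is represented as a list of activities, each activity being the set of actors taking part in it; $M_{a_i}(k,l)$ is read as "both k and l take part in $a_i$" and $b_k(a_i)$ as "k takes part in $a_i$"; the per-window rates are normalised to sum to one before the transformation of Eq. (10); and $\pi_{norm}$ is passed in as a parameter. All function names are hypothetical.

```python
def co_participation_rate(window, k, l):
    """Line 7: shared activities over the smaller individual activity count."""
    both = sum(1 for actors in window if k in actors and l in actors)
    n_k = sum(1 for actors in window if k in actors)
    n_l = sum(1 for actors in window if l in actors)
    denom = min(n_k, n_l)
    return both / denom if denom else 0.0


def prob_to_poss(p):
    """Eq. (10): each degree sums the probabilities not larger than p_i."""
    return [sum(q for q in p if q <= p_i) for p_i in p]


def perspective_links(windows, active_nodes, pi_norm, theta):
    """Return the set of pairs (k, l) whose major participation reaches theta."""
    links = set()
    nodes = sorted(active_nodes)
    for a, k in enumerate(nodes):
        for l in nodes[a + 1:]:
            # First step: one co-participation rate per time window.
            rates = [co_participation_rate(w, k, l) for w in windows]
            total = sum(rates)
            if total == 0:
                continue  # the pair never co-participates
            probs = [r / total for r in rates]  # assumed normalisation
            # Second step: possibility degrees, then major participation.
            poss = prob_to_poss(probs)
            n_major = sum(1 for x in poss if x >= pi_norm)
            if n_major / len(poss) >= theta:
                links.add((k, l))
    return links


# Hypothetical data: three windows, three actors.
windows = [
    [{1, 2}, {1, 2, 3}, {3}],
    [{1, 2}, {2}],
    [{1, 2}, {3}],
]
print(perspective_links(windows, {1, 2, 3}, pi_norm=0.3, theta=0.7))
```

In this toy example Actors 1 and 2 co-participate in every window and are linked, whereas Actor 3 shares an activity with the others in the first window only.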
6.1 Test Data
To validate our approach, we rely on a data set made of a collection of 132,307
reddit.com submissions [14]. The data concern the vote for (and submission of) images.
For each image, re-submission is allowed and conducted by a given group (community)
considered as a single actor based on the semantics we associate with the behavior
of each group with respect to an image. In fact, even if the members of a group submit
different votes, only the overall score representing the opinion of the whole group is
considered in the present study. Thus, a group with a score higher than a threshold
is set as participating in an image (re)-submission, and this participation is interpreted
as the response of one actor in our context. Actors comment and give scores to images.

Table 2 Number of actors who participate in activities in a set of time windows

| Scenario | Length of TWs | Number of TWs | Number of actors |
|---|---|---|---|
| A | 500 | 200 | 5 |
| B | 550 | 200 | 5 |
| C | 550 | 10 | 10 |
| D | 1,000 | 10 | 16 |
| E | 1,000 | 100 | 10 |
| F | 5,000 | 20 | 45 |
| G | 10,000 | 10 | 79 |
| H | 20,000 | 5 | 196 |

In this data set, the notion of activity is related to the submission or re-submission
of an image. This is similar to what one can get from Facebook regarding the
reaction of users to posts/tags. Moreover, since re-submission is allowed, this data set
is particularly interesting because it clearly emphasizes the importance that some
activities of an actor have relative to others. It is worth noting that the importance
of activities is a predominant factor that might considerably influence community
creation.
By mining the huge number of activities within the data set, one can find potential
links between actors. The original data set is described by thirteen features among
which we only keep the #image id that represents the activity identification in our
context, and the #subreddit that identifies an actor. It is worth noting that we use
other features to indicate whether an actor participates in an activity or not.
Our aim is then to analyze the activities of actors to identify links between them.
To evaluate our approach, we perform an empirical study with different scenarios.
For each scenario, we consider a set of ten (10) time windows (TWs) for which we
look at the activities performed by a set of actors. The lengths of the time windows
differ, and for each scenario we retrieve the number of image (re)-submissions
that actors perform during a set of time windows. Table 2 summarizes this data
partitioning. For example, in scenario C, there are ten TWs of length 550 and ten
actors who participate in activities. Even though we perform experiments for all
scenarios, only results for scenarios C, D, E, and F are shown in Table 2. Scenarios
A and B have very few communities while scenarios G and H have a
huge number of communities, which makes the resulting perspective
communities difficult to draw in this paper.
For a given scenario, we build an $M \times N$ matrix where M represents actors and N
activities ((re)-submissions of images). When an actor $M_i$ participates in an activity
$N_j$, the corresponding cell in the matrix contains the value 1.
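A minimal sketch of how such a binary participation matrix could be assembled from (actor, activity) records is given below. The function name, the record format and the sample identifiers are assumptions made for illustration; they are not taken from the data set.

```python
def participation_matrix(actors, activities, events):
    """Build the actors x activities matrix; a cell is 1 when the actor took part."""
    row = {actor: i for i, actor in enumerate(actors)}
    col = {activity: j for j, activity in enumerate(activities)}
    matrix = [[0] * len(activities) for _ in actors]
    for actor, activity in events:
        matrix[row[actor]][col[activity]] = 1
    return matrix


# Hypothetical records: (subreddit, image id) pairs.
actors = ["pics", "gifs"]
activities = ["img1", "img2", "img3"]
events = [("pics", "img1"), ("gifs", "img1"), ("gifs", "img3")]

for line in participation_matrix(actors, activities, events):
    print(line)
```

Cells that are never set stay at 0, which encodes the absence of participation.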
[Fig. 4: initial network for scenarios C and E]
[Fig. 5: initial network for scenario D]
6.2 Building the Initial Network
In order to successfully apply the proposed procedure, one can begin with an initial
network with linked actors. To that end, we run a simple algorithm based on
probabilities, which relies on the fact that two actors are linked if their total number of
co-occurrences exceeds a predefined threshold. For each pair of actors, we calculate
the probability of their co-participation within each window. If this probability
exceeds a given threshold, then the actors are linked within the given time window.
Moreover, we set a tie between two actors if they are linked in at least
half of the time windows, i.e., R = 0.5.
The initial networks are shown in Figs. 4 and 5. We detect 17 links
for both scenarios C and E and 34 links for scenario D.
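The construction of the initial network described above can be sketched as follows. The function name, the dictionary layout and the numerical values are hypothetical; only the rule itself (a per-window probability threshold, then a tie when at least half of the windows are linked) comes from the text.

```python
def initial_network(co_prob, prob_threshold, ratio=0.5):
    """co_prob maps a pair (k, l) to its per-window co-participation probabilities."""
    links = set()
    for pair, probs in co_prob.items():
        # Windows in which the pair is considered linked.
        linked_windows = sum(1 for p in probs if p > prob_threshold)
        # Tie if the pair is linked in at least `ratio` of the windows.
        if linked_windows >= ratio * len(probs):
            links.add(pair)
    return links


# Hypothetical probabilities over four time windows.
co_prob = {
    (1, 2): [0.4, 0.1, 0.5, 0.6],
    (1, 3): [0.1, 0.0, 0.5, 0.2],
}
print(initial_network(co_prob, prob_threshold=0.3))
```

Here the pair (1, 2) is linked in three of the four windows and receives a tie, while (1, 3) is linked in one window only.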
6.3 Validation
In the following we validate our approach on the initial networks shown in Figs. 4
and 5 and discuss the output of our procedure about perspective community detection.
As an illustration, we consider the initial network of scenario C (see Fig. 4) and
we build the entire procedure on the universe $\Omega = \{\omega_1, \omega_2, \omega_3, \dots, \omega_{10}\}$ with
ten time windows and ten actors who participate in the various activities within
each TW. The number of activities for the TWs is given by the following vector:
(1254, 1277, 1363, 1460, 1460, 1490, 1497, 1497, 1497, 1497). The actors are named
respectively {funny, GifSound, pics, gifs, atheism, gaming, WTF, aww,
reddit.com, 6}, but for simplicity we use numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 to
identify them respectively. The execution of Algorithm 1 to discover active actors
gives the vector $AN(\Omega) = \{2, 3, 4, 5, 6, 7, 8, 9, 10\}$. This result is obtained when
we manually set the threshold to the value R = 2/3. Only Actor 1 is considered
non-active since it does not have a sufficient number of participations within the
considered TWs. In Fig. 6, we show the number of actors' activities inside the first
TW. Similar results are obtained for Scenarios E and F. For each time window, Table 3
shows the values of $p_i^-$ and $p_i^+$ (see Appendix B for more details) and indicates that
the value of the threshold $\pi_{norm}$ is equal to 0.2775 (i.e., the lowest value of $\pi_i$).
Fig. 6 Number of appearances of the nine active actors in the first time window of scenario C
(one panel per actor, Actor 2 to Actor 10; x-axis: number of activities; y-axis: number of votes)
To identify new links between nodes, we set the threshold $\theta$ to 70 %. In Table 4,
after running Algorithm 2, we show the possibility degrees that a
possible link may hold between a pair of actors.
For the sake of clarity, we report values only for five pairs of actors. We observe the
possibility values between Actors 2 and 3 and notice that seven (7) of the ten (10)
values of the vector are greater than or equal to the threshold $\pi_{norm} = 0.2775$.
Thus, MP(2, 3) = 7/10, i.e., $MP(2, 3) \geq \theta$, and consequently a link between
Actors 2 and 3 is added. There is also a link between 2 and 4 and between 7 and 9
because $MP(2, 4) = 7/10 \geq \theta$. Conversely, one can see that only five values of the
possibility vector for Actors 2 and 5 are greater than $\pi_{norm}$, i.e., MP(2, 5) = 5/10,
i.e., $MP(2, 5) < \theta$ and, thus, there is no link between these two nodes. There is no
link between Actors 7 and 8 because their MP(7, 8) = 6/10 is less than the value of
the threshold $\theta$.

Table 3 Interval-valued probabilities, possibility distributions, and length of each time window for scenario C and $\alpha$ = 5 %

| Time window i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $p_i^-$ | 0.0832 | 0.0848 | 0.0907 | 0.0973 | 0.0973 | 0.0993 | 0.0998 | 0.0998 | 0.0998 | 0.0998 |
| $p_i^+$ | 0.0925 | 0.0941 | 0.1003 | 0.1072 | 0.1072 | 0.1094 | 0.1099 | 0.1099 | 0.1099 | 0.1099 |
| $\pi_i^S$ | 0.2775 | 0.2808 | 0.7884 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Length of window i | 1254 | 1277 | 1363 | 1460 | 1460 | 1490 | 1497 | 1497 | 1497 | 1497 |

Table 4 Probabilities vectors of links between nodes and corresponding possibility distributions
($\alpha$ = 0.05, i.e., to set confidence bounds at 95 %). Case of scenario C

| Pair of nodes | Vector | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 and 3 | Probabilities | 0.10 | 0.09 | 0.09 | 0.09 | 0.10 | 0.08 | 0.09 | 0.07 | 0.14 | 0.15 |
| 2 and 3 | Possibilities | 0.71 | 0.42 | 0.51 | 0.24 | 0.61 | 0.15 | 0.32 | 0.07 | 0.85 | 1.00 |
| 2 and 4 or 7 and 9 | Probabilities | 0.10 | 0.10 | 0.09 | 0.09 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.11 |
| 2 and 4 or 7 and 9 | Possibilities | 0.49 | 0.79 | 0.09 | 0.19 | 0.59 | 0.29 | 0.89 | 0.69 | 0.39 | 1.00 |
| 2 and 4 or 7 and 9 | Probabilities | 0.08 | 0.09 | 0.10 | 0.09 | 0.11 | 0.08 | 0.06 | 0.06 | 0.14 | 0.18 |
| 2 and 4 or 7 and 9 | Possibilities | 0.28 | 0.47 | 0.57 | 0.38 | 0.68 | 0.20 | 0.12 | 0.06 | 0.82 | 1.00 |
| 2 and 5 | Probabilities | 0.06 | 0.07 | 0.05 | 0.03 | 0.07 | 0.04 | 0.03 | 0.06 | 0.33 | 0.25 |
| 2 and 5 | Possibilities | 0.28 | 0.42 | 0.16 | 0.06 | 0.34 | 0.11 | 0.03 | 0.22 | 1.00 | 0.67 |
| 7 and 8 | Probabilities | 0.07 | 0.07 | 0.10 | 0.06 | 0.05 | 0.06 | 0.04 | 0.08 | 0.25 | 0.23 |
| 7 and 8 | Possibilities | 0.28 | 0.35 | 0.52 | 0.15 | 0.09 | 0.21 | 0.04 | 0.43 | 1.00 | 0.75 |

After computing MP for each pair of nodes, we get the perspective communities
shown in Fig. 7, where dashed lines represent newly added links. After running our
procedure, links are added to the initial networks shown in Figs. 4 and 5. Such new
links help identify perspective communities. In the top left part of Fig. 7, built for
scenario C, we observe that Nodes 2, 3, 5 and 7 form a community even though the
other actors are also active. Other detected communities are {2, 3, 4, 7}, {2, 3, 4, 8}
and {2, 3, 9, 7}. The same reasoning applies to the top right graph for Scenario
E. In the bottom graph, one can see that Actors 3 and 9 have the largest
number of links with other nodes. These cases are interesting in the sense
that one can focus on perspective communities and leading nodes to take appropriate
real-life decisions about their underlying activities and evolution.
Fig. 7 Perspective community evolution. The top left graph represents scenario C while the top
right one is for scenario E and the bottom graph is for scenario D
6.4 Parameter Tuning
An interesting issue in the validation process is to analyze the impact of varying
the confidence bound $1 - \alpha$ on the output. This parameter can improve
the reliability and efficiency of our procedure since its value can guarantee that the
resulting possibility distribution will dominate the true probability distribution and
hence lead to more reliable results. In Fig. 8 one can see the effect of the confidence
level variation on the number of detected links. This result is not surprising, but it was
not clear beforehand that tuning this parameter makes the algorithm behave as a stationary
process in which the number of detected links does not increase beyond a certain value.
Another result is related to the variation of the link detection threshold, i.e., the
variable $\theta$. We set this threshold to 70 %, but Fig. 9 shows how the number of detected
links depends on the value of $\theta$. As $\theta$ increases, the number of links decreases.
Since the possibility measure lies between 0 and 1, increasing the detection threshold
naturally reduces the number of detected links.
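The effect of the detection threshold can be illustrated with a short sweep. The major participation values below are hypothetical and the function name is an assumption; the sketch only shows why the number of links cannot increase when the threshold grows.

```python
def count_links(mp_values, theta):
    """Number of pairs whose major participation reaches the threshold theta."""
    return sum(1 for mp in mp_values.values() if mp >= theta)


# Hypothetical major participation values for four pairs of actors.
mp = {("a", "b"): 0.7, ("a", "c"): 0.8, ("b", "c"): 0.5, ("c", "d"): 0.6}

# Sweep the threshold from 0 to 1 in steps of 0.1.
counts = [count_links(mp, t / 10) for t in range(11)]
print(counts)  # [4, 4, 4, 4, 4, 4, 3, 2, 1, 0, 0]
```

Every pair kept at a given threshold is also kept at any lower one, so the sequence of counts is non-increasing.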
7 Conclusion and Future Work
In this paper, we present the premises of a new approach based on user activities over
time to detect community evolution within a social network. We first report snapshots
of the network at different time periods and then we analyze the underlying social
network in order to identify active actors and perspective communities. In fact, nodes
that have a high rate of participation are called the active ones and are considered as
nodes of the perspective communities formed from those nodes and their interactions.
Fig. 8 Tradeoff between the value of alpha and the number of detected links when the threshold
is set to 70 % (x-axis: variation of the confidence bounds (1 - alpha), from 0 to 1; y-axis: number
of links detected; curves: scenarios C, D, E and F)
Fig. 9 Tradeoff between the detection threshold and the number of links detected for the set of 10
TWs for scenario C (x-axis: variation of the detection threshold, from 0 to 1; y-axis: number of
links detected; curves: scenarios C, D, E and F)
Our approach can be useful to show central actors. It can also highlight how
using perspective communities defined over time may increase the information
flow circulation. In fact, besides tracking the evolution of the network, our approach
gives a basic way to figure out churners. Churn detection at an early
stage is very fruitful since it gives more flexibility to companies to apply appropriate
incentives to keep their customers. Furthermore, mapping perspective communities to
an (initial or important) network adds new links that improve the network accessibility
and hence, the information flow circulation. These benefits, combined with the low
complexity of our algorithms, let us argue that our approach is promising.
We plan to carry out a set of new experiments to assess the performance of the
proposed approach and its accuracy regarding churn detection and social influence
identification. Presently, we assume that all activities have the same importance.
However, work is ongoing to differentiate activities within a window.
Furthermore, we plan to provide a way to estimate a reasonable size for time windows,
and to study the correlation between users' interactions in several time windows.
Finally, we are collecting data from various networks in order to find perspective
communities that emerge from the superposition of several networks.
Acknowledgments The third author acknowledges the financial support of the Natural Sciences
and Engineering Research Council of Canada (NSERC).
Appendix A: Inferring Possibility Distribution from Probability Distribution
A consistency principle between probability and possibility can be stated in an
informal way [29]: "what is probable should be possible". This requirement can then
be translated via the inequality:
$$P(A) \leq \Pi(A) \quad \forall A \subseteq \Omega \qquad (7)$$
where P and $\Pi$ are, respectively, a probability and a possibility measure on the
domain $\Omega$. In this case, $\Pi$ is considered as dominating P.
Transforming a probability measure into a possibilistic one then amounts to choosing
a possibility measure among the possibility measures dominating P. This
should be done by adding a strong order preservation constraint, which ensures the
preservation of the shape of the distribution:
$$p_i < p_j \Leftrightarrow \pi_i < \pi_j \quad \forall i, j \in \{1, \dots, q\}, \qquad (8)$$
where $p_i = P(\{E_i\})$ and $\pi_i = \Pi(\{E_i\})$, $\forall i \in \{1, \dots, q\}$. It is possible to search
for the most specific possibility distribution verifying (7) and (8). The solution of
this problem exists, is unique and can be described as follows. One can define a strict
partial order $\mathcal{P}$ on $\Omega$ represented by a set of compatible linear extensions
$\Lambda(\mathcal{P}) = \{l_u, u = 1, \dots, L\}$. To each possible linear order $l_u$, one can associate a
permutation $\sigma_u$ of the set $\{1, \dots, q\}$ such that:
$$\sigma_u(i) < \sigma_u(j) \Leftrightarrow (\omega_{\sigma_u(i)}, \omega_{\sigma_u(j)}) \in l_u. \qquad (9)$$
The most specific possibility distribution, compatible with the probability distribution
$p = (p_1, p_2, \dots, p_q)$, can then be obtained by taking the maximum over all possible
permutations:
$$\pi_i = \max_{u = 1, \dots, L} \sum_{\{j \mid \sigma_u^{-1}(j) \leq \sigma_u^{-1}(i)\}} p_j \qquad (10)$$
The permutation $\sigma$ is a bijection and the reverse transformation $\sigma^{-1}$ gives the rank
of each $p_i$ in the list of the probabilities sorted in ascending order. The number
L of permutations depends on the duplicated $p_i$ in p. It is equal to 1 if there is no
duplicate $p_i$, $\forall i$, and in this case $\mathcal{P}$ is a strict linear order on $\Omega$.
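When the probabilities are known exactly, the maximum over linear extensions in Eq. (10) has a simple closed form: each degree is the sum of all probabilities that do not exceed the one considered, so tied probabilities receive the same degree. The following sketch assumes this reading; the function name and the sample vectors are illustrative.

```python
def prob_to_poss(p):
    """Possibility degrees of Eq. (10) for an exactly known distribution p."""
    # For index i, add up every p_j that is not larger than p_i. With ties,
    # this equals the maximum over all compatible linear extensions.
    return [sum(p_j for p_j in p if p_j <= p_i) for p_i in p]


p = [0.1, 0.2, 0.3, 0.4]
print([round(x, 4) for x in prob_to_poss(p)])  # [0.1, 0.3, 0.6, 1.0]

# Tied probabilities share the same possibility degree.
print(prob_to_poss([0.25, 0.25, 0.5]))  # [0.5, 0.5, 1.0]
```

The most probable element always receives degree 1, and the ordering of the degrees follows the ordering of the probabilities, as required by (8).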
Appendix B: Inferring Possibility Distribution for Classes of Activities
Let $n_k$ denote the number of observations (activities) of class k in a sample of size
N. Then, the random vector $n = (n_1, \dots, n_K)$ can be considered as following a multinomial
distribution with parameter $p = (p_1, p_2, \dots, p_K)$. A confidence region for p at
level $1 - \alpha$ can be computed using simultaneous confidence intervals as described in
[19]. Such a confidence region can be considered as a set of probability distributions.
It is proposed to characterize the probabilities $p = (p_1, p_2, \dots, p_K)$ of generating
the different classes by simultaneous confidence intervals with a given confidence
level $1 - \alpha$. Here, $p_k$ represents the probability of generating the class of events $A_k$.
From this imprecise specification, a procedure for constructing a possibility
distribution is described, ensuring that the resulting possibility distribution will dominate
the true probability distribution in at least $100(1 - \alpha)$ % of the cases.
Since the probabilities p of generating classes are unknown, we can build
confidence intervals for each one of them. In interval estimation, a scalar population
parameter is typically estimated as a range of possible values, namely a confidence
interval, with a given confidence level $1 - \alpha$.
To build confidence intervals for multinomial proportions, it is possible to find
simultaneous confidence intervals with a joint confidence level $1 - \alpha$. The method
attempts to find a confidence region $C_n$ in the parameter space
$\{p = (p_1, \dots, p_K) \in [0; 1]^K \mid \sum_{i=1}^{K} p_i = 1\}$ as the Cartesian product of K intervals
$[p_1^-, p_1^+] \times \dots \times [p_K^-, p_K^+]$
such that we can estimate the coverage probability with:
$$P(p \in C_n) \geq 1 - \alpha \qquad (11)$$
We can use the Goodman [9] formulation in a series of derivations to solve the
problem of building the simultaneous confidence intervals.
$$A = \chi^2(1 - \alpha/K, 1) + N \qquad (12)$$
where $\chi^2(1 - \alpha/K, 1)$ denotes the quantile of order $1 - \alpha/K$ of the chi-square
distribution with one degree of freedom, and $N = \sum_{i=1}^{K} n_i$ denotes the size of the
sample. We have also the following quantities:
$$B_i = \chi^2(1 - \alpha/K, 1) + 2 n_i, \qquad (13)$$
$$C_i = \frac{n_i^2}{N}, \qquad (14)$$
$$\Delta_i = B_i^2 - 4 A C_i. \qquad (15)$$
Finally, for each class of activities $A_i$ the bounds of the confidence intervals are
defined as follows:
$$[p_i^-, p_i^+] = \left[ \frac{B_i - \Delta_i^{1/2}}{2A},\ \frac{B_i + \Delta_i^{1/2}}{2A} \right]. \qquad (16)$$
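Equations (12) to (16) translate directly into code. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the chi-square quantile with one degree of freedom is obtained as the square of a standard normal quantile, which avoids any dependency outside the Python standard library. The counts used in the example are the window lengths of scenario C; no attempt is made here to reproduce the bounds reported in Table 3.

```python
from statistics import NormalDist


def chi2_quantile_1dof(q):
    """Quantile of order q of the chi-square law with one degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square with one degree of freedom,
    # so the quantile is the squared normal quantile of order (1 + q) / 2.
    return NormalDist().inv_cdf((1.0 + q) / 2.0) ** 2


def goodman_intervals(counts, alpha=0.05):
    """Simultaneous confidence intervals for multinomial proportions."""
    K = len(counts)
    N = sum(counts)
    chi2 = chi2_quantile_1dof(1.0 - alpha / K)
    A = chi2 + N                           # Eq. (12)
    intervals = []
    for n_i in counts:
        B = chi2 + 2 * n_i                 # Eq. (13)
        C = n_i ** 2 / N                   # Eq. (14)
        root = (B ** 2 - 4 * A * C) ** 0.5  # square root of Eq. (15)
        intervals.append(((B - root) / (2 * A), (B + root) / (2 * A)))  # Eq. (16)
    return intervals


counts = [1254, 1277, 1363, 1460, 1460, 1490, 1497, 1497, 1497, 1497]
for n_i, (low, high) in zip(counts, goodman_intervals(counts)):
    print(f"{n_i / sum(counts):.4f} in [{low:.4f}, {high:.4f}]")
```

Each interval contains the corresponding empirical proportion $n_i / N$, and its width shrinks as the sample size N grows.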
It is now possible, based on these interval-valued probabilities, to compute
the most specific possibility distribution of a class dominating any particular probability
measure. Let $\mathcal{P}$ denote the partial order induced by the intervals $[p_i] = [p_i^-, p_i^+]$:
$$(\omega_i, \omega_j) \in \mathcal{P} \Leftrightarrow p_i^+ < p_j^- \qquad (17)$$
This partial order may be represented by the set of its compatible linear extensions
$\Lambda(\mathcal{P}) = \{l_u, u = 1, \dots, L\}$, or equivalently, by the set of the corresponding permutations
$\{\sigma_u, u = 1, \dots, L\}$. Then, for each possible permutation $\sigma_u$ associated with each linear
order in $\Lambda(\mathcal{P})$, and each class $A_i$, we can solve the following linear program:
$$\pi_i^{\sigma_u} = \max_{p_1, \dots, p_K} \sum_{\{j \mid \sigma_u^{-1}(j) \leq \sigma_u^{-1}(i)\}} p_j \qquad (18)$$
under the constraints:
$$\begin{cases} \sum_{i=1}^{K} p_i = 1 \\ p_k^- \leq p_k \leq p_k^+ \quad \forall k \in \{1, \dots, K\} \\ p_{\sigma_u(1)} \leq p_{\sigma_u(2)} \leq \dots \leq p_{\sigma_u(K)} \end{cases} \qquad (19)$$
Finally, we can take the distribution of the class $A_k$ dominating all the
distributions $\pi^{\sigma_u}$:
$$\pi_i = \max_{u = 1, \dots, L} \pi_i^{\sigma_u} \quad \forall i \in \{1, \dots, K\} \qquad (20)$$
u=1,L
Complexity
The complexity of our computational procedure is related to the discovery of the
possibility degrees of the K classes. To solve this problem, the conceptually simplest
approach is to generate all the linear extensions compatible with the partial order
induced by the probability intervals, and then to solve the associated linear programs
(i.e., Eq. (18)). However, this approach is unfortunately limited to small values of K
(e.g., K < 10) due to the complexity of the algorithms generating linear extensions,
which is O(L), where L is the number of linear extensions. Even for moderate values of
K, L can be very large (K! in the worst case) and generating all the linear extensions
and solving the linear programs soon becomes intractable. A new formulation of the
solution can be derived to reduce the computations considerably. This formulation is
based on several steps. First, all the linear programs to be solved are grouped in
different subsets; then, an analytic expression for the best solution in each subset is
given; and lastly, it is shown that it is not necessary to evaluate the solution
for every subset. A simple computational algorithm can then be derived (see [19] for
more details). The actual complexity might be close to $O(|P_i|)$, where $P_i$
denotes the set of indices of the classes with a rank possibly, but not necessarily,
smaller than i.
References
1. Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social
networks: membership, growth, and evolution. In: Proceedings of the 12th ACM SIGKDD
international conference on knowledge discovery and data mining, KDD'06, pp 44-54
2. Bekkerman R, McCallum A (2005) Disambiguating web appearances of people in a social
network. In: WWW, pp 463-470
3. Bródka P, Saganowski S, Kazienko P (2012) GED: the method for group evolution discovery
in social networks. Soc Netw Anal Min 3(1):1-14
4. Crandall D, Cosley D, Huttenlocher D, Kleinberg J, Suri S (2008) Feedback effects between
similarity and social influence in online communities. In: Proceedings of the 14th ACM
SIGKDD international conference on knowledge discovery and data mining, KDD'08, ACM,
pp 160-168
5. Dubois D, Prade H, Sandri S (1991) On possibility/probability transformations. In: Proceedings
of the fourth international fuzzy systems association world congress (IFSA'91), Brussels,
Belgium, pp 50-53
Natl Acad Sci USA 99(12):7821-7826
8. Goldberg MK, Magdon-Ismail M, Thompson J (2012) Identifying long lived social communities
using structural properties. In: ASONAM, pp 647-653
9. Goodman LA (1965) On simultaneous confidence intervals for multinomial proportions.
Technometrics 7(2):247-254
10. Karnstedt M, Hennessy T, Chan J, Hayes C (2010) Churn in social networks: a discussion
boards case study. In: Proceedings of the 2010 IEEE second international conference on social
computing, SOCIALCOM'10. IEEE Computer Society, pp 233-240
11. Kashoob S, Caverlee J (2012) Temporal dynamics of communities in social bookmarking
systems. Soc Netw Anal Min 2(4):387-404
12. Kautz HA, Selman B, Shah MA (1997) The hidden web. AI Mag 18(2):27-36
13. Kautz HA, Selman B, Shah MA (1997) Referral web: combining social networks and
collaborative filtering. Commun ACM 40(3):63-65
14. Lakkaraju H, McAuley J, Leskovec J (2013) What's in a name? Understanding the interplay
between titles, content, and communities in social media. In: Seventh international AAAI
conference on weblogs and social media. AAAI Publications
15. Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical
community structure in complex networks. New J Phys 11(3):033015
16. Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking
diameters. ACM Trans Knowl Discov Data 1(1):2
17. Leskovec J, Huttenlocher D, Kleinberg J (2010) Predicting positive and negative links in online
social networks. In: Proceedings of the 19th international conference on world wide web,
WWW'10, pp 641-650
18. Marsden PV (2005) Recent developments in network measurement. In: Carrington PJ, Scott J,
Wasserman S (eds) Models and methods in social network analysis. Cambridge University
Press, New York, pp 8-30
19. Masson MH, Denoeux T (2006) Inferring a possibility distribution from empirical data. In:
Proceedings of fuzzy sets and systems, pp 319-340
20. Matsuo Y, Mori J, Hamasaki M, Ishida K, Nishimura T, Takeda H, Hasida K, Ishizuka M (2006)
Polyphonet: an advanced social network extraction system from the web. In: Proceedings of
the 15th international conference on world wide web, WWW'06. ACM, pp 397-406
21. Newman MEJ (2004) Detecting community structure in networks. Eur Phys J B-Condens
Matter Complex Syst 38(2):321-330
22. Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev
E 69(6):066133
23. Palla G, Barabási AL, Vicsek T (2007) Quantifying social group evolution. Nature 446:664-667
24. Scott J (1991) Social network analysis: a handbook. Sage, London
25. Sun Y, Han J (2012) Mining heterogeneous information networks: principles and
methodologies. Synthesis lectures on data mining and knowledge discovery. Morgan and Claypool
Publishers, San Rafael
26. Tantipathananandh C, Wolf TB, Kempe D (2007) A framework for community identification in
dynamic social networks. In: Proceedings of the 13th ACM SIGKDD international conference
on knowledge discovery and data mining, KDD'07. ACM, pp 717-726
27. Toivonen R, Kovanen L, Kivelä M, Onnela JP, Saramäki J, Kaski K (2009) A comparative study
of social network models: network evolution models and nodal attribute models. Soc Netw
31(4):240-254
28. Wei F, Qian W, Wang C, Zhou A (2009) Detecting overlapping community structures in
networks. World Wide Web 12:235-261
29. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. In: Fuzzy sets and systems,
pp 3-28
30. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338-353
Study of Influential Trends, Communities,
and Websites on the Post-election Events
of Iranian Presidential Election in Twitter
Seyed Amin Tabatabaei and Masoud Asadpour
Abstract The Iranian presidential election and its post-election events were the most engaging
topic of the year 2009 among Twitter users. In this paper, we study the social
network among the users who were engaged in that topic during an 18-month period
of observation. We analyze the content of tweets that were published in English or
Persian by Iranian people or others around the world and extract the most trending
topics on critical days. We also study the sub-communities.
Keywords Iranian election · Twitter · Social network analysis · Content analysis ·
Trend analysis
1 Introduction
Twitter, launched in 2006, offers a social networking and micro-blogging
service. It lets users send and receive short messages called tweets.
Tweets are text-based messages of up to 140 characters, which are visible on the
website or can be accessed through third-party applications. The rate of publication
on Twitter is more than one million messages per hour. At first, the idea was to
indicate personal status to friends. These days, however, it is used for various forms of
posts, from political news to other information, e.g., short phrases, URLs, and
direct messages to other users.
Especially before and during elections, the political atmosphere is clearly visible in
the tweets posted by many users. In addition, political meetings are arranged and
announced to supporters in the meantime.
The 10th Iranian presidential election was one of the most important political
events in Iran since the 1979 revolution. This election was held on 12 June 2009, with
incumbent Mahmoud Ahmadinejad running against three challengers:
S.A. Tabatabaei · M. Asadpour (corresponding author)
Social Networks Lab, School of Electrical and Computer Engineering,
University of Tehran, Tehran, Iran
e-mail: asadpour@ut.ac.ir
S.A. Tabatabaei
e-mail: tabatabaei@alumni.ut.ac.it

DOI 10.1007/978-3-319-12188-8_4
M. Mousavi: An Iranian reformist politician, artist and architect who served as the
last Prime Minister of Iran, from 1981 to 1989.
M. Karroubi: An influential Iranian reformist politician and democracy activist. He
was chairman of the parliament from 1989 to 1992 and from 2000 to 2004.
M. Rezaei: An Iranian politician, economist and former military commander.
Rezaei was the chief commander of the Iranian Revolutionary Guard Corps for 16 years
(1981–1997).
According to the official result, Ahmadinejad won the election with close to
two-thirds of the votes. However, Mousavi and the other candidates did not accept the results
and asked their supporters to hold peaceful demonstrations. Demonstrations were held
in the large cities of Iran. The situation on 13 June was described as the
biggest unrest since the 1979 revolution. Mousavi called for calm and asked his
supporters to refrain from acts of violence. However, the struggle between the security
forces and protesters turned violent after some days of unrest. The government
tried to suppress the demonstrations, and some opposition politicians were arrested.
The protesters used social networking and social media websites such as Facebook,
YouTube, and Twitter to organize their meetings and rallies. To control the situation,
the authorities took some Internet services down and blocked the Short Message
Service (SMS).
Meanwhile, Twitter postponed a scheduled upgrade by some hours so that people could
keep covering news on the Iranian election.¹ Facebook launched its support for the Persian
language earlier than scheduled.² Google released its Persian translator ahead of
schedule.³ The Iranian election was deemed the most engaging topic of the year: the
terms #iranelection, Iran, and Tehran were among the top trending topics of 2009
on Twitter.⁴
Here, we analyze the tweets that were published about the Iranian election from
3 months before the election to 15 months after it. We study the social network among
the users and analyze the content of the tweets.
The rest of this paper is organized as follows. In the next section, previous works
are reviewed. In Sect. 3, we explain our data collection method and look at the
dynamics of user registration on Twitter. In Sect. 4, we find the critical days of the
post-election events according to the number of tweets per day and analyze the trending
keywords. In Sect. 5, we study the most influential websites that were
cited in the tweets. In Sect. 6, we look at the social network among users and their
communities. Finally, Sect. 7 concludes and outlines future work.
1 Down Time Rescheduled. The official Twitter blog. [online] http://blog.twitter.com/2009/06/down-time-rescheduled.html.
2 Launching Facebook in Persian. The Official Facebook Blog. [online] http://www.facebook.com/blog.php?post=97122772130.
3 Google translates Persian. The official Google Blog. [online] http://googleblog.blogspot.com/2009/06/google-translates-persian.html.
4 Top Twitter Trends of 2009. The Official Twitter Blog. [online] http://blog.twitter.com/2009/12/top-twitter-trends-of-2009.html.
2 Previous Works
Various studies report the important role of social networking websites in
political events in different countries [2, 4, 10, 11].
Reference [1] measures the degree of interaction between 40 liberal and
conservative blogs over a period of two months, and their effects on the 2004 U.S.
election.
Reference [7], with the help of Persian native speakers, categorizes Persian
weblogs, finds the main poles, and studies the relationships among them.
Reference [12] introduces a new dataset of Persian blogs and analyzes the resulting network.
Several works focus specifically on Twitter. Reference [13] analyzes
more than 100,000 tweets mentioning parties or politicians prior to the 2009 German
federal election.
After the 2009 Iranian presidential election, more researchers were attracted to
events in Iran and to Persian social networks [9, 14]. In our previous work [8], we studied the
role of Twitter in that election and the events after it. In this paper, we
focus more on the content of tweets.
3 Data Collection
Our dataset consists of 1,375,510 tweets from 6,721 users, all of which contain the
#iranelection tag. They were published in a period from 3 months before the election to
15 months after it (18 months in total). The following information about users is
accessible on Twitter: id, name, number of followers, number of friends and account
creation date. The following information about tweets is also accessible: id, owner
user, body text and creation date.
Figure 1 shows the histogram of the number of tweets per user. More than two
thousand users (2,128) have just one tweet with the #iranelection tag, and 603 users
have two. There is also a user with 6,826 tweets. For readability, the horizontal
axis shows only the users who have written fewer than 500 tweets with #iranelection.
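A histogram such as the one in Fig. 1 can be derived from the "owner user" field of the tweets. The following is only a minimal sketch with invented user ids; the function name and data layout are assumptions and not part of the original study.

```python
from collections import Counter

def tweets_per_user_histogram(tweet_owners):
    """Map k -> number of users who posted exactly k tweets."""
    per_user = Counter(tweet_owners)       # user id -> number of tweets
    return Counter(per_user.values())      # number of tweets -> number of users

# Toy data (hypothetical user ids), one entry per tweet.
owners = ["u1", "u2", "u2", "u3", "u4", "u4", "u4"]
hist = tweets_per_user_histogram(owners)
for k in sorted(hist):
    print(k, hist[k])
```

Plotting the resulting pairs on logarithmic axes would give a figure of the same kind as Fig. 1.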
All tweets in our dataset are in either Persian or English. Based on that, we
categorize all users into two groups: (1) Persian natives (P-Users): users who have
published at least one tweet in Persian (4,634 users). A P-User may have written
tweets in both languages or only in Persian. (2) Foreign users (EN-Users): users
who have no Persian tweets and published all their tweets in English (6,722
users).
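The grouping rule above can be sketched in a few lines. The text does not say how Persian tweets were recognized, so the script test below (any character in the Unicode Arabic block U+0600 to U+06FF, which Persian shares with Arabic) is purely an assumption, as are the function names and the toy tweets.

```python
def is_persian(text):
    """Heuristic (assumption): any character from the Arabic script block counts as Persian."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)

def split_users(tweets):
    """tweets: iterable of (user_id, text). Returns (p_users, en_users) as sets."""
    all_users, p_users = set(), set()
    for user, text in tweets:
        all_users.add(user)
        if is_persian(text):
            p_users.add(user)        # at least one Persian tweet -> P-User
    return p_users, all_users - p_users

# Toy tweets (invented).
tweets = [("a", "Protest in Tehran #iranelection"),
          ("a", "ندا #iranelection"),
          ("b", "News from Iran #iranelection")]
p_users, en_users = split_users(tweets)
print(sorted(p_users), sorted(en_users))
```

User "a" tweets in both languages and is therefore a P-User, which matches the rule that a single Persian tweet is sufficient.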
3.1 Network Growth

We studied how the activity of users interested in the Iranian presidential
election evolved, in order to find out whether the election and the protests
influenced it. Figure 2 shows the number of users who joined Twitter in each
month before and after the election.
Fig. 1 Histogram of the number of tweets per user (log-log scale)
Fig. 2 Number of users who signed up to Twitter in each month. Most users signed up in March
(beginning of the new year in the Persian calendar), April, May and June (month of the election)
The figure clearly shows that most users signed up in March (the beginning
of the new year in the Persian calendar), April, May and June (the month of the election) 2009.
Since the Iranian election was held on June 12, 2009, some interested users signed
up for Twitter in the preceding months in order to diversify their sources of information. On
May 23, 2009, the Iranian government started filtering Twitter. That might be why the
number of new users in May is a bit smaller than in April, as newcomers did not
know how to use anti-filtering software. The number of new users reached its peak in
June and then declined until the following March, when it became negligible.
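Counts like those in Fig. 2 follow directly from the account creation dates. The snippet below is an illustrative sketch only; the dates are invented and the helper name is hypothetical.

```python
from collections import Counter
from datetime import date

def signups_per_month(creation_dates):
    """Count accounts by (year, month) of their creation date."""
    return Counter((d.year, d.month) for d in creation_dates)

# Toy creation dates (invented), one per user.
creation_dates = [date(2009, 3, 21), date(2009, 6, 13),
                  date(2009, 6, 15), date(2009, 7, 1)]
monthly = signups_per_month(creation_dates)
peak_month = max(monthly, key=monthly.get)   # month with the most sign-ups
for (year, month), n in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {n}")
```

Grouping by the full date instead of (year, month) would give the daily view used for June 2009.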
Note that the figure does not mean that Iranian users were no longer interested
in Twitter. Since we have focused only on the tweets about the presidential election and
post-election events, the figure only means that the #iranelection topic was no longer
attracting the Iranian community, not that Twitter as a whole had lost its appeal.
Fig. 3 Number of users who signed up in June 2009. Starting from the day of the election, users joined
Twitter at an accelerating rate
To find the reason behind joining Twitter from Iran, we took a closer
look at the daily rate of sign-ups. Figure 3 shows that, starting from the day of the
election, users joined Twitter at an accelerating rate until four days after the election.
The acceleration might be because text-messaging services were down on mobile
networks on election day. People therefore started using Twitter, along
with other social networking sites such as Facebook, to send news about the election to the
outside world. Micro-blogging services provided a fast way for protesters to share
their observations and information, and possibly to organize the next protests.
The largest peak of the diagram corresponds to the mass rally of protesters on June
15. After this day, the rate of new users dropped suddenly until June 19. On Friday, June
19, 2009, a weekend day in Iran, Ayatollah Khamenei (the supreme leader)
made a hardline speech at Friday prayers. On Saturday, June 20 (the first day of the week
in the Persian calendar), the number of new users increased slightly. On this day, the opposition
movement (the green movement) continued its protests in response to the invitation
of two defeated candidates, Mousavi and Karroubi. The other crucial event of this
day was a meeting of Iran's powerful Guardian Council, which had invited the three
defeated candidates to express their complaints. After that, the number of new users
kept declining.
Fig. 4 Number of tweets posted on each day. Peaks of the graph correspond to critical events
4 Trend Analysis
In this section we try to find out what happened on the most important days. To
decide whether something happened on a specific day, we look at the rate
of tweet publication and identify the most prolific days. To find out what
happened on these days, we extract the trending keywords of each day.
4.1 User Activity
To identify important days, we measure the activity of users in this network on
different days. Figure 4 shows the total number of published tweets per day. Peaks
of the graph correspond to critical events. The marked peaks are
explained below, and their trending keywords are extracted.
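The text does not state how the peaks in Fig. 4 were selected. One simple possibility, shown here only as a sketch under that caveat, is to keep days whose tweet count reaches a chosen threshold and is not smaller than the counts of the neighbouring observed days. The threshold, the function names and the toy dates are assumptions.

```python
from collections import Counter
from datetime import date

def daily_counts(tweet_dates):
    """Number of tweets published on each day."""
    return Counter(tweet_dates)

def peak_days(counts, min_tweets):
    """Days with at least min_tweets tweets and no larger count on either neighbouring observed day."""
    days = sorted(counts)
    peaks = []
    for i, day in enumerate(days):
        c = counts[day]
        left = counts[days[i - 1]] if i > 0 else 0
        right = counts[days[i + 1]] if i + 1 < len(days) else 0
        if c >= min_tweets and c >= left and c >= right:
            peaks.append(day)
    return peaks

# Toy creation dates of tweets (invented).
tweet_dates = ([date(2009, 6, 12)] * 1 + [date(2009, 6, 13)] * 2 +
               [date(2009, 6, 20)] * 5 + [date(2009, 6, 21)] * 3)
counts = daily_counts(tweet_dates)
peaks = peak_days(counts, min_tweets=2)
print(peaks)
```

Note that neighbours are taken among days that have at least one tweet, so gaps in the data are skipped rather than treated as zero-activity days.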
The first week after the Iranian presidential election was the most prolific period
for protesters. The marked peaks are:
(1) The tweet publication rate started to increase on June 12, the day of the
election, and reached its maximum on June 20 and 21. One day after the speech
of the supreme leader at Friday prayers (June 19), Mousavi insisted on annulment of the
election, and a rally took place in Tehran. Neda Aqa Sultan was killed, and news
and video of her death spread over the media.
(2) A rally in memory of the student protests of July 9, 1999 took place.
(3) Friday prayers were led by Hashemi Rafsanjani. Supporters of both reformist and
conservative parties took part in this event.
(4) The 4th peak corresponds to the Qods Day rally on September 18. Although it
was an annual rally in support of the Palestinian people, protesters came to the streets and
voiced their objections to the government crackdown.
(5) The Students' Day rally was held on November 4.
(6) One month later, on Scholars' Day, university students held
a protest against the government's policies.
(7) On Ashura, the most important religious event in Shia Islam, there was a rally
in support of the leaders of the green movement, which eventually turned violent.
(8) On February 11, there was a mass rally in support of the government, in which
protesters failed to show their disagreement with the crackdown.
(9) The last peak corresponds to the election anniversary.
4.2 Trending Keywords
In this section we focus on the trend of the day. To do this, we analyzed the tweets
published on each day and extracted their keywords. Our purpose is to show the
relationship between the keywords of tweets and the events that happened on that specific
day. We adapted the TF-IDF method [6] to our purpose. In this
section, the TF-IDF method is first explained briefly; then the changes we have made
are explained. Table 1 shows the results of this analysis.
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a
real-valued measure used for keyword extraction. Its value reflects how
important a word or term (t) is to a document (d) in a corpus⁵ (D).
The value of TF-IDF(D, d, t) increases proportionally with the number of times
term (t) appears in the document (d), but is offset by the number of documents of
the corpus (D) that contain the term (t).
$$\text{TF-IDF}(D, d, t) = \text{TF}(d, t) \times \text{IDF}(D, d, t) \tag{1}$$
where TF(d, t), the term frequency, is defined as the number of times a given term (t)
appears in document (d); and IDF(D, d, t) is defined as:
$$\text{IDF}(D, d, t) = \frac{\log |D|}{|\{d \in D : t \in d\}|} \tag{2}$$
where $|D|$ is the total number of documents in the corpus, and $|\{d \in D : t \in d\}|$
is the number of documents that contain term t.
The value of TF-IDF is low for words with low term frequency, and also for
words with high document frequency (i.e. stop words like "a", "the" and "of"). In
contrast, TF-IDF is maximized by a high term frequency (in the given document)
and a low document frequency of the term in the whole collection of documents. So
the words with a high TF-IDF value are those that appear many
times in a document but only rarely in other documents, i.e. the keywords of
that document.
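As a rough numerical illustration of Eqs. (1) and (2), assume (hypothetically) a corpus of |D| = 30 documents and natural logarithms; the text does not state the base of the logarithm, and the counts below are invented. A keyword that occurs 40 times in document d and appears in only 2 documents scores much higher than a stop word that occurs 100 times in d but appears in all 30 documents:

```latex
\text{TF-IDF}_{\text{keyword}} = 40 \times \frac{\ln 30}{2} \approx 40 \times 1.70 \approx 68.0
\qquad
\text{TF-IDF}_{\text{stop word}} = 100 \times \frac{\ln 30}{30} \approx 100 \times 0.113 \approx 11.3
```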
To identify the trends of the tweets published on a given day, we have used
a customized version of the TF-IDF method:
5 A corpus is a collection of documents.

Table 1 Trending keywords of important days
Peak | Keywords | Description
1 | Neda, Overlay, GR, Support, Please | Neda was killed. Some users showed their support for the Green Revolution (GR) by adding a green overlay to their Twitter avatar and asked others to join them.
2 | Amirabad, North, Kargar, Keshavarz, AmirKabir, Vila | A rally in memory of the student protests of July 9, 1999 took place near the dormitory of the Univ. of Tehran at Amirabad, North Kargar Ave. Some people gathered in Keshavarz Blvd. Some students of AmirKabir Univ. of Tech. gathered in Vila St. to protest.
3 | Shadi, Sadr, WomenOfIran, Sohrab, Taraneh | Shadi Sadr, a rights activist and founder of the website WomenOfIran, was beaten and taken away. Sohrab Aarabi, one of the protesters, had been killed some days earlier. Rumors about the rape of Taraneh Mousavi spread over the media.
4 | QD, Karimkhan, Valiasr, Tir, Clash | The Qods Day (QD) rally was held. Protesters clashed with police in Karimkhan St., near 7-Tir and Valiasr Squares in Tehran.
5 | Seven, Square, Riot, Injured, Shooting, Tir | Protesters clashed with police on the streets of Tehran, especially in 7-Tir Square. There were unconfirmed reports that shots were fired and some people were injured. Karroubi's son confirmed that his father had been injured.
6 | Valiasr, Enqelab, Entrance, Polytechnic, Surround, Amirkabir, Tavakkoli | Security forces clashed with protesters in Enqelab Sq. near the entrance to the Univ. of Tehran and in Valiasr Sq. near Amirkabir (Polytechnic) University. Majid Tavakkoli, an activist, was arrested.
7 | Nephew, Ashura, Bridge, Mirdamad, Station, Hafez | A nephew of Mousavi was killed on the day of Ashura. Clashes between police and protesters took place on Hafez Bridge and Mirdamad St. A police station was set on fire.
8 | Sadeqiyeh, Aryashahr, Eshraqi, Granddaughter, Squares, Enqelab | Clashes between people and security forces took place around Sadeqiyeh Sq. in the Aryashahr district and near Enqelab Sq. Zahra Eshraqi, granddaughter of Ayatollah Khomeini, was arrested and released shortly afterwards.
9 | Square, Vanak, Valiasr, Sidewalk | People marched silently on the sidewalks of Vanak and Valiasr Squares and other places to show their opposition.
1. We do not have the concept of a document here. Instead, we concatenate all tweets
published on the same day and consider them as a single document.
2. Trending keywords usually continue to appear in the tweets of the succeeding
days for a long time (until public interest in the trend vanishes). For
example, Neda is a term that is used on almost all days after Neda's death. Since
it appears in many documents, its TF-IDF value would not be high under the usual
TF-IDF method, and it would not be considered a keyword. Yet
at least for the first day on which it was used, it should be considered a trending
keyword. To overcome this problem, we apply the TF-IDF method within overlapping
time windows of 30 days (the focused day and the 29 days before
it).
So here, to calculate the value of TF(d, t), all tweets published on a specific day
are considered as the document d. To calculate the value of IDF(D, d, t), the
tweets of that day and those of the 29 days before it are considered as the corpus D.
With this method, TF-IDF values are calculated for all terms that
appear in the tweets published on a day, and the terms with the highest TF-IDF values are
considered the trending keywords of that day. Table 1 shows the trending keywords
of the important days mentioned earlier (Fig. 4). The description of what happened
on those days explains the relation between the keywords and the events of
that day.
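The windowed procedure described above can be sketched as follows. This is not the authors' implementation: the tokenizer, the handling of days without tweets (they are left out of the corpus), the natural logarithm and all names are assumptions. The IDF follows Eq. (2) as printed; the more common variant log(|D| / df) could be swapped in on the marked line.

```python
import math
import re
from collections import Counter
from datetime import date, timedelta

def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for real preprocessing."""
    return re.findall(r"\w+", text.lower())

def trending_keywords(tweets_by_day, day, window=30, top_k=5):
    """tweets_by_day: dict mapping a date to the list of tweet texts of that day."""
    # Corpus D: the focused day plus the (window - 1) days before it that have tweets.
    corpus_days = [day - timedelta(days=i) for i in range(window)]
    docs = {d: Counter(tok for tw in tweets_by_day[d] for tok in tokenize(tw))
            for d in corpus_days if tweets_by_day.get(d)}
    tf = docs.get(day, Counter())          # document d: all tweets of the focused day
    n_docs = len(docs)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in docs.values() if term in doc)
        idf = math.log(n_docs) / df        # Eq. (2) as printed; assumption: natural log
        scores[term] = freq * idf
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Toy data (invented tweets).
tweets_by_day = {
    date(2009, 6, 18): ["rally in tehran", "tehran rally today"],
    date(2009, 6, 19): ["friday prayers in tehran"],
    date(2009, 6, 20): ["neda killed in tehran", "neda video", "rally for neda"],
}
top = trending_keywords(tweets_by_day, date(2009, 6, 20), top_k=1)
print(top)
```

In the toy data, "tehran" occurs on every day of the window and is therefore down-weighted, while "neda" is frequent on the focused day and absent before it.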
5 Influential Websites
In this section, we analyze around 400,000 URLs that were cited in the tweets. As
mentioned, tweets can link to other websites and online content, e.g. news agencies
and social media. Because of the limit on the number of characters in each
tweet, URLs are usually shortened by URL shortening services such as bit.ly or
Table 2 The most popular websites referenced in the tweets

Rank Website #Tweets #Users Rank according to #Users
1 youtube.com 29,393 1,833 2
2 twitpic.com 16,408 1,380 3
3 twubs.com 11,101 390 24
4 google.com 9,799 1,329 4
5 facebook.com 8,943 1,258 5
6 twitter.com 5,729 1,135 6
7 formspring.me 4,526 519 15
8 foozools.com 4,216 52 420
9 twitlonger.com 3,577 493 21
10 hamsedyeiran.blogfa.com 3,385 11 2,146
11 iran.whyweprotest.net 3,358 502 19
12 fun140.com 3,267 855 7
13 solaleh7.blogspot.com 3,014 20 1,124
14 helpiranelection.com 2,647 2,038 1
15 cnn.com 2,135 700 10
16 rahesabz.net 2,028 347 29
17 bbc.co.uk 2,015 717 8
18 friends.myspace.com 1,975 709 9
19 reuters.com 1,870 499 20
20 feeds.feedburner.com 1,788 130 133
21 lolquiz.com 1,717 638 11
22 tinyurl.com 1,634 513 17
23 legacy.com 1,511 47 460
24 flickr.com 1,489 524 14
25 nytimes.com 1,479 561 13
26 guardian.co.uk 1,418 623 12
27 myloc.me 1,397 514 16
28 payvand.com 1,346 318 33
29 solaleh8.blogspot.com 1,281 16 1,460
30 friendfeed.com 1,216 352 27
tinyurl.com. We tried to resolve the original URLs in these cases; however, some of them
were no longer valid. Here we report only the valid ones.
Table 2 shows the 30 most referenced websites along with the number of tweets and
the number of users that mentioned them. The first rows of the table are occupied
by important websites like YouTube (for video coverage of the events), Twitter
and related sites like TwitPic, Twubs, and TwitLonger (for their rapid information
diffusion potential), Google (for its news services), and Facebook and Formspring
(for social networking).
The first website that belongs to a Persian group is Foozools, followed by
Hamsedyeiran, a Persian weblog. It is interesting to note that these sites were
cited by only a small number of users. This suggests that a small group of users tried to
exploit the situation and advertise their favorite websites by abusing the #iranelection
tag (these kinds of websites are highlighted). Hamsedyeiran is similar to
Solaleh7 in its content, and they all belong to an armed terrorist group known as
Monafeqin.
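The contrast between the number of tweets and the number of distinct users per website, which is what exposes the sites promoted by only a handful of accounts, can be computed as sketched below. The sketch assumes that shortened URLs have already been expanded; the function names, the example URLs and the per-occurrence counting are illustrative assumptions.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def site_of(url):
    """Host name of a URL, lowercased and without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def rank_sites(citations):
    """citations: iterable of (user_id, expanded_url), one per URL occurrence in a tweet.
    Returns (site, n_citations, n_users) tuples, most cited site first."""
    n_tweets = Counter()
    users = defaultdict(set)
    for user, url in citations:
        site = site_of(url)
        if site:
            n_tweets[site] += 1
            users[site].add(user)
    return sorted(((s, n_tweets[s], len(users[s])) for s in n_tweets),
                  key=lambda row: (-row[1], row[0]))

# Toy citations (invented): one account repeatedly promotes the same weblog.
citations = [("u1", "http://www.youtube.com/watch?v=1"),
             ("u2", "http://youtube.com/watch?v=2"),
             ("u3", "http://example.blogfa.com/post/1"),
             ("u3", "http://example.blogfa.com/post/2"),
             ("u3", "http://example.blogfa.com/post/3")]
ranking = rank_sites(citations)
for row in ranking:
    print(row)
```

A site with many citations but very few distinct users, like the invented weblog here, is a candidate for the kind of tag abuse described above.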
The top websites that specifically address the green movement are
HelpIranElection.com, Iran.WhyWeProtest.net, and RaheSabz.net. Finally come the news agencies
and newspapers such as CNN, BBC, Reuters, the NY Times, and the Guardian.
5.1 Popular Websites for English-Speaking Users
Table 3 shows the most popular websites among English-speaking users (ranked by the
number of users who cited them in their tweets), along with the number
of Persian-speaking users who cited them. Highlighted rows show websites that are popular
only among English-speaking users.
HelpIranElection.com is a website that encouraged Twitter users to add a green
overlay or green ribbon to their avatar (green was the official color of the
movement).
5.2 Popular Websites for Persian-Speaking Users
Table 4 shows the most popular websites among Persian users (ranked by the number of
users who referred to them in their tweets), along with the number of English-speaking
users who referred to them. Highlighted rows show websites that are popular only
among Persians.
The top websites in this table are almost the same as in Table 3, with the exception of
Rahesabz.net, one of the most popular news websites related to the green movement.
It started operating a few days after the presidential election (June 20, 2009).
Since this website is written in Persian, it is not surprising that English-speaking users
rarely referred to it.
6 Follower-Followee Network
Users of Twitter can follow other users and may in turn be followed by others.
The graph in Fig. 5 shows the follower-followee relationships among the users in our
dataset. Nodes correspond to users. The size of a node is proportional to the number of
followers the user has, and its color shows the user's language. Nodes that have
Table 3 The most popular web sites among English-speaking users

Rank (by #EN-Users) Website #EN-Users #P-Users Rank (by #P-Users)
1 helpiranelection.com 1,732 306 28
2 youtube.com 485 1,348 1
3 twitpic.com 353 1,027 2
4 google.com 310 1,019 3
5 twitter.com 268 867 5
6 facebook.com 253 1,005 4
7 twitition.com 175 205 54
8 cnn.com 146 554 9
9 guardian.co.uk 138 485 11
10 bbc.co.uk 138 579 7
11 fun140.com 137 718 6
12 friends.myspace.com 133 576 8
13 lolquiz.com 132 506 10
14 iran.whyweprotest.net 129 373 20
15 nytimes.com 128 433 15
16 iran.greenthumbnails.com 125 188 64
17 tinyurl.com 116 397 19
18 flickr.com 103 421 17
19 huffingtonpost.com 97 416 18
20 digg.com 90 235 42
21 pbs.org 81 310 27
22 reuters.com 75 424 16
23 online.wsj.com 72 322 23
24 formspring.me 71 448 14
25 twubs.com 71 319 24
26 gr88.tumblr.com 56 222 46
27 trackitdown.net 55 156 87
28 myloc.me 54 460 12
29 twitspam.org 50 130 112
30 wikipedia.org 48 202 57
link to each other are placed closer together. To visualize this network, we used the ForceAtlas2
layout [5] of the open-source software Gephi⁶ [3].
It is clear from the graph that users are divided into two large communities according
to their language, which is not surprising. The users, who are placed between the
6 http://gephi.github.io.
Table 4 The most popular web sites among Persian-speaking users

Rank (by #P-Users) Website #P-Users #EN-Users Rank (by #EN-Users)
1 youtube.com 1,348 485 2
2 twitpic.com 1,027 353 3
3 google.com 1,019 310 4
4 facebook.com 1,005 253 6
5 twitter.com 867 268 5
6 fun140.com 718 137 11
7 bbc.co.uk 579 138 10
8 friends.myspace.com 576 133 12
9 cnn.com 554 146 8
10 lolquiz.com 506 132 13
11 guardian.co.uk 485 138 9
12 myloc.me 460 54 28
13 twitlonger.com 450 43 33
14 formspring.me 448 71 24
15 nytimes.com 433 128 15
16 reuters.com 424 75 22
17 flickr.com 421 103 18
18 huffingtonpost.com 416 97 19
19 tinyurl.com 397 116 17
20 iran.whyweprotest.net 373 129 14
21 rahesabz.net 340 7 321
22 amazon.com 324 41 38
23 online.wsj.com 322 72 23
24 twubs.com 319 71 25
25 etsy.com 315 36 47
26 friendfeed.com 314 38 42
27 pbs.org 310 81 21
28 helpiranelection.com 306 1,732 1
29 tumblr.com 302 20 107
30 iranian.com 294 48 32
two communities, play an important role in translating and communicating
events and news from Iran to the outside world and vice versa.
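The text identifies these in-between users visually from the layout. As a complementary, purely illustrative sketch (not the authors' method), one could score each user by the share of neighbours who belong to the other language group; users with a high share on both sides of the graph are candidates for this bridging role. The function name, the "P"/"EN" labels and the toy edges are assumptions, and every user is assumed to have a language label.

```python
def cross_language_share(follows, language):
    """follows: iterable of (follower, followee); language: dict user -> 'P' or 'EN'.
    Returns dict user -> fraction of the user's neighbours (either direction) in the other group."""
    neighbours = {}
    for a, b in follows:
        if a == b:
            continue                              # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    share = {}
    for user, nbrs in neighbours.items():
        other = sum(1 for n in nbrs if language[n] != language[user])
        share[user] = other / len(nbrs)
    return share

# Toy follower-followee edges (invented).
follows = [("p1", "p2"), ("p2", "p1"), ("p1", "e1"), ("e1", "e2"), ("e2", "p1")]
language = {"p1": "P", "p2": "P", "e1": "EN", "e2": "EN"}
share = cross_language_share(follows, language)
print(share)
```

In the toy graph, p1 has two of three neighbours in the English-speaking group, whereas p2 is connected only to Persian-speaking users.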
In the periphery of the graph, many English-speaking and a few Persian-speaking
users can be seen. These users are only loosely connected to the core and
participated in the discussions occasionally. Although their links to the core graph are weak,
they are numerous and connected to other communities, so they play a big role in
the spread of news to the outside world. In the next subsection we take a deeper
look into the core communities.
Fig. 5 Follower-followee network: Persian- and English-speaking users are shown in green and
blue, respectively
6.1 Political Groups
In this subsection, we look at the users who have in some way supported one of three
political groups: Monafeqin, Jebheh Mosharekat, and Mojahedine Enqelab. Monafeqin,
as explained earlier, is an armed terrorist group based in Iraq. Jebheh Mosharekat
is a reformist group very close to Khatami, the former president of Iran. Mojahedine
Enqelab is another reformist group.
We first find the users who support these political groups and then
compare the communities of supporters with each other. To do this,
we identified all users who had published at least one tweet in support of one of the
three groups. To determine which tweet supports which group, we first
collected keywords related to each group (e.g. the name of the group
and the names of its well-known members). We then found the users who had used
those keywords in their tweets. Finally, to establish each user's opinion (positive or negative)
about the group, we read some of the user's tweets that contained
those keywords. If a user had published at least one tweet in support of a group,
we marked him/her as a supporter.
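The keyword step of this procedure can be sketched as below. The sketch only collects candidate users and their matching tweets; whether a tweet is actually supportive was decided by reading the tweets, which the code does not attempt. Group names, keywords and tweets in the example are placeholders, not the lists used in the study.

```python
def candidate_supporters(tweets, group_keywords):
    """tweets: iterable of (user_id, text); group_keywords: dict group -> list of keywords.
    Returns dict group -> dict user -> list of matching tweets, for manual review."""
    candidates = {group: {} for group in group_keywords}
    for user, text in tweets:
        lowered = text.lower()
        for group, keywords in group_keywords.items():
            # Case-insensitive substring match against any keyword of the group.
            if any(k.lower() in lowered for k in keywords):
                candidates[group].setdefault(user, []).append(text)
    return candidates

# Placeholder keywords and tweets (hypothetical).
group_keywords = {"group_a": ["alpha"], "group_b": ["beta party"]}
tweets = [("u1", "Alpha is right"),
          ("u2", "I oppose the Beta Party"),
          ("u3", "nothing relevant here")]
candidates = candidate_supporters(tweets, group_keywords)
print(candidates)
```

User u2 in the example matches a keyword while opposing the group, which is why the manual reading step is needed before marking anyone as a supporter.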
Fig. 6 a The core communities: green, Persian-speaking users; blue, English-speaking users.
b Supporters of Monafeqin. c Supporters of Jebheh Mosharekat. d Supporters of Mojahedine Enqelab
For better visualization, the core community of Fig. 5 is magnified in Fig. 6a.
Figure 6b–d shows the supporters of Monafeqin, Jebheh Mosharekat, and Mojahedine
Enqelab, respectively. It can be seen that the supporters of Monafeqin are a few small
nodes gathered in one place. The supporters of the two other groups, however, are
scattered over the whole Persian-speaking user community and include many important
(large) nodes. These two groups have many supporters in common. Both of
them supported Mousavi (the defeated candidate) in the election.
7 Conclusion
In this paper we studied the social network among Twitter users who were
interested in the Iranian presidential election and its aftermath. By analyzing the
number of users who signed up to Twitter in different months and on different days, we saw that
the restrictions the Iranian government put on media during the protests moved
interested people to online social media and social networks in order to diversify their
sources of information. Some activists used these media to organize their protests
and to communicate with the outside world for help and sympathy. Meanwhile, some
small groups tried to abuse this opportunity and advertise their websites by sending
spam.
By using a customized version of the TF-IDF method, we extracted the trending
keywords of the tweets published on each day. The results showed
a strong relationship between the published tweets and the events that occurred on
each day.
The top URLs that appeared in the tweets showed that social networking and social
media websites were the most influential websites. We also observed that two large
communities (Persian- and English-speaking users) helped in communicating
news, events and messages from Iran to the outside world and vice versa. We also looked at some
sub-communities and found that some of them, although a small minority, were
very prolific but less influential, while other sub-communities were dispersed in the
network, were followed by many other users and were more influential. In the future
we would like to investigate the spread of information in the network and find out
how content might affect the rate at which tweets spread.
Acknowledgments We would like to thank Kaveh Ketabchi for his help in collecting the dataset.
References
1. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they
blog. In: Proceedings of the 3rd international workshop on link discovery. ACM, pp 36–43
2. Albrecht S, Lübcke M, Hartig-Perschke R (2007) Weblog campaigning in the German
Bundestag election 2005. Soc Sci Comput Rev 25(4):504–520
3. Bastian M, Heymann S, Jacomy M et al (2009) Gephi: an open source software for exploring
and manipulating networks. ICWSM 8:361–362
4. Drezner DW, Farrell H (2008) Introduction: blogs, politics and power: a special issue of public
choice. Public Choice 134(1–2):1–13
5. Jacomy M, Heymann S, Venturini T, Bastian M (2011) ForceAtlas2, a graph layout algorithm
for handy network visualization. Paris, p 44. http://www.medialab.sciences-po.fr/fr/publications-fr
6. Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval.
J Doc 28(1):11–21
7. Kelly J, Etling B (2008) Mapping Iran's online public: politics and culture in the Persian
blogosphere. Berkman Center for Internet and Society and Internet and Democracy Project,
Harvard Law School
8. Ketabchi K, Asadpour M, Tabatabaei SA (2013) Mutual influence of Twitter and postelection
events of Iranian presidential election. Procedia-Soc Behav Sci 100:40–56
9. Khonsari KK, Nayeri ZA, Fathalian A, Fathalian L (2010) Social network analysis of Iran's
green movement opposition groups using Twitter. In: 2010 International conference on
advances in social networks analysis and mining (ASONAM). IEEE, pp 414–415
10. Koop R, Jansen HJ (2009) Political blogs and blogrolls in Canada: forums for democratic
deliberation? Soc Sci Comput Rev 27(2):155–173
11. McKenna L, Pole A (2008) What do bloggers do: an average day on an average political blog.
Public Choice 134(1–2):97–108
12. Qazvinian V, Rassoulian A, Shafiei M, Adibi J (2007) A large-scale study on Persian weblogs.
In: Proceedings of LINKKDD
13. Tumasjan A, Sprenger TO, Sandner PG, Welpe IM (2010) Predicting elections with Twitter:
what 140 characters reveal about political sentiment. ICWSM 10:178–185
14. Zhou Z, Bandari R, Kong J, Qian H, Roychowdhury V (2010) Information resonance on
Twitter: watching Iran. In: Proceedings of the first workshop on social media analytics. ACM,
pp 123–131
Entanglement in Multiplex Networks:
Understanding Group Cohesion
in Homophily Networks
Benjamin Renoust, Guy Melançon and Marie-Luce Viaud
Abstract The analysis and exploration of a social network depends on the type of
relations at play. Homophily (similarity) relationships form an important category
of relations linking entities whenever they exhibit similar behaviors. Examples of
homophily networks examined in this paper are: co-authorship, where homophily
between two persons follows from having co-published a paper on a given topic;
movie actors having played under the supervision of the same movie director; and
members of an entrepreneur network having exchanged ideas through discussion threads.
Homophily is often embodied through a bipartite network where entities (authors,
movie actors, members) connect through attributes (papers, directors, discussion
threads). A common strategy is then to project this bipartite graph onto a single-type
network. The resulting single-type network can then be studied using standard
techniques such as community detection or by computing various centrality indices. We
revisit this type of approach and introduce a homogeneity measure inspired by past
work by Burt and Schøtt. Instead of considering a projection of a bipartite network, we
consider a multiplex network which preserves both entities and attributes as our core
object of study. The homogeneity of a subgroup depends on how intensely and how
equally interactions occur between the layers of edges giving rise to the subgroup. The
measure thus differentiates between subgroups of entities exhibiting similar
topologies depending on the interaction patterns of the underlying layers. The method
is first validated using two widely used datasets. A first example looks at authors
of the IEEE InfoVis Conference (InfoVis 2007 Contest). A second example looks
at homophily relations between movie actors that have played under the direction of
the same director (IMDB). A third example shows the capability of the methodology
B. Renoust (B) G. Melanon

CNRS UMR 5800 LaBRI, INRIA Bordeaux Sud-Ouest,
Campus Universit Bordeaux I, Talence, France
e-mail: benjamin.renoust@labri.fr; renoust@gmail.com
G. Melanon
e-mail: guy.melancon@labri.fr
M.-L. Viaud B. Renoust
Institut National de LAudiovisuel (INA), Paris, France
e-mail: mlviaud@ina.fr

DOI 10.1007/978-3-319-12188-8_5
90 B. Renoust et al.
to deal with weighted homophily networks, pointing at subtleties revealed from the
analysis of weights associated with interactions between attributes.
Keywords Group cohesion Homophily Entanglement index Bipartite graph

Community reliability
1 Introduction
The analysis and exploration of a social network depends on the type of relations at
play. Borgatti [7] proposed a taxonomy organizing relations into four possible
categories, among which homophily (also referred to as similarity) links actors
exhibiting similar attributes such as membership in a club or interest group [28].
These types of ties do not represent actual social ties themselves, but might lead
to a higher probability of a tie developing between the members sharing similar
attributes. Examples are co-author networks, where homophily between two persons
follows from co-authorship; networks of movie actors having played under the
supervision of the same director; or networks of members having exchanged ideas
through discussion threads.
The second type of tie covers social relationships (for instance affective relationships
such as friendship), usually spanning over time. The third type captures joint
interactions observed through discrete events such as calling each other or travelling
together. The last type of tie describes flow (tangible or intangible) between entities
(migrants moving between places, air traffic passengers between airports, etc.). This
paper focuses on networks induced from homophily relations.
Homophily is often embodied through a bipartite network where entities (authors;
movie actors; members) connect through attributes (papers; directors; discussion
threads). Guillaume and Latapy [17] advocate bipartite graphs as being universal
models for complex networks, hence offering additional motivation for using these
graphs to describe homophily relations. Indeed, attributes of different natures can
also be seen as another type of entity, interacting together across the edges of the
homophily network.
When dealing with bipartite graphs, a common strategy is to project them onto a
single-type network with entities of a same type. Edges are sometimes weighted based
on how much entities interact through attributes. The resulting single-type network
often tends to have high edge density, with a propensity to contain cliques (depending
on the affiliation data used to build the bipartite graph) [17]. It may nevertheless be
studied using standard techniques such as community detection using edge density,
or the computation of various centrality indices.
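The projection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `project`, the toy author/paper data and the choice of weighting an edge by the number of shared attributes are all hypothetical and do not come from the datasets studied in this chapter.

```python
from collections import defaultdict
from itertools import combinations


def project(associations):
    """Project a bipartite entity-attribute mapping onto an entity network.

    associations maps each entity to the set of attributes it is linked to.
    Returns a dict mapping an (entity, entity) pair to the number of shared
    attributes, which is one possible edge weight among others.
    """
    by_attribute = defaultdict(set)
    for entity, attributes in associations.items():
        for attribute in attributes:
            by_attribute[attribute].add(entity)

    weights = defaultdict(int)
    for entities in by_attribute.values():
        # Each attribute turns the entities sharing it into a clique.
        for a, a2 in combinations(sorted(entities), 2):
            weights[(a, a2)] += 1
    return dict(weights)


# Hypothetical data: three authors and the papers they signed.
authors = {
    "a1": {"p1", "p2"},
    "a2": {"p1", "p2"},
    "a3": {"p2"},
}
projection = project(authors)
```

The single paper "p2" shared by all three authors is enough to produce a triangle, which illustrates why projected networks tend to be dense and rich in cliques.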
Such study of the bipartite projection can however hide subtle characteristics
of the original data since it can create relationships that do not exist (Fig. 1), hence
inducing many cliques that may not be relevant. Many different attributes can also
generate such cliques as illustrated in Fig. 3. One option from [29] is the computation
of a one-mode projection from the most significant edges, but it still presents a loss of
information. Our methodology proposes to directly study the multiplex networks (as
defined in [10, 24]) and remains compatible with data modeled with a bipartite
network. In this model, the different edge attributes refer to different edges across
different layers of the multiplex network.
Fig. 1 A side effect of the bipartite projection: we start from a multiplex network (on the left)
associating entities (nodes) through different attributes (edges in color), then convert it into a bipartite
network (middle) with the right (round shape) entities corresponding to adjacent edges in the first
network, and finally project the bipartite network onto another network (right). We can observe the
appearance of new edges. Note that the right multiplex network could be considered as an
entity-similarity multiplex network
Referring to the work of Manski [27], we take the notion of a group as a
central paradigm guiding the analysis of homophily networks. Numerous authors
have indeed confronted homophily with many social behaviors or phenomena (e.g.
influence, contagion, information diffusion) [1, 3, 37], questioning Manski's group
effect as the driving force explaining the observed phenomenon.
Taking inspiration from past work by Burt and Schøtt [9], this paper introduces a
novel use of a node index along with two multiplex network measures, supporting the
interactive inspection of a group in a homophily social network as a means to question
the drivers of its internal cohesion. The key idea we exploit is to look at attributes
and investigate how they interact. That is, although the focus of the analysis is on
entities, cohesion of a group is measured through interactions taking place between
attributes involved between actors in the group of entities.
From now on, entities will refer to elements $a, a' \in A$ while attributes will
refer to elements $b, b' \in B$. When considering a network of co-authors linked
through keywords (indexing papers), entities will correspond to authors while key-
words will be seen as attributes (see Sect. 3.3). When considering a network where
movie directors are linked through movie actors they have directed, entities will map
to movie directors while movie actors will be considered as attributes (see Sect. 3.1).
The notion of group here is rather abstract and can be either user-defined or com-
puted using a variety of methods, from data clustering to community detection using
modularity criterion, for instance. Although advances have been made on that front
in the past decades [15, 21] no algorithm or solution imposes itself as being superior
in all situations. Questioning this notion of group can help in understanding and
validating the output of algorithms, which is a challenging analysis task.
Our paper contributes an approach designed to help users evaluate the reliability
of a proposed group structure. Because similarity between entities is most often mea-
sured based on co-occurrences of attributes, we provide a means to simultaneously
work on two networks derived from the original homophily multiplex network or
bipartite graph: one directly linking entities, and the other directly linking attributes.
The notion of a group we consider here depends on the context: it may be a cluster
computed from any algorithm, a subset of entities selected by a user, or the result of
a query on a network, for instance.
This paper extends the previous ASONAM publication [34] and our work
contributes with one node index and two multiplex network measures computed on
any group of entities indicating the overall cohesion of the group measured through
the intensity and homogeneity of interactions of their co-occurring attributes (that
is the entanglement of a multiplex network). We extend this approach to weighted
interactions. Exploring the network, selecting a group or subset of co-occurring
attributes and getting feedback on internal entanglement, analysts can validate the
model implicitly supported by the grouping procedure.
Our method has been validated based on three different datasets, among which
the first two are widely known and used. A first example looks at authors of the
IEEE InfoVis Conference (InfoVis 2004 Contest) [19]. A second example looks at
homophily relations between movie actors that have played under the direction of a
same director (IMDB) [40]. Our third example examines the Edgeryders community
forum [39] where homophily emerges from discussion threads. This last example
shows the capability of our methodology to deal with weighted homophily networks,
pointing at subtleties revealed from the analysis of weights associated with interac-
tions between attributes.
Related work. Bipartite graphs form an important modeling tool in social network
analysis, supporting two-mode concepts [5]. They form an important analytical arti-
fact to study homophily relations [13], and were even claimed as universal mod-
els for complex networks [17]. The literature covers a wide variety of approaches
dealing with different properties of bipartite graphs and homophily networks. An
optional but common strategy consists in projecting the graph inducing relation-
ships between entities of a same type (see [6, 20, 30, 33, 36, 42], for instance),
with the obvious disadvantage of containing lots of cliques, the relevancy of which
can be questioned [14]. Neal [29] recently introduced an approach computing a
one-mode projection from the most significant edges based on local likelihood. Latapy
et al. [25] offer to study in a bipartite network the neighborhood overlaps of a node
so that the network would stay connected even without it. Fujimoto et al. [16] studied
network autocorrelations in bipartite network as a way to measure the influence of
nodes of one mode into the formation of edges in the opposite mode. Other research
also focuses on finding bicliques (such as in [5, 32]) which can be suspected to form
cohesive subgroups. Little work has yet been proposed for the study of multiplex
networks; we can mention the efforts of [10, 24] to bring a mathematical
formulation of multiplex networks with tensors, although this effort is not focused
on the direct use and applications of multiplex networks.
Because of their wide applicability and because they also offer a straightforward
graphical representation of the data, bipartite graphs have been recently used in the
design of a website traffic analysis system [11]. Finally, Kaski et al. [22] studied
homophily in gene networks (similarity in gene expressions) in bio-informatics with
emphasis on the trustworthiness of similarities, which places it close in spirit to our
work.
2 A New Look at Homophily Networks: Introducing the Entanglement Index in Multiplex Networks
This section takes a closer look at homophily networks and describes the general
framework we use.
As we shall see, cohesion of a group is easier to achieve with smaller groups.
Inspecting a group in an effort to understand why and how cohesion is embodied
in it requires validation based on user knowledge. This only
makes sense when conducted on small-scale groups, gathering hundreds of nodes at
most.
Simple questions come to mind when inspecting a group, such as "How can
we assess that a group really forms a cluster?", "How can we make sure all entities of
a cluster really belong to it?", "Should we suspect the group to contain marginal
(outlier) entities?", "What are the attributes that tie the entities together?", etc.
A central ingredient we use to answer these questions is a set of metrics that
capture the homogeneity and intensity of interactions between attributes associated
with entities. These metrics can be viewed as an aid to assess the internal cohesion
of a group.
2.1 Interaction Networks
Our starting point is a set of entities $a, a', a'', \ldots$ (type $A$) with associated attributes
$b, b', b'', \ldots$ (type $B$). Figure 2a provides an example where entities are authors
(of papers) and attributes are keywords (indexing papers). This is a typical situation
where a homophily relationship can be inferred, for example between authors (having
published a paper). We may build a bipartite network where entities (authors) $a, a' \in A$
necessarily connect to attributes (keywords) $b, b' \in B$ while there are no direct
links between entities nor between attributes (Fig. 2b).
Denote the bipartite entity-attribute network as $G = (A \cup B, E)$ with edges $a - b$
whenever entity $a$ is associated with attribute $b$ (see Fig. 2b). Referring to Opsahl [31],
there is often a primary node set and a secondary node set in bipartite (or two-mode)
networks. For Opsahl [31], the primary node set is responsible for the creation of ties,
while the secondary node set characterizes these ties. In a multiplex network, this
secondary node set represents the different layers of interactions. Hence, two other
networks are derived from this entity-attribute network, namely an entity interaction
network $G_A$ and an attribute interaction network $G_B$. The entity network is usually
built from the entity-attribute network by projecting paths $a - b - a'$ (linking entities
$a, a' \in A$ through attribute $b \in B$) onto an edge $a - a'$ directly linking entities.
We also need to store the attribute $b$ as a label for the edge $a - a'$. Edges in $G_A$ are
thus labelled by subsets of attributes (all attributes $b, b', \ldots$ collected from triples
$a - b - a'$, $a - b' - a'$, $\ldots$).
Fig. 2 The initial data in this example is formed of authors with associated keywords A, B, C, ...
(a) (e.g. keywords indexing papers). This situation is modeled as a bipartite network linking authors
to keywords (b) (authors having published papers with given keyword, see Sect. 3.3). We then
consider the projected author interaction network with keywords as multiple edges (c) from which
we derive a keyword interaction network (d)
Because we are focusing on entity group cohesion and on attribute co-occurrence,
we filter out some of the edges. Loops are discarded to obtain the entity interaction
network $G_A = (A, E_A)$. The resulting network is shown in Fig. 2c.
Note that, in the case of a multiplex network such as an author co-publication
network, the entity interaction network is defined by the multiple relationships across
authors. Going through the bipartite model would imply direct relationships across
authors that are not expected as detailed in Fig. 1. The construction of the entity
interaction network remains the same.
Links in the attribute interaction network $G_B = (B, E_B)$ are built from attributes
$b$ that co-occur at least once with another attribute $b'$ (through at least two entities).
That is, there must exist at least two paths $a - b - a'$ and $a - b' - a'$ to infer the edge
$b - b'$ in $E_B$. Note that this network is not obtained by projecting paths $b - a - b'$
onto edges $b - b'$. For instance, $E_B$ does not contain edges connecting attributes that
only concern a single entity. The resulting network is shown in Fig. 2d. The attribute
interaction network is a central artifact in studying group cohesion.
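To make the two constructions concrete, here is a minimal Python sketch that derives a labelled entity network and the attribute interaction network from an entity-attribute mapping. The function names and the toy data are hypothetical, and the code reflects one reading of the construction: two attributes are linked when they label a common edge of the entity network.

```python
from itertools import combinations


def entity_network(associations):
    """Return E_A as a dict: (a, a') -> set of attributes shared by a and a'."""
    edges = {}
    for a, a2 in combinations(sorted(associations), 2):
        shared = associations[a] & associations[a2]
        if shared:  # pairs are distinct entities, so no loop is ever created
            edges[(a, a2)] = shared
    return edges


def attribute_network(entity_edges):
    """Return E_B: pairs of attributes that label a common edge of E_A."""
    pairs = set()
    for labels in entity_edges.values():
        pairs.update(combinations(sorted(labels), 2))
    return pairs


# Hypothetical authors with their keywords.
authors = {
    "a1": {"C", "D"},
    "a2": {"C", "D", "E"},
    "a3": {"E", "L"},
    "a4": {"L"},
}
E_A = entity_network(authors)
E_B = attribute_network(E_A)
```

In this toy example, keywords D and E both belong to author a2, and E and L both belong to a3, yet neither pair is linked in `E_B` because they never label the same author-author edge. Projecting paths of the form attribute-entity-attribute would have linked them.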
Figure 3 underlines the nuance we wish to bring into the analysis of homophily
networks. Consider entities (depicted here as pale blue squares) with attributes
A, ..., E; entities are linked by an edge whenever they share an attribute. Observe
that in both situations the pairwise distance between entities is the same (any
two entities share either one or two attributes), ending in identical topologies of the
entity network $G_A$. As a consequence, based on pairwise distance, these two
groups are somehow equivalent.
Now, consider the attribute networks (with circle nodes) derived from these two
situations. In the first situation (Fig. 3a), all entities having attribute A gives this
attribute a central position: if there were a reason explaining why these people form
a group, it would certainly rely on the group gathering around A, the other attributes
being somehow accessory. The second situation (Fig. 3b) is much more balanced
(although attributes do not mix as intensely as they could). This small example points
at situations where the analysis may be misled when solely inspecting the single-type
people network. The attribute interaction network actually is key to understanding
how attributes interact within a group.
As these simple examples show, the inspection of a group of entities with
associated attributes raises several questions. It might be important to know whether
attributes equally map to all entities in the group, for instance. Conversely, a
misleading transitivity effect may be suspected to take place. Indeed, having
attributes $b, b'$ co-occurring between entities $a$ and $a'$, and attributes $b', b''$
co-occurring between entities $a'$ and $a''$, may lead one to believe that $b, b', b''$
simultaneously co-occur between all three $a, a', a''$. Although the case can be easily
spotted when only considering a few entities and attributes, the transitivity effect
becomes rapidly confusing as we increase the number of entities and attributes.
Fig. 3 An example underlining the nuance we emphasize by looking at how attributes A, ..., E
interact. In both figures, the square node graph (left) links type A entities (e.g. authors, movie
directors) whenever they are linked to a same entity of type B (e.g. keywords, movie actors). Entities of
type B appear as labels on induced links. The round node graph (right) describes how type B entities
interact, that is when they co-occur as labels on an edge. The type B interaction network clearly
distinguishes the two situations, whereas the projected single-type A networks show identical
topologies. a Centralized interactions. b Cyclic interaction
We address this issue by looking at how well attributes mix within a group. This is
accomplished using the entanglement index introduced in the forthcoming sections.
This index is computed for each attribute (or layer) b, measuring how homogeneously
and intensely an attribute co-occurs with all other attributes in a group of entities. As
we shall see, global entanglement homogeneity and intensity at the group level can
then be computed from the individual attribute entanglement indices. The definition
of the entanglement index makes it so that optimal homogeneity is reached whenever
attributes have the same entanglement index, that is when all entities have the exact
same associated attributes, and that all attributes equally co-occur within entities;
and the optimal intensity is reached whenever all entities share exactly all attributes.
2.2 Attribute Interaction Matrices and the Entanglement Index
Edges $b - b' \in E_B$ moreover carry weights $n_{b,b'}$ indicating how often attributes
$b$ and $b'$ co-occur between entities in the considered group. We also define $n_{b,b}$ to count the
number of edges in $E_A$ carrying the attribute $b$. The matrix $N_B$ collecting all these
$n_{b,b'}$ entries gives rise to another matrix $C_B$ filled with ratios $c_{b,b'} = n_{b,b'} / n_{b',b'}$.
The value $c_{b,b'}$ may be viewed as computing the (conditional) frequency that an
edge be of type $b$ given it is of type $b'$. We give $c_{b,b}$ another definition, namely $c_b$,
the proportion of edges carrying attribute $b$ among all $N$ edges in $G_A = (A, E_A)$,
that is $c_b = n_{b,b} / N$.
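The definitions above translate directly into code. The sketch below assumes the entity network is given as a list of attribute sets, one per edge; the function name `interaction_matrices` and the labelled edges are hypothetical and are not the data behind Fig. 2.

```python
def interaction_matrices(entity_edges, attributes):
    """Compute N_B and C_B from the labelled edges of the entity network.

    entity_edges: list of attribute sets, one per edge of E_A.
    attributes: ordered list fixing the row and column order.
    """
    total = len(entity_edges)  # N, the number of edges in G_A
    size = len(attributes)
    # n[i][j] counts the edges bearing both attributes i and j.
    n = [[sum(1 for labels in entity_edges if b in labels and b2 in labels)
          for b2 in attributes] for b in attributes]
    c = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                c[i][i] = n[i][i] / total    # c_b = n_{b,b} / N
            elif n[j][j] > 0:
                c[i][j] = n[i][j] / n[j][j]  # c_{b,b'} = n_{b,b'} / n_{b',b'}
    return n, c


# Hypothetical labelled edges of an entity network.
edges = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Y", "Z"}]
N_B, C_B = interaction_matrices(edges, ["X", "Y", "Z"])
```

Note that `C_B` is not symmetric in general: here every edge bearing Z also bears Y, while only one in three edges bearing Y also bears Z.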
Consider the example in Fig. 2. Starting from authors $a \in A$ having published
papers with keywords $b \in B$ (attributes), we build a bipartite graph where authors
$a, a'$ link through keywords $b$ whenever $a$ and $a'$ have co-authored a paper with
keyword $b$ (Fig. 2b). A single-type graph is obtained by inducing edges between
authors labeled with keywords (Fig. 2c). The resulting keyword interaction network
is shown in Fig. 2d. The matrices $N_B$ and $C_B$ (built over keywords C, D, E and L)
then read:
$$N_B = \begin{pmatrix} 3 & 3 & 1 & 0 \\ 3 & 3 & 1 & 0 \\ 1 & 1 & 3 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \qquad C_B = \begin{pmatrix} 0.75 & 1.00 & 0.33 & 0.00 \\ 1.00 & 0.75 & 0.33 & 0.00 \\ 0.33 & 0.33 & 0.75 & 1.00 \\ 0.00 & 0.00 & 0.33 & 0.25 \end{pmatrix}$$
We now wish to compute the entanglement index for each attribute, measuring
how much an attribute $b$ contributes to the overall cohesion of an entity group. This
notion of cohesion is inspired by Burt and Schøtt's work on relation content in
multiple networks [9].
Denote by $\lambda$ the maximum value among entanglement indices $\lambda_b$ of attributes
$b \in B$. In other words, the entanglement index of attribute $b$ is a fraction of $\lambda$,
namely $\lambda_b = \gamma_b \cdot \lambda$ with $\gamma_b \in [0, 1]$. The entanglement value of an attribute $b$
is reinforced through interactions with other highly entangled attributes. Having a
probabilistic interpretation of the matrix entries $c_{b,b'}$ in mind, we can thus postulate
the following equation which defines the values $\gamma_b$:
$$\gamma_b \cdot \lambda = \sum_{b' \in B} c_{b',b} \, \gamma_{b'} \qquad (1)$$
The vector $\gamma = (\gamma_b)_{b \in B}$ collecting values for all attributes $b$ thus forms a right
eigenvector of the transposed matrix $C_B^{T}$, as Eq. (1) gives rise to the matrix equation
$\gamma \cdot \lambda = C_B^{T} \cdot \gamma$. The maximum entanglement index $\lambda$ thus equals the maximum
eigenvalue of matrix $C_B^{T}$.
The actual entanglement index values $\lambda_b$ are of lesser interest; we are actually
interested in the relative $\gamma_b$ values. Furthermore, we shall see how the entanglement
vector and eigenvalue can be translated into network measures to help understand
entanglement in a group of entities. The entanglement indices for our example's
attributes are:
$$\gamma = (0.63, 0.63, 0.43, 0.12)$$
Notice that two indices are equal, and correspond to keywords C and E.
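One simple way to obtain the leading eigenvalue and eigenvector of the transposed matrix is power iteration. The sketch below is an illustration only, not the authors' implementation: the function name `entanglement`, the names `lam` and `gamma`, the unit Euclidean normalization and the stopping rule are choices made for this example. It assumes the matrix is non-negative and irreducible with a positive diagonal, which is sufficient for the iteration to converge.

```python
import math


def entanglement(c, iterations=1000, tol=1e-12):
    """Approximate the leading eigenpair of the transpose of C_B.

    Plain power iteration; returns (lam, gamma) where gamma has unit
    Euclidean norm and non-negative entries.
    """
    size = len(c)
    gamma = [1.0 / math.sqrt(size)] * size
    lam = 0.0
    for _ in range(iterations):
        # (C^T gamma)_b = sum over b' of c[b'][b] * gamma[b']
        image = [sum(c[b2][b] * gamma[b2] for b2 in range(size))
                 for b in range(size)]
        new_lam = math.sqrt(sum(x * x for x in image))
        new_gamma = [x / new_lam for x in image]
        converged = max(abs(x - y) for x, y in zip(new_gamma, gamma)) < tol
        gamma, lam = new_gamma, new_lam
        if converged:
            break
    return lam, gamma


# Hypothetical case where every entry of C_B equals 1 and |B| = 3.
lam, gamma = entanglement([[1.0, 1.0, 1.0]] * 3)
```

For the all-ones matrix the iteration stops immediately with `lam` equal to 3 (up to rounding) and identical `gamma` entries. Only the relative sizes of the `gamma` entries matter, so any other normalization would serve equally well.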
2.3 Homogeneity and Intensity
This section introduces entanglement intensity $I$ and entanglement homogeneity
$H$ as global network measures. The topology of the attribute interaction network
$G_B = (B, E_B)$ provides useful information about how attributes contribute to the
overall cohesion among entities of a group. The focus here is on interactions among
attributes, and aims to reveal how cohesive the group of entities is, considering this
set of attributes.
The archetype of an optimally cohesive entity group is when all entities have the
exact same associated attributes. In that case, the graph $G_B = (B, E_B)$ corresponds
to a clique. As a consequence, all matrix entries $n_{b,b'}$ coincide, so all entries
in matrix $C_B$ equal 1. The maximum eigenvalue of $C_B$ then equals $\lambda = |B|$, and all
$\gamma_b$ coincide. That is, all attributes indeed contribute, and they all contribute equally
to the overall entity group cohesion. The Perron-Frobenius theory of nonnegative
matrices [12, Chap. 2] further shows that $\lambda = |B|$ is the maximum possible value
for an eigenvalue of a non-negative matrix with entries in [0, 1].
The Perron-Frobenius theorem holds for irreducible matrices, that is when the graph $G_B$ is
connected. Hence, the connected components in $G_B = (B, E_B)$ must be inspected
independently. When the matrix $C_B$ is irreducible, the theory of non-negative matrices
tells us that it has a maximal real positive eigenvalue $\lambda \in \mathbb{R}$, and that the
corresponding eigenvector has non-negative real entries [12, Theorem 2.6]. We hereafter
assume $G_B$ is connected so that $C_B$ is irreducible.
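Since each connected component of the attribute network has to be inspected on its own, a first practical step is to split the attributes into components. The following Python sketch does so with a depth-first traversal; the function name `components` and the toy attribute network are hypothetical.

```python
def components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbours = {node: set() for node in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical attribute network made of two disconnected parts.
parts = components(["C", "D", "E", "L"], [("C", "D"), ("E", "L")])
```

Each set in `parts` can then be given its own interaction matrix, restricted to the attributes of that component, before any eigenvector is computed.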
Inspired by the clique archetype of an optimally cohesive entity group, we
wish to measure the entanglement at the entity group level. We already know that the
eigenvalue $\lambda$ is bounded above by $|B|$, so the ratio $I = \frac{\lambda}{|B|} \in [0, 1]$ measures how
intensely interactions take place within the entity group. This ratio thus provides a
measure for entanglement intensity $I$ among all entities with respect to attributes in
$B$. For our previous example, $I = 0.31$, denoting a low interaction across attributes.
We also know that the clique situation with equal $c_{b,b'}$ matrix entries leads to an
eigenvector $\gamma$ with identical entries. This eigenvector thus spans the diagonal space
generated by the diagonal vector $1_B = (1, 1, \ldots, 1)$. This motivates the definition of
a second measure providing information about how homogeneously entanglement
distributes among attributes. We may indeed compute the cosine similarity
$$H = \frac{\langle 1_B, \gamma \rangle}{\|1_B\| \, \|\gamma\|} \in [0, 1]$$
to get an idea of how close the entity group is to being optimally
cohesive. We will refer to this value as entanglement homogeneity $H$. For our
previous example, $H = 0.91$, denoting a relatively homogeneous but not optimal
distribution of entanglement indices.
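Both measures are short to compute once the eigenvalue and the entanglement vector are known. The sketch below is a minimal illustration in Python; the function names are hypothetical, and the eigenvalue and vector are passed in as plain numbers rather than computed.

```python
import math


def intensity(lam, n_attributes):
    """Entanglement intensity: leading eigenvalue divided by |B|."""
    return lam / n_attributes


def homogeneity(gamma):
    """Entanglement homogeneity: cosine similarity between gamma and (1, ..., 1)."""
    dot = sum(gamma)
    norm_ones = math.sqrt(len(gamma))
    norm_gamma = math.sqrt(sum(x * x for x in gamma))
    return dot / (norm_ones * norm_gamma)


# Clique archetype with |B| = 4: eigenvalue 4 and identical vector entries.
I_clique = intensity(4.0, 4)
H_clique = homogeneity([0.5, 0.5, 0.5, 0.5])

# Entanglement vector reported for the running example.
H_example = homogeneity([0.63, 0.63, 0.43, 0.12])
```

The clique archetype yields 1 for both measures, and the vector of the running example gives a homogeneity of about 0.91, in line with the value quoted above. Because cosine similarity is scale invariant, the normalization chosen for the vector does not affect the result.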
A thorough study of the entanglement indices, and the homogeneity and intensity
network indices is out of the scope of this paper (see [35]). Other measures, including
Shannon entropy [38] and Guimerà's participation coefficient [18], offer interesting
alternatives to cosine similarity.
2.4 Weighted Interactions
In real-world networks, relationships across entities may not always be considered
as equal, and we often need to utilize weights associated with edges. These weights
might model the intensity of interactions between members of a group, or intensity
of a flow between two entities, for example.
We now wish to consider a weighted entity interaction network $G_A = (V_A, E_A)$.
That is, $G_A$ is equipped with edge weights $w : E_A \to \mathbb{R}^+$ (where $\mathbb{R}^+$ denotes the
set of reals $r \geq 0$), denoting the weight of an edge $e$ as $w_e$. We extend the map
$w$ to sets and write $w(F) = \sum_{e \in F} w_e$ for any subset $F \subseteq E_A$.
Let us also consider a map $\beta : E_A \to 2^B$ where $\beta(e) \subseteq B$ is the set of all the
different attributes $b \in B$ that are associated with edge $e \in E_A$. Whenever $b \in \beta(e)$,
it means that the edge $e$ bears attribute $b$. Conversely, $\beta^{-1}(b) \subseteq E_A$ is the set of
edges bearing attribute $b$, so whenever $e \in \beta^{-1}(b)$, it means that the edge $e$ bears
attribute $b$.
The quantities $n_{b,b'}$ and $c_{b,b'}$ may be generalized to a weighted entity interaction
network by setting:
$$n_{b,b} = w(\beta^{-1}(b)) = \sum_{e \in \beta^{-1}(b)} w_e \qquad (2)$$
$$n_{b,b'} = w\big(\beta^{-1}(b) \cap \beta^{-1}(b')\big) \qquad (3)$$
$$c_{b,b'} = \frac{n_{b,b'}}{n_{b',b'}} \qquad (4)$$
That is, $n_{b,b}$ equals the sum of weights of edges $e \in E_A$ bearing attribute $b \in B$
and $n_{b,b'}$ equals the sum of weights of edges bearing both attributes $b$ and $b'$.
Because we need to preserve the probabilistic interpretation of $c_b$ and $c_{b,b'}$ values,
we further set:
$$c_b = \frac{n_{b,b}}{w(E_A)} \qquad (5)$$
As a consequence, Eq. (5) may be interpreted as the probability that an edge bears
attribute $b$ and Eq. (4) may be interpreted as the conditional probability that an edge
carries $b$ knowing that it already bears $b'$. Observe that considering equal weights
$w_e = 1$ for all edges $e \in E_A$ coincides with the non-weighted version introduced in
the previous section. Using the newly defined quantities $c_{b,b'}$, we may still define the
entanglement index through matrix equation (Eq. (1)).
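The weighted quantities of Eqs. (2)-(5) can be sketched as follows. The representation of an edge as a (weight, attribute set) pair, the function name `weighted_matrices` and the toy data are assumptions made for this illustration.

```python
def weighted_matrices(weighted_edges, attributes):
    """Weighted co-occurrence sums and ratios following Eqs. (2)-(5).

    weighted_edges: list of (weight, attribute set) pairs, one per edge of E_A.
    attributes: ordered list fixing the row and column order.
    """
    total = sum(weight for weight, _ in weighted_edges)  # w(E_A)
    size = len(attributes)
    # n[i][j] sums the weights of the edges bearing both attributes i and j.
    n = [[sum(weight for weight, labels in weighted_edges
              if b in labels and b2 in labels)
          for b2 in attributes] for b in attributes]
    c = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                c[i][i] = n[i][i] / total    # Eq. (5)
            elif n[j][j] > 0:
                c[i][j] = n[i][j] / n[j][j]  # Eq. (4)
    return n, c


# Hypothetical weighted edges: (weight, attributes borne by the edge).
edges = [(2.0, {"X", "Y"}), (1.0, {"X"}), (1.0, {"Y", "Z"})]
n_w, c_w = weighted_matrices(edges, ["X", "Y", "Z"])

# With unit weights the sums reduce to plain edge counts.
unit_edges = [(1, labels) for _, labels in edges]
n_1, c_1 = weighted_matrices(unit_edges, ["X", "Y", "Z"])
```

With unit weights, `n_1` contains the plain edge counts of the unweighted definition, which mirrors the observation that equal weights recover the non-weighted version.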
Note that, unless we filter out edges using a threshold on weights, the shape of
the attribute interaction network remains the same in both situations, weighted and
non-weighted.
3 Case Studies
The case studies we describe in this section aim at showing how the entanglement
indices, and the homogeneity and intensity indices of networks help users explore
social networks and reason about the homophily content. Navigating the network
and getting feedback about these indices, users can question the structure of the
space that binds entities together. The examples are designed to highlight different
aspects of the exploration, each time underlining how the indices contribute to better
understand the group structure of the homophily network. As the examples will show,
the entanglement methodology was embedded in a visual analytics environment
providing sound interactions to help users flexibly select subgroups. While users get
immediate visual feedback about the entanglement values at play, the environment
also allows them to explore the networks, enquire about homogeneity by easily
hopping between the entity and attribute networks.
Roughly speaking, the knowledge users gain after applying a grouping proce-
dure (clustering, community detection) is that a group of entities share a list of
attributes. This is where the entanglement index enters the scene. What does a list
of attributes really mean? Do all entities share all attributes? Do entities more or
less split between attributes? What particular attribute(s) make(s) the split explicit?
In other words, users must be able to elucidate to what extent, and possibly how/why,
the group of entities form a more or less cohesive unit.
Our first use case focuses on the IMDB network [40] gathering movie directors
linked through movie actors (they have directed). Our second use case focuses on an
author/keyword network extracted from the InfoVis 2004 Contest [23]. Our third use
case introduces a user/topic network from a study of the Edgeryders community [39].
All use cases illustrate how the entanglement index, and network homogeneity and
intensity can be used in a visual social network analytics context.
3.1 IMDB
This first use case is built from the Internet Movie DataBase, a widely used
dataset [40]. Auber et al. [2] visualized a small world subset of the IMDB
co-acting graph. Starting from a small set of star movie actors, we have extracted
the corresponding movie directors to form a bipartite network where movie directors
connect to movie actors they have directed. Applying our methodology we compute
(i) a movie director network (entities), where two directors connect when the set of
movie actors they have directed (attributes) share at least two actors, together with
(ii) the corresponding movie actor interaction network. The data may thus be used
to find cohesive subgroups of movie directors, those whose artistic signature rely on
similar movie casts.
This first example gathers 15 actors and 16 directors (see Fig. 4). A low inten-
sity and medium homogeneity, together with a loosely connected actor interaction
network topology suggest that actors and directors roughly split into two commu-
nities. The director network has medium homogeneity that corresponds to a quite
balanced distribution of actors among them. Homogeneity is not optimal: the direc-
tors did not individually direct each of these actors although, as a group, they did
direct all of these actors. The low values of the network level measures readily indi-
cate the need to dig further into the network and try to nuance the cohesion of
this group. Roughly speaking, low intensity follows from the fact that most directors
have directed only a small number of actors relatively to the whole set.
As can be seen from Fig. 4 (bottom), the two communities of actors are connected
through Robert Duvall, and the two communities of directors are connected through
Sidney Lumet. Apart from Robert Duvall, the bottom right community of actors is
formed around Marlon Brando, Al Pacino, Jeremy Irons, Jack Nicholson, etc. The
top left community of actors is formed around Sharon Stone, Harvey Keitel, Samuel
L. Jackson, Leonardo DiCaprio, Meryl Streep, etc. Clearly, there is a generation
gap between those two communities of actors with Robert Duvall filling the gap, just
as Sidney Lumet does in the director network.
The community of actors located in the top left part of the panel corresponds to a
different group of directors (connecting to the previous group through Sidney Lumet).
It gathers Spike Lee, Jim Jarmusch, Martin Scorsese, Woody Allen and others. This
community has similar intensity but higher homogeneity when compared to the
overall network. This means these actors have equal influence within this group and
better capture altogether the artistic signature of these directors as a group.
The upper left subgroup in the director network (see Fig. 5) actually divides into
three overlapping cliques. Two cliques reach maximal homogeneity and intensity
(the exact same actors have all played under their direction). The third clique (Bruce
Beresford, Jim Jarmusch, Barry Levinson, and Sidney Lumet), selected in the top
panel of Fig. 5, focuses on Ellen Barkin and Sharon Stone. It has lower homogeneity
and intensity indices: they don't mix that well with the other actors.
This use case thus underlines the fact that although a group involves a well
identified and distinct set of attributes (movie actors), the cohesion of the considered
group may rely only on a subset of these attributes. Additionally, group cohesion
must not solely rely on the topology of the projected single-type network obtained
from the original bipartite network.
3.2 Hopping Between the Entity and Attribute Networks

The previous example readily shows how the attributes' entanglement indices, and
the homogeneity and intensity measures may be used to inspect homophily networks
and assess cohesion in subgroups of entities. The synchronized dual view we use
combines two distinct but complementary networks: the network of entities $G_A$
and the interaction network of attributes $G_B$.
Finding the correspondence from a set of entities selected from $G_A$ to attributes
in $B$ is straightforward, as it suffices to select the desired subset of entities: we then
recompute a new matrix $C_B$ based on the induced subgraph of $G_A$. Observe however
that the synchronization is asymmetric. Indeed, retrieving entities of type $A$ from a
set of attributes in $B$ is a different matter. Two distinct questions may be asked when
querying a subset of attributes $B' \subseteq B$:
• Which entities $a \in A$ bear at least one attribute $b \in B'$?
• Which entities $a \in A$ bear all attributes $b \in B'$?
Moreover, what relationships take place between the retrieved entities?
Fig. 4 IMDB: directors appear on top; the actors interaction network is displayed at the bottom.
Selecting a group of directors highlights the corresponding actors, with node size mapped to their
entanglement index. This group of directors shows low homogeneity and intensity. We can clearly
see that the distribution of actors is unbalanced, partly because Sharon Stone plays by far a central
role in the interactions between directors: the directors all have, at some point, directed her
Fig. 5 A group of directors (top) and the corresponding actors they co-directed (bottom,
highlighted) with node size mapped to their entanglement index. This clique of 4 directors shows higher
homogeneity and intensity than the selected group on Fig. 4
Interestingly, these questions are placed in Lee et al.'s taxonomy [26] half-way
between topology-based tasks on adjacency, and attribute-based tasks on links.
The second question often helps to narrow down results from the first question.
Given these questions, it is then straightforward to propose the two corresponding
Boolean operators:

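$$\mathrm{OR} : V_B \to V_A \quad \text{with} \quad B' \mapsto \mathrm{OR}(B') = \bigcup_{b \in B'} \pi^{-1}(b) \subseteq A,$$

$$\mathrm{AND} : V_B \to V_A \quad \text{with} \quad B' \mapsto \mathrm{AND}(B') = \bigcap_{b \in B'} \pi^{-1}(b) \subseteq A,$$

where $B' \subseteq B$ and $\pi^{-1}(b)$ denotes the set of entities bearing attribute $b$. Observe
that the induced subgraph in $G_A$ is not necessarily connected.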
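As an illustration of the two operators, the following is a minimal Python sketch, not the authors' implementation. The dictionary `bearers`, the function names `query_or` and `query_and`, and the toy data are all hypothetical; returning the empty set for an empty selection is an arbitrary convention chosen here.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy data: attribute -> set of entities bearing it.
bearers: Dict[str, Set[str]] = {
    "b1": {"a1", "a2"},
    "b2": {"a2", "a3"},
    "b3": {"a2"},
}


def query_or(selected: Iterable[str]) -> Set[str]:
    """Entities bearing at least one of the selected attributes (union)."""
    result: Set[str] = set()
    for b in selected:
        result |= bearers.get(b, set())
    return result


def query_and(selected: Iterable[str]) -> Set[str]:
    """Entities bearing all of the selected attributes (intersection)."""
    sets = [bearers.get(b, set()) for b in selected]
    if not sets:
        # Convention of this sketch: an empty selection retrieves nothing.
        return set()
    return set.intersection(*sets)


if __name__ == "__main__":
    print(sorted(query_or(["b1", "b2"])))
    print(sorted(query_and(["b1", "b2"])))
```

The AND result is always contained in the OR result, which is why the second question can be used to narrow down the answers to the first.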
Typically, when using a node-link view of these networks, the selection of a
set of entities should automatically trigger the selection of the relevant attributes
and compute the corresponding entanglement, homogeneity and intensity values.
This is illustrated in Fig. 5, where a set of movie directors has been selected (top
panel). Movie actors that played under their direction, here seen as attributes of
movie directors, are highlighted (bottom panel). The corresponding homogeneity and
intensity, restricted to these four selected directors, are displayed as a background of
the selection lasso, while the actual values are reported in a side panel. The size of
movie actors' nodes corresponds to their entanglement index: a larger node indicates
a movie actor weighs more in bringing these movie directors together as a group.
Quite naturally, results of a query in one network can be used to feed a new query.
Typically, after the application of the AND operator to identify a subset of entities
in A sharing all the selected attributes, the query is expanded to see what other
attributes are at play. The forthcoming use cases provide examples (see Fig. 12, for
instance).
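The query expansion described above can be sketched as follows. This is only an illustration under assumptions: the mapping `attributes_of`, the function names and the toy data are hypothetical and do not come from the system described in this chapter.

```python
from typing import Dict, Set

# Hypothetical toy data: entity -> set of attributes it bears.
attributes_of: Dict[str, Set[str]] = {
    "a1": {"b1", "b2"},
    "a2": {"b1", "b2", "b3"},
    "a3": {"b2", "b4"},
}


def entities_with_all(selected: Set[str]) -> Set[str]:
    """AND-style query: entities bearing every selected attribute."""
    return {a for a, attrs in attributes_of.items() if selected <= attrs}


def leapfrog(selected: Set[str]) -> Set[str]:
    """Attributes borne by the retrieved entities beyond the initial selection."""
    reached: Set[str] = set()
    for a in entities_with_all(selected):
        reached |= attributes_of[a]
    return reached - selected


if __name__ == "__main__":
    print(sorted(entities_with_all({"b1", "b2"})))
    print(sorted(leapfrog({"b1", "b2"})))
```

Here the result of the first query (a set of entities) becomes the input of the second one, which returns the additional attributes at play for those entities.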
As a matter of fact, the proposed mode of interaction falls under the Selection tasks
of Yi's taxonomy [41]. Incidentally, its flexibility supports Buja's Posing queries
task [8]. The proposed environment also supports Making comparisons, a central
task in all data analysis task taxonomies.
3.3 InfoVis 2004 Contest
Our second example concerns data of a different nature, where keywords (attributes)
link to authors (entities), showing that the notion of entanglement can actually apply
to a wide variety of application domains.
We selected a subset of the InfoVis 2004 Contest dataset gathering papers
published at the IEEE InfoVis symposium over the period 1994–2004 [23]. The
data we consider are authors indexed by keywords gathered from papers they
published. We thus compute a bipartite graph where authors link to keywords. To some
extent, with respect to Borgatti's taxonomy of relations [7], this network could be
considered as an interaction network since co-authorship indeed involves direct
contact with collaborators.
When we consider authors and keywords, groups may form because authors are
socially very close (working at the same institution or having graduated from the
same university) or just formed an opportunistic association around trendy topics.
That is, co-publication is after all a social activity. We took this aspect into consideration
by making sure that authors were connected through a keyword only when they indeed
had co-published a paper on that topic, not just because they both had published a
paper on that topic.
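The linking rule stated above (two authors are connected through a keyword only if they co-authored a paper carrying that keyword) can be sketched as follows. The record layout, the author names and the function name `keyword_edges` are hypothetical and only serve the illustration; the actual contest data format is not described in this section.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy records: (authors of a paper, keywords of that paper).
papers: List[Tuple[Set[str], Set[str]]] = [
    ({"Ann", "Bob"}, {"treemap"}),
    ({"Cid"}, {"treemap"}),
    ({"Ann", "Cid"}, {"graph drawing"}),
]


def keyword_edges(
    records: List[Tuple[Set[str], Set[str]]]
) -> Dict[FrozenSet[str], Set[str]]:
    """Map each co-author pair to the keywords of the papers they wrote together."""
    edges: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for authors, keywords in records:
        for u, v in combinations(sorted(authors), 2):
            edges[frozenset((u, v))] |= keywords
    return dict(edges)


edges = keyword_edges(papers)

if __name__ == "__main__":
    for pair, kws in edges.items():
        print(sorted(pair), sorted(kws))
```

In this toy example Ann and Cid have both published on "treemap", but never together, so that keyword does not label their edge; only "graph drawing" does.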
We show how our approach helps to solve two tasks of the InfoVis 2004 Contest:
• Where does a particular author/researcher fit within the research areas?
• What, if any, are the relationships between two or more or all researchers?
The author-keyword bipartite graph gives rise to a keyword interaction network
$G_B$ and an author social network $G_A$. Note that co-authorship relationships make
this network a natural multiplex network, and authors that share the same keywords
can be disconnected. The full social network $G_A$ contains about 1,000 authors and
breaks into several connected components. We will focus on the component led
by Woodruff, Olston and Stonebraker (see [23, leftmost part of Fig. 4]) gathering 16
authors (see Fig. 6, top).
The answer to the first question is straightforward. When a single author is selected,
the associated keywords are pushed to the foreground in the keyword network,
while remaining positioned in the context of neighboring topics. The social network
displays the co-authors of any selected author.
The whole network can be similarly inspected author by author. Although this is
useful because it provides fine-grained information on the network, it is lengthy and
tiresome and cannot reasonably be performed on larger networks. This brings us to
the second task, which requires a more elaborate exploration strategy. In our case, we may
take advantage of the apparent community structure of the social network. Conversely,
we may select a subset of keywords and look at authors who have published on these
topics to see how homogeneous a community they form, for instance.
The topology of the author network (Fig. 6, top) clearly shows three authors
as central actors (A. Woodruff, M. Stonebraker and A. Aiken) at the intersection
of two different cliques. Their associated keywords form a large clique covering a
large part of the keyword network (Fig. 6, bottom). The entanglement indices (node
sizes) vary widely among keywords, which explains why homogeneity is low and
suggests that each of these three authors has her/his own set of topics.
Selecting the authors that are part of the top clique in the social network (Paxson,
Wisnovsky, ...), except those central actors, leaves us with a subset of authors with
optimal intensity and homogeneity: they all co-published on the exact same topics.
The same is true if we select the authors that are part of the bottom clique (Olston,
Spalding, ...), again excluding the central authors.
We may also select two marginal authors sitting on the left side of the social
network (Baldonado and Kuchinsky) and observe that they link to keywords located
outside the keyword subsets of the Woodruff clique. Strikingly enough, none of these
sub-communities seems to address the topics portals and data visualization located at
the bottom left of the keyword network. Grasping these two keywords, we find that
they solely concern Woodruff and Olston. Leapfrogging the selection to Woodruff
and Olston, we then see the additional topics these two authors have in common.
Observe that, logically, these topics are marginally positioned with respect to the
main clique (Fig. 7, top).
Fig. 6 The InfoVis 2004 Contest data gives rise to a keyword interaction network (bottom) coupled
with an author social network (top). The three selected authors hold a central position in the social
network (top). Their co-publications cover a wide spectrum of topics, as shown by the clique of keywords
in the bottom image. Entanglement measures, although good, are however not optimal: they did
not pairwise co-publish on all these topics. We may indeed suspect each of them to have distinct
co-authors in the network
This second use case pointed at fully cohesive subgroups where authors have
co-published papers on the exact same topics. This also suggests that the analysis
may be conducted either from the actor (author) network or the attribute (keyword)
network. Going back and forth between these two perspectives seems a fruitful
strategy to get the most out of the entanglement index and the dual $G_A$–$G_B$
representation.
3.4 Comparative Results from the InfoVis 2004 Contest
A full comparison with the results of the InfoVis 2004 Contest would require an
extended study of the whole dataset. Many of the presented results emphasized
trends over the 10-year period observed, which is why here we only focused on
a smaller excerpt from the results of [23]. In our use case, instead of presenting
quantitative results over the different authors, we have presented specificities across
authors' relationships.
We also applied the widely used Louvain clustering algorithm [4] to the excerpt,
which returned three communities (see Fig. 8). The first community groups Kuchinsky,
Landay, Wang Baldonado and Woodruff, and clearly presents two disconnected
components in the attribute interaction graph, suggesting two sub-communities
within. The second community groups Allen, Chen, Paxson, Su, Taylor and
Wisnovsky, with I = 0.82 and H = 0.91, suggesting unbalanced collaborations as
we discussed previously. The third community groups Chu, Ercegovac, Lin, Olston,
Spalding and Stonebraker, with optimal values I = 1 and H = 1, confirming
the cohesion of this community. Finally, even if Louvain has returned fairly cohesive
communities, the entanglement analysis suggests digging for more specific interactions,
particularly in the case of disconnected components across attribute relationships.
Comparing entanglement measures with known measures can also be challenging.
Since they are computed for a multiplex network, they do not really correspond to
either traditional network measures or bipartite network measures. We will assume that we
have two separate networks: the entity interaction network and the attribute interaction network.
Hence, we can only compare entanglement intensity (I = 0.33) and homogeneity
(H = 0.72) with global entity interaction network measures such as density
(d = 0.48) and average clustering coefficient (cc = 0.91). A proper evaluation
would compare those measures over a large number of different networks with varied
characteristics. More interestingly, we can compare the entanglement indices with
node measures on the attribute interaction network as in Fig. 9, and confirm the
differences among these statistics.
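For reference, the two global measures mentioned above can be computed as in the following minimal sketch. The adjacency map `adj` is a hypothetical toy graph, not the author network of this use case, and the convention that nodes with fewer than two neighbors contribute a clustering coefficient of 0 is an assumption of this sketch.

```python
from itertools import combinations
from typing import Dict, Set

# Hypothetical toy graph as an adjacency map (undirected, no self-loops).
adj: Dict[str, Set[str]] = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def density(graph: Dict[str, Set[str]]) -> float:
    """Ratio of existing edges to possible edges: 2m / (n (n - 1))."""
    n = len(graph)
    m = sum(len(nb) for nb in graph.values()) / 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def local_clustering(graph: Dict[str, Set[str]], node: str) -> float:
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    nb = graph[node]
    k = len(nb)
    if k < 2:
        return 0.0  # convention of this sketch
    links = sum(1 for u, v in combinations(nb, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))


def average_clustering(graph: Dict[str, Set[str]]) -> float:
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)


if __name__ == "__main__":
    print(density(adj), average_clustering(adj))
```

Both quantities depend only on the topology of a single-type network, which is why they cannot account for how attributes are shared among edges.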
Fig. 7 Browsing around obvious sub-communities of authors, the keywords portals and data
visualization never pop up. Directly selecting them in the keyword network brings two co-authors
up front: Woodruff and Olston (top). Selecting these authors shows their common topics of interest
to be marginally positioned with respect to the main clique (bottom)
Fig. 8 Top: three communities identified by the Louvain community detection algorithm. Bottom: the
disconnected attribute interaction network corresponding to the community in orange (Kuchinsky,
Landay, Wang Baldonado and Woodruff), suggesting that two sub-communities correspond to this
group
Fig. 9 Comparisons of the entanglement indices with traditional measures on the attribute
interaction network; for a better comparison, the different values have been normalized. Top left: betweenness
centrality. Top right: degree. Bottom left: PageRank. Bottom right: clustering coefficient. Although no clear
correlation can be observed on this excerpt, the measures clearly display many differences
Although the above results do not qualify as a full scale quantitative evaluation of
the results of the entanglement analysis, they illustrate how the entanglement index,
homogeneity, and intensity, stand out from traditional network measures (Fig. 9).
3.5 Edgeryders
This last use case presents a situation with a relevant use of our weighted model, and
also brings forward how we can take advantage of the AND and OR operators.
We study here the Edgeryders community [39]. The data represents users
participating in discussion threads on various topics. Each topic corresponds to a
participation campaign led by the Edgeryders leaders; campaigns took place one
after the other. The topic 0 (Undefined) has been used for preliminary or out-of-scope
discussions. During each campaign (topics 1–9), the Edgeryders leaders designed
and implemented different policies to engage users in participating in the debate.
Within the network, opinion leaders accordingly promote participation in the
topics. Participation in a topic is weighted for each user in terms of effort, measured as
the length of a text (number of words) produced in one piece of conversation. A topic
never closes, and users can participate in every topic by either starting a new thread
or replying to an existing comment. The network is being used by the Edgeryders
leaders to:
• evaluate the impact of their policy campaigns and especially see whether
participation in given topics triggered interest in other topics;
• evaluate the overall participation of members in exchanging ideas over the forum.
The data, in its original form, describes a multiplex network of users, in which
each edge is one piece of conversation between two users concerning one specific
topic. We have adapted this network to fit our model, where users $u \in A$ are entities
and topics $t \in B$ are attributes. The data gathers 254 users exchanging ideas around
9 topics. Now, each user $u$ produces an effort towards a topic $t$ (measured as the
total number of words written on that topic). We may thus consider weights on edges
$e = \{u, u'\}$ by defining $w(e)$ as the sum of the efforts of both users, $u$ and $u'$, on all
topics. This weight, in a sense, reflects the overall involvement of users $u, u'$ towards
each other. Obviously this should be taken into account when analyzing this social
network. For a group to be cohesive, not only should users have exchanged ideas on
the same topics but they should have put comparable efforts into participating in the
debate. Because users participate in many conversations, we need to consider the effort brought by
individual conversations as weights. Following this model, a user interaction (i.e. an
edge in the actor network) will be weighted by the sum of their mutual efforts.
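One possible reading of this weighting scheme is sketched below. The record layout `(writer, addressee, topic, words)`, the user identifiers and the function names are hypothetical; the sketch assumes that the effort of a pair of users on a topic is the total number of words they addressed to each other on that topic, and that the weight of an edge sums these efforts over all topics.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical records: (writer, addressee, topic, number of words).
pieces: List[Tuple[str, str, int, int]] = [
    ("u1", "u2", 1, 120),
    ("u2", "u1", 1, 80),
    ("u1", "u2", 3, 50),
    ("u3", "u1", 1, 10),
]

Effort = Dict[FrozenSet[str], Dict[int, int]]


def mutual_efforts(records: List[Tuple[str, str, int, int]]) -> Effort:
    """Sum the words exchanged by each pair of users, topic by topic."""
    effort: Effort = defaultdict(lambda: defaultdict(int))
    for writer, addressee, topic, words in records:
        effort[frozenset((writer, addressee))][topic] += words
    return effort


def edge_weight(effort: Effort, u: str, v: str) -> int:
    """Weight of edge {u, v}: mutual effort of u and v summed over all topics."""
    return sum(effort.get(frozenset((u, v)), {}).values())


effort = mutual_efforts(pieces)

if __name__ == "__main__":
    print(dict(effort[frozenset(("u1", "u2"))]))
    print(edge_weight(effort, "u1", "u2"))
```

Keeping the per-topic totals alongside the overall edge weight makes it possible to tell heavy contributions on a topic from light ones, which an unweighted edge cannot express.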
Note that, similarly to the InfoVis 2004 Contest example, we are looking at the
homophily of an interaction network: two users are linked only if they have been
discussing the same topic and have been directly conversing together (which can
be traced by looking at replies).
Starting with the user network as shown in Fig. 10, we can see that opinion leaders
attract a large share of the edges (the 5 most connected nodes account for 26 % of the
edges, while the remaining nodes have an average degree of 3.2). Although showing a few local
denser areas, the user network topology does not present any obvious community
structure. A deeper examination shows that those denser areas are composed of
nodes mostly related to one or two leader nodes. The topic interaction network
is a clique: all topics interact together at some point, which suggests having a closer
look at the entanglement values.
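The kind of degree statistics quoted above can be obtained with a few lines of code. The sketch below is illustrative only: the edge list is a hypothetical toy graph, and "share of the edges" is interpreted here as the fraction of edges incident to at least one of the k highest-degree nodes, which is an assumption about how the 26 % figure was computed.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy edge list (undirected).
edges: List[Tuple[str, str]] = [
    ("h", "a"), ("h", "b"), ("h", "c"), ("a", "b"), ("c", "d"), ("d", "e"),
]


def degree_table(edge_list: List[Tuple[str, str]]) -> Dict[str, int]:
    """Degree of every node appearing in the edge list."""
    deg: Dict[str, int] = {}
    for u, v in edge_list:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def top_k_edge_share(
    edge_list: List[Tuple[str, str]], k: int
) -> Tuple[Set[str], float, float]:
    """Return the k highest-degree nodes, the share of edges touching them,
    and the average degree of the remaining nodes."""
    deg = degree_table(edge_list)
    # Ties are broken alphabetically so that the result is deterministic.
    top = set(sorted(deg, key=lambda n: (-deg[n], n))[:k])
    touched = sum(1 for u, v in edge_list if u in top or v in top)
    rest = [d for n, d in deg.items() if n not in top]
    mean_rest = sum(rest) / len(rest) if rest else 0.0
    return top, touched / len(edge_list), mean_rest


if __name__ == "__main__":
    print(top_k_edge_share(edges, 1))
```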
Fig. 10 The user interaction network (left): node size on users is mapped to their degree; notice that
a few nodes have very high degree (opinion leaders) while other nodes have very low degrees. The
topic interaction network (right): the network forms a clique, meaning that all topics pairwise
interact. The entanglement indices indicate however that topics 1, 2 and 4 concentrate most interactions
while topics 0, 5, and 8 only marginally interact with other topics

The use of weights leads to a better interpretation of the network structure. For
example, without weights we cannot distinguish the case in which two users are
heavily contributing to two topics from the case in which they only lightly contribute.
Using weighted edges, entanglement intensity and homogeneity are respectively
equal to I = 0.14 and H = 0.94. Without weights, intensity shows as high as
0.40 (while homogeneity remains more or less the same), which actually ignores
the heavy participation of some users on multiple subjects. Figure 11 confirms that
the entanglement indices in the weighted and non-weighted situations (and
the ranking of topics according to these indices) are radically different. However, the
overall distribution of indices remains close, and consequently so does the homogeneity,
since it is a cosine measure. The inclusion of weights in the network leads to a more
subtle interpretation of the entanglement measures as it includes the notion of how
much effort has been mutually spent on different topics. Obviously, not considering
weights in this network leads to an incorrect interpretation of the network activity.
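The remark that homogeneity is a cosine measure explains why it barely moves when the magnitude of the indices changes. The sketch below illustrates this property only; it assumes, for the sake of the example, that homogeneity is the cosine between the vector of entanglement indices and the uniform (all-ones) vector, a definition that is not restated in this section. The index values are hypothetical.

```python
import math
from typing import Sequence


def cosine_to_uniform(indices: Sequence[float]) -> float:
    """Cosine between a vector of indices and the all-ones vector."""
    norm = math.sqrt(sum(x * x for x in indices))
    if norm == 0:
        return 0.0
    return sum(indices) / (norm * math.sqrt(len(indices)))


weighted = [0.10, 0.20, 0.15]            # hypothetical index values
rescaled = [4 * x for x in weighted]     # same profile, larger magnitude
skewed = [0.40, 0.02, 0.03]              # hypothetical unbalanced profile

if __name__ == "__main__":
    print(cosine_to_uniform(weighted))
    print(cosine_to_uniform(rescaled))
    print(cosine_to_uniform(skewed))
```

Rescaling all indices by a common factor leaves the cosine unchanged, whereas an unbalanced profile lowers it; a magnitude-sensitive measure such as intensity would react to the rescaling instead.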
We can easily retrieve five leaders (the entity nodes of highest degree) by looking
at the collaborations that concerned all topics (i.e. by selecting all topics with the
AND operator): users 4, 10, 64, 468, and 857. Leapfrogging to this selection
of users (see Fig. 12), we can have a deeper look at their mutual efforts.
Intensity and homogeneity are very high (0.76/0.95, against 0.14/0.94 for the whole
weighted network), which we could expect from opinion leaders. They have worked together
homogeneously on all topics, except for topics 0 (Undefined, which is marginal) and 8
(Resilient, which was a concluding debate). Notice from the topic interaction network
in Fig. 12 that no interaction between these two topics emerged from leaders, most
probably because those topics are indeed marginal.
Fig. 11 The two barcharts above help compare the entanglement indices from the weighted network
(right) and non-weighted network (left). The comparison emphasizes how considering or not the
weights can have a strong impact on reading the relative entanglement indices. As can be seen, all
topics are assigned a different entanglement value (except for the topics with extremal values,
topics 1 and 5). The balance between entanglement indices does not radically change, but the
participation of each topic in the network's cohesion radically differs
Using the same process, we can now answer the Edgeryders leaders' questions.
We may process one topic at a time. Selecting a topic t, we retrieve the subset of
users who have participated in t. We may then identify other topics they have
mutually participated in (which could be related to the corresponding policy campaign).
A variety of facts can be extracted:
• topic 3 and topic 7 clearly dominate the mutual efforts of contributors;
• closer examination reveals strong ties between topics 1 and 2;
• topics 0 and 8 gather a majority of users who have also pairwise co-participated in
other topics;
• users who participated in topic 5 developed similar efforts on all other topics.
The use case we have just presented thus illustrates how weights can be integrated in
our framework to offer a finer interpretation of cohesion and entanglement indices. It
also highlights how the use of the OR and AND operators between the two networks
$G_A$ and $G_B$ can help to narrow reasoning over the network when the topology is not
sufficient to understand its structure.
Fig. 12 A first selection of all topics (left) has highlighted the five most influential users (middle).
Leapfrogging to these users lets us understand how they have been mutually collaborating on the
different topics (right). Note that the first selection, made using the AND operator, returns the lowest
intensity and homogeneity values (0/0) since no pair of users has contributed together to all topics.
This underlines the need to leapfrog the selection since we still have 5 users who have contributed
to all topics. Notice that except for topics 0 and 8 they have all contributed equally. Notice also the
absence of a highlighted edge between topics 0 and 8, indicating that no pair of the selected users has
contributed to both topics together

This paper addressed the issue of assessing cohesion in groups from homophily
networks mixing entities and attributes into a multiplex view of a bipartite
network. Our approach considers splitting the multiplex network into two single-type
networks used in conjunction when analyzing the homophily relations between
entities. To answer this question, we have defined the entanglement, a notion of how
attributes intertwine entities' edges. We have measured entanglement indices on
attributes, together with the homogeneity and intensity indices computed on any
subset of entities.
These attributes can be used to question the cohesion of a group of entities,
where optimal cohesion requires that entities simultaneously involve the exact same
attributes, and maximum intensity occurs when entities cover all available attributes.
A group of lower or unbalanced entanglement indeed requires more careful analysis,
and typically leads to the discovery of subgroups or regions locally showing higher
entanglement. An entanglement-based search of the networks often leads to the
identification of outlier entities that can then be discarded, or on the contrary brought forward
to understand the network activity. A close examination of the attribute interaction
network also helps the identification of core attributes from which entities form a
cohesive unit.
The case studies clearly show the relevance of questioning the attribute
entanglement of entities to potentially confirm the community structure derived from
edge density, for instance. They focused on small examples for the sake of
readability. This limitation is only apparent, as using the interaction network occurs after
entities have been indexed and grouped. Although a query might return hundreds (or
thousands) of entities, we may expect the grouping procedure to form much smaller
groups before closer examination occurs. We also suspect that larger samples gather
larger attribute sets, typically leading to less tangled attribute interactions and less
cohesive entity groups.
Our second case study suggests our approach applies to other types of networks
modeled using a bipartite graph, namely interaction relations. The initial comparative
results encourage us to extend our approach to the study of multivariate networks.
Indeed, the entanglement measurement actually considers a multiplex network
of interacting entities $A$, with attributes $B$ corresponding to families of edges.
Our third use case has brought forward the important nuance of taking into account
weighted entity interactions. We are exploring possibilities to further extend the
ways we can incorporate weights in our model, and then fully embrace the weighted
multiplex model, possibly with the help of De Domenico et al.'s formulation [10]. For
example, entities of type B may not be equal (some may weigh more than others),
and the interaction through the same entity of type B across two different pairs of
entities of type A may weigh differently. These are design choices we suspect may
depend on the nature and/or on the size of the dataset and the questions our users are
seeking answers to.
These structures being rather complex to manipulate, the use cases we have shown
underline the increase in usability when our approach is embedded in a visual and
interactive environment. The interactions we have used enable a quick back-and-
forth search in the data, putting users as close as possible to their own questions on
the original data.
Further studies would cover optimized implementation and performance studies,
with comparative results on a larger number of networks and measures. Further work
also includes examining strategies to automatically identify entity and attribute subsets
with optimal (or maximum) homogeneity and/or intensity, suggesting potential areas
of interest in the network under study. These problems, however, will inevitably bring
us to combinatorial optimization problems, and we may expect to have no choice but
to rely on heuristics to avoid typical algorithmic complexity issues.
Acknowledgments We would like to thank the European project FP7 FET ICT-2011.9.1
Emergence by Design (MD) Grant agreement no: 284625.
References
1. Aral S, Muchnik L, Sundararajan A (2009) Distinguishing influence-based contagion from
homophily-driven diffusion in dynamic networks. Proc Natl Acad Sci 106(51):21544–21549
2. Auber D, Chiricota Y, Jourdan F, Melançon G (2003) Multiscale navigation of small world
networks. In: IEEE symposium on information visualisation. IEEE Computer Science Press,
pp 75–81
3. Bakshy E, Rosenn I, Marlow C, Adamic L (2012) The role of social networks in information
diffusion. In: 21st international conference on world wide web. ACM, pp 519–528
4. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities
in large networks. J Stat Mech: Theory Exp 2008(10):P10008
5. Borgatti SP (2012) Two-mode concepts in social network analysis. In: Meyers RA (ed)
Computational complexity: theory, techniques, and applications. Springer, New York, pp 2912–2924
6. Borgatti SP, Everett MG (1997) Network analysis of 2-mode data. Soc Netw 19(3):243–269
7. Borgatti SP, Mehra A, Brass DJ, Labianca G (2009) Network analysis in the social sciences.
Science 323(5916):892–895
8. Buja A, Cook D, Swayne DF (1996) Interactive high-dimensional data visualization. J Comput
Graph Stat 5(1):78–99
9. Burt R, Scott T (1985) Relation content in multiple networks. Soc Sci Res 14:287–308
10. De Domenico M, Solé-Ribalta A, Cozzo E, Kivelä M, Moreno Y, Porter MA, Gómez S, Arenas
A (2013) Mathematical formulation of multi-layer networks. arXiv preprint arXiv:1307.4977
[physics.soc-ph]
11. Didimo W, Liotta G, Romeo SA (2011) A graph drawing application to web site traffic analysis.
J Graph Algorithms Appl 15(2):229–251
12. Ding J, Zhou A (2009) Nonnegative matrices, positive operators and applications. World
Scientific, Singapore
13. Easley D, Kleinberg J (2010) Networks in their surrounding contexts. In: Networks, crowds, and
markets: reasoning about a highly connected world. Cambridge University Press, Cambridge,
pp 77–106
14. Everett MG, Borgatti SP (1998) Analyzing clique overlap. Connections 21(1):49–61
15. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
16. Fujimoto K, Chou CP, Valente TW (2011) The network autocorrelation model using
two-mode data: affiliation exposure and potential bias in the autocorrelation parameter. Soc Netw
33(3):231–243
17. Guillaume JL, Latapy M (2005) Bipartite graphs as models of complex networks. Lecture
Notes in Computer Science, vol 3405. Springer, pp 127–139
18. Guimera R, Mossa S, Turtschi A, Amaral LAN (2005) The worldwide air transportation
network: anomalous centrality, community structure, and cities' global roles. Proc Natl Acad Sci
USA 102(22):7794–7799
19. InfoVis 2004 Contest. http://www.cs.umd.edu/hcil/iv04contest/
20. Jackson MO (2010) Social and economic networks. Princeton University Press, Princeton
21. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666
22. Kaski S, Nikkila J, Oja M, Venna J, Toronen P, Castren E (2003) Trustworthiness and metrics
in visualizing similarity of gene expression. BMC Bioinform 4(1):48
23. Ke W, Börner K, Viswanath L (2004) Major information visualization authors, papers and
topics in the ACM library. In: IEEE symposium on information visualization 2004. IEEE
24. Kivelä M, Arenas A, Barthelemy M, Gleeson JP, Moreno Y, Porter MA (2013) Multilayer
networks. arXiv preprint arXiv:1309.7233
25. Latapy M, Magnien C, Vecchio ND (2008) Basic notions for the analysis of large two-mode
networks. Soc Netw 30(1):31–48
26. Lee B, Plaisant C, Parr CS, Fekete JD, Henry N (2006) Task taxonomy for graph visualization.
In: Proceedings of the 2006 AVI workshop on beyond time and errors: novel evaluation methods
for information visualization. ACM, pp 1–5
27. Manski CF (1993) Identification of endogenous social effects: the reflection problem. Rev Econ
Stud 60(3):531–542
28. McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social
networks. Annu Rev Sociol 27(1):415–444
29. Neal Z (2013) Identifying statistically significant edges in one-mode projections. Soc Netw
Anal Mining pp 1–10
30. Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167–256
31. Opsahl T (2013) Triadic closure in two-mode networks: redefining the global and local
clustering coefficients. Soc Netw 35(2):159–167
32. Peeters R (2003) The maximum edge biclique problem is NP-complete. Discret Appl Math
131(3):651–654
33. Podolny JM, Baron JN (1997) Resources and relationships: social networks and mobility in
the workplace. Am Sociol Rev 62(5):673–693
34. Renoust B, Melançon G, Viaud ML (2013) Assessing group cohesion in homophily networks.
In: Advances in social network analysis and mining (ASONAM) 2013. ACM/IEEE, Niagara
Falls, Canada, pp 149–155
35. Renoust B, Melançon G, Viaud ML (2013) Measuring group cohesion in document collections.
In: IEEE/WIC/ACM international conference on web intelligence
36. Robins G, Alexander M (2004) Small worlds among interlocking directors: network structure
and distance in bipartite graphs. Comput Math Organ Theory 10(1):69–94
37. Shalizi CR, Thomas AC (2011) Homophily and contagion are generically confounded in
observational social network studies. Sociol Methods Res 40(2):211–239
38. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J
27:379–423, 623–656
39. The EdgeRyders community. http://edgeryders.eu/
40. The internet movie database (IMDB). http://www.imdb.com
41. Yi JS, ah Kang Y, Stasko JT, Jacko JA (2007) Toward a deeper understanding of the role of
interaction in information visualization. IEEE Trans Vis Comput Graph 13(6):1224–1231
42. Zhou T, Ren J, Medo M, Zhang Y (2007) Bipartite network projection and personal
recommendation. Phys Rev E 76(4):046115
An Elite Grouping of Individuals
for Expressing a Core Identity Based
on the Temporal Dynamicity or the Semantic
Richness
Billel Hamadache, Hassina Seridi-Bouchelaghem and Nadir Farah
Abstract New analysis dimensions in social network analysis tend towards more
realistic social graph models, feeding new studies and revealing interesting phenomena. Based
on a dynamic or semantic dimension, more meaningful and informative results can be
harvested. A social network can be dominated by a core region depending on
centralized or decentralized information sharing, the lifetime of social interactions and even
the orientations developed by network actors. This is an underlying social structure
addressed by the question raised in this paper, which aims to strengthen the
significance of a core identity through the dynamic behavior or a semantic character of
collectivities. The temporal dynamic aspect is proposed in priority to be formalized
through a topological dynamic model as an evolutionary process. The aim is to find
a resistant grouping, playing a central role describing a first identity for a core's
infrastructure in time. The semantic aspect is proposed to be a strengthening
element for such identity. In this study we propose that the feeling of belonging issued
topologically from such grouping durability allows us to deduce an implicit
semantic nature. However, the study shows that the diversity of interactions or the interests of
actors in a richer static semantic model will be more explicit to identify a semantic
character of such a region. In this paper, we address an identity of a core structure
significantly expressed through an elite grouping of individuals between the
topological dynamic or the static semantic: internally through the collectivity durability and
the common implicit or explicit semantic character, and externally from the strategic
positioning on the communication flows in time or by approximating semantically
different semantic regions in the network.
Keywords Temporal dynamic networks · Semantic model · Network core · Elite grouping
B. Hamadache · H. Seridi-Bouchelaghem · N. Farah
Laboratory of Electronic Document Management LabGED, Badji Mokhtar Annaba University,
P.O. Box 12, 23000 Annaba, Algeria
e-mail: Hamadache@labged.net
H. Seridi-Bouchelaghem
e-mail: seridi@labged.net
N. Farah
e-mail: farah@labged.net

DOI 10.1007/978-3-319-12188-8_6
1 Introduction
A social network (SN) is a social structure that emanates from interactions among
individuals, organizational structures, physical proximities, etc. Nowadays, an online
social network (OSN) is an extension of socializing phenomena to the web. Unlike the
traditional web, this generation opens new avenues for socializing through the growing
popularity of new information, communication and collaboration technologies. Different
social media facilitate the creation of social relations among people based on
acquaintance, family and associative relations, general or professional interests,
activities, etc. [22]. The aim is not only to create links and bring people together (on
social platforms and applications); the social dimension is also important for enhancing
organizational performance. Examples include social interactions among company employees,
or collaborations within an online learning environment, which can add a social aspect to
learning (collaborative learning) and raise the cognitive level of learners (computer
supported collaborative learning (CSCL) [1, 2]). The emergent online and organizational
social networks are proliferating and attract many researchers from academia, government
and industry [20, 21]. They are a rich data source for social network analysis (SNA) and
mining (SNAM), an intersection point between sociology and computer science, inspired by
data mining and classically based on the theoretical foundations of graph mining. SNs are
intuitively modeled by non-random graphs [18] that preserve particular characteristics
and can include underlying social structures such as possible core structures.
New analysis dimensions are currently required in order to provide more informative
answers about human socialization, given the variety of explicit or implicit social data
[18] (email exchanges between company employees, collaborative learners, recommendation
systems [16], etc.). Accordingly, analytical studies should not be limited to evaluating
the performance of routine techniques on large classical social representations within a
structural and static framework. This paper falls within the scope of new trends that aim
to exploit more informational richness in more realistic SN models. We need to deepen our
understanding of social phenomena and to identify underlying structures (such as a core
region) in a significant way. This will be useful for providing more informative and
meaningful answers that feed business strategies, decisions, etc.
Within different systems, the core notion refers to a tough, solid, central, denser or
inner part that gives the system its existence, its character, etc. In a network, it is
the central part, which has a high influence on the communication flows between the other
nodes. Depending on the degree of centralization or decentralization, a core structure
inside a SN can inherit these structural indicators. We believe that this network region,
composed of a subset of individuals, should present a particular identity. However,
traditional conceptions focus only on densely connected inner parts within classical
graph models. Therefore, we need a renewed conception. Beyond this static and structural
representation, the available information on the behavior of the social entities
interacting in the network, and the semantics involved behind it, should be considered.
Once an actor is involved in the network, it is likely to change its interactions over
time by creating or deleting relations with others. This has a direct impact on its
positioning in the network and equally on its probable affiliation to one or more social
groupings. This is one of the reasons why the overall structure of a social graph is
determined by structures at the local level. The temporal change is in fact driven by
many factors that influence the behavior of the corresponding actor. Such factors may
have semantic origins (a semantic dimension), including the causality of connections, the
positivity towards socialization (influenced by social media tools), relationship types,
interests, etc., for an actor. Accordingly, the temporal dynamic behavior and the
semantics involved for social entities are an informational richness that our
contribution exploits in order to characterize and significantly strengthen a possible
underlying core identity.
A core region composed of a subset of individuals is considered in this paper as a
structure inherited from a grouping of individuals. The internal cohesion of the group is
inherited first. Topologically, a core identity captures a particular dynamic behavior of
a group in time. On the other hand, a common and salient semantic character should be
expressed explicitly by such a region in a static semantic configuration. The study of
such an underlying structure requires meta-models of SNs, in which the dynamic and
semantic aspects are processed separately (dynamic model or semantic model). First, the
temporal dynamic is modeled on a topological map so as to identify the infrastructure of
a core region in time. To this end, a SN sample evolving in time and linking a set of
company employees (Enron Company) is modeled as a development process of groups. The
model links SN imprints (modular configurations) through parameters of composition
stability and centrality of groups over the time steps. By finding a covering path, we
target a durable (resistant) grouping that plays a central role in time and is
encapsulated inside that path. It will be a first structural identity that significantly
characterizes a core grouping during the observation period. On the other side, with the
aim of characterizing a semantically strengthened core identity, we believe that this
particular internal dynamic of such a grouping of individuals can follow implicitly from
a semantic orientation of individuals in this topological model. Here, our attention is
focused on the durability phenomenon of a collectivity and on how it can deeply reflect a
feeling of belonging among the members of its composition. However, in a second step we
adopt a higher abstraction level in another social representation in order to investigate
an explicit semantic character for a collectivity and then for a probable core region. In
this context, complexity reasons require us, for now, to process the explicit semantics
without the temporal dynamic aspects. Therefore, a richer static and semantic model (RDF
graph) of a SN is addressed, based on some ontological conceptions. Our semantic
considerations are based on the expressivity degree, which depends on the networked
environment and on the availability of information. Two case studies are considered
through richer explicit representations modeling static imprints of two different SNs. In
the first case, the semantic information is focused on the relationship types within a
collaborative learning environment. Here, a semantic graph model (RDF graph) is designed.
Thereafter, we propose a mapping approach showing how to exploit the expressivity degree
without increasing the computational cost on this RDF social graph. Practically, our own
experimental prototype is set up in order to perform the mapping and parameterize the
proposed analytical processing. In the second case, the semantic aspect is expressed by
the interests of users, described by tags. The semantics are manifested through the
conception structuring user-tag and tag-tag relations (folksonomies) in [9]. In these two
cases, we target how to deduce a first semantic character for a core structure, inspired
by a semantic detection of groups or a detection of semantic groups. An internal semantic
character can arise from sharing the same relation type or the same interest inside a
collectivity. Moreover, when such a structure is an intermediary point between other
groups having different interests, it can semantically refer to a core region having a
semantic identity. In brief, between the topological temporal dynamic and the static
semantic aspects, this paper seeks a core identity inside a SN that is significantly
acquired by an elite grouping of individuals.
The next section is dedicated to related work on the conceptions of core structure in
SNs, on dynamic and semantic models of SNs and on related aspects. The following parts
then show how a core identity can be significantly captured on richer SN models, between
the structural temporal dynamics and static semantics. This is illustrated by some
experimental results in the fourth section. Finally, the study is critically discussed,
separately for the dynamic and the semantic models. The discussion distinguishes between
an internal and an external identity of a core structure, in which topological dynamic
and static semantic richness may hybridize and cross.
2 Related Work
Depending on many elements, it is very hard to identify the subset of individuals forming
a core region in a SN. Furthermore, such an underlying social structure is still an
informal notion in this networked environment. Starting from some intuitive conceptions,
the core notion has been addressed as a dense and cohesive part, and later within some
related frameworks, but always in a static and structural context. According to one
viewpoint, cohesion is distinguished by a high link density (stronger relations) among a
subset of nodes having a high degree of coreness [8]. As an individual centrality
concept, the degree of coreness has been introduced relative to a centroid. Here, the
positioning and the behavior of the collectivity formed by such individuals are not
considered. The group concept is closer to what is needed for shaping a core
infrastructure. It is in this sense that a core region has also been located as the
intersection zone of overlapping groups ((α, β)-communities) in a static and dense social
representation in [30]. However, when such a structure is considered as a grouping, the
individual strategic positions cannot realistically reflect the efficiency of the
collectivity on the network communication. In fact, the role of the group as a whole can
be more meaningful for a core structure identity. In terms of centrality, this can be
derived from generalization methods that extend the individual centrality measures to
groups (group centrality). However, some group centralities [11] are computed intuitively
on the basis of external individuals, while some notions like the boundary (faster
information sharing inside) cannot be considered without, for example, a modular
configuration of the network.
These are static analytical studies, in which false or misleading information can be
harvested due to an underestimation or overestimation of the cohesion or centrality of
groupings qualified as a core structure. Social interactions change continuously and
consequently generate a natural temporal dynamicity of a SN in the form of a development
process in time. This can be caused by an endogenous dynamic context resulting from the
simultaneous influence between behaviors and relation changes among network actors
[3, 7, 13, 19, 24]. The observed changes can equally be provoked by external events, such
as the change in Twitter (an increase in the number of new accounts) during the 2009
elections in Iran [19]. Accordingly, the individual affiliation, its role and then that
of the group are affected in time (chronological affiliation to groups [23]). In this
paper, the identity of a core grouping can be more significantly expressed on a temporal
dynamic dimension focusing on the evolution of collectivity behavior. This calls for a
study of the network configuration in groups over time and needs to be well formalized in
order to show some interesting phenomena (group durability and development). Different
partitioning variants for dynamic networks are essentially performed by threading a
community discovery over a sequence of network imprints in time [7, 27, 31]. It should be
noted that there are different interpretations of the group or community concept in time
(a latent concept). Even in the literature there is no complete agreement on its
definition. In addition, many related measures, notably modularity (high internal
connectivity in a group versus low external connectivity), have been extended in time.
However, even in recent efforts, a core structure is considered neither in terms of
collectivity behavior nor in terms of its role as a grouping of individuals within the
temporal dynamicity of a SN.
On the other side, social structures are more and more complex and evolve within multiple
contexts. Different relationships, activities, roles, identities across multiple
applications and interests can be developed by a social entity. This is a context in
which heterogeneity is generated (heterogeneous SN) [10]. For example, social tagging is
a phenomenon that results from labeling activities using tags to express interests
(folksonomy, another source of SNs: interest networks). However, analytical studies in
SNA are generally structural and applied to simple non-typed graph representations.
Studies of core structures are no exception. The informational richness in a SN can be
exploited to obtain generally more significant results (a semantization of SNA). It is in
this sense that we target a significant core identity based not only on the temporal
dynamic but also on a semantic dimension. Initially, a semantic SN model is required for
exploiting the expressed richness in order to first give a semantic dimension to a
grouping of individuals. Semantic web technologies are currently seen as well adapted, as
an additional step for improving the quality of SN representations. Depending on the
expressivity degree, social data can be semantically structured using typed graphs:
Resource Description Framework (RDF) graphs. These are descriptive graphs based on
concepts defined as primitives of ontological models. Depending on the availability of
information, the expressivity degree can be increased. For example, primitives of the
FOAF ontology (Friend of a Friend) can describe the user account (social entity) and its
basic relations (foaf:knows). RELATIONSHIP, SIOC and SKOS concepts are more extended and
expressive; they describe, respectively, more specialized relations (rel:worksWith,
rel:friendOf), published contents, and social tagging (tags, and specialization or
generalization relations between tags: skos:narrower, skos:broader). Thereby, analytical
studies can be enriched on richer models by parameterizing statistical and individual
measures (centralities, diameter, geodesics, etc.). On the other hand, it is very
interesting to find the semantic nature of a group, from which the semantic character of
a core grouping can be inspired in this paper. For example, when additional information
such as tags is available, a typed graph (RDF) can semantically model user-user, user-tag
and tag-tag relations (structured folksonomy) [9]. This has been based on several
ontological models used together in [9]. Accordingly, it has been proposed to strengthen
the group connectivity by the same tag shared between its members (labeled community or
interest community) through an iterative approach, SemTagP (semantic tag propagation)
[9]. Moreover, the collectivity spirit has been expressed by the semantic links among
tags [9]. Thus, more and more specialized thematic areas have been identified through
communities labeled by tags representing related topics [9].
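As a purely illustrative sketch of such a typed representation, the three relation families (user-user, user-tag, tag-tag) can be held as subject-predicate-object triples and separated by predicate. This is not the SemTagP approach of [9]; the sample data, the helper names and the predicate ex:hasTag are assumptions introduced only for the example, while foaf:knows and skos:broader are the vocabulary terms named above.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object). "ex:hasTag" is an
# invented predicate used only for this illustration.
TRIPLES = [
    ("alice", "foaf:knows", "bob"),
    ("bob", "foaf:knows", "carol"),
    ("alice", "ex:hasTag", "graph-mining"),
    ("bob", "ex:hasTag", "graph-mining"),
    ("carol", "ex:hasTag", "e-learning"),
    ("graph-mining", "skos:broader", "data-mining"),
]


def split_by_predicate(triples):
    """Group (subject, object) pairs by relation type (the arc label)."""
    by_pred = defaultdict(list)
    for s, p, o in triples:
        by_pred[p].append((s, o))
    return dict(by_pred)


def users_by_tag(triples, tag_predicate="ex:hasTag"):
    """Collect, for each tag, the set of users carrying it."""
    groups = defaultdict(set)
    for s, p, o in triples:
        if p == tag_predicate:
            groups[o].add(s)
    return dict(groups)


typed_graph = split_by_predicate(TRIPLES)
tag_groups = users_by_tag(TRIPLES)
```

Under these assumptions, `tag_groups` gives a naive picture of users who share an interest; it says nothing about connectivity, which a labeled community in the sense of [9] would also require.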
However, semantic processing requires exploiting the richness of RDF graphs, which is
itself a challenge in SNA. It should be noted that the available tools and operators
(SPARQL, the query language for RDF data [10]) are limited when it comes to analyzing RDF
graphs while respecting the analysis requirements and the topological complexity
(centrality measures, community detection, etc.). Even though there are attempts at
extensions (adapting queries to the path notion [9]), the related resolution (the number
of projections on the graph: matching on RDF triples) consumes more computational time.
Therefore, the processing phases (e.g. in the approach cited above) are likely to be more
expensive. Moreover, even though such semantic social representations can enrich static
analytical studies, the dynamic aspect is not considered. Furthermore, while the nature
of a collectivity can be strengthened by a semantic character, this is not yet clear for
a grouping of individuals semantically qualified as a probable core structure inside a
SN. The temporal dynamic behavior must be in the foreground to characterize a core
identity in a topological context, before moving towards a higher abstraction level that
allows a possible semantic strengthening afterwards.
3 A Particular Dynamic Behavior or a Static Semantic Character
of a Collectivity Expressing Significantly a Core Identity
According to whether information sharing tends to be centralized or decentralized, the SN
can be dominated by an inner part qualified as a core region. Intuitively, it refers to a
central part of the network in the form of an underlying structure. It is not obvious how
to affirm the existence of a network core structure, nor how to identify it afterwards
(Fig. 1).
Fig. 1 A possible core identity significantly acquired by a cohesive grouping of
individuals between structural dynamicity and semantic richness
The identity of a core structure is to be presented more significantly through a two-step
methodology (time, semantics) that goes beyond its structural and static conceptions
(Fig. 1). Its internal cohesion is first inspired by the internal nature of the group
concept. The concept of a social group, described by a natural cohesion, may be the best
concept for embodying a cohesive region among a subset of individuals forming a possible
core structure. This refers to a grouping of densely connected social entities, sometimes
including stronger relations in the case of a weighted graph. The temporal or the
semantic information is proposed as the strengthening element of a core identity. In the
first step, the behavior of collectivities in the network is addressed through parameters
of durability and efficiency on the network communication flows in time. This is a
temporal dynamic dimension on which we seek a durable grouping playing a central role,
proposed to express a core infrastructure in time. In the second step, a higher
abstraction level is adopted independently of the dynamic context in order to express a
semantic aspect in the social representation. Thus, the diversity of relationships or the
interests developed by the network actors are at the heart of a possible semantization of
a core structure in this paper. Intuitively, SN representations are known to be explicit
under a classical graph model. But such a model is not rich enough to allow identifying
an underlying social structure like a core structure, particularly when supplementary
informational richness (temporal or semantic information) is available. Accordingly,
meta-models of SNs are required for studying the identity of a core region by addressing
the temporal behavior or the semantic nature of collectivities.
3.1 A Temporal Dynamic Dimension on a Topological Model
Given the temporal dynamicity of a SN evolving in time, the identity of a core structure
can be manifested more significantly. During a network observation period, a cohesive
grouping of individuals expressing certain parameters of durability and an efficient
apparent role is a first significant core character in the network. The SN dynamic
results from local temporal changes in the interactions developed by the network actors,
which affect their positioning and affiliations. Thereby, the dynamic of the collectivity
behavior is more or less influenced. At this level, the pace of change is not uniform,
but it is proposed to capture it through parameters that should be significantly ordered
(Fig. 2).
Fig. 2 Parameters characterizing a core identity within a structural temporal dynamicity
Persistence is established when the network observation period is covered by a stable
composition. When a group preserves its composition of individuals during a time period,
its role in terms of centrality is more realistic, being based on the links of boundary
individuals with the outside. A core region should be the resistance point embodied by a
persistent stable group that retains its whole composition (a subset of linked
individuals) against the temporal dynamicity of the network. Once a stable structure is
conserved, its collective influence in terms of group centrality can be investigated on
the sequence of traces in time. A central role played by a larger stable composition in
time is a first good characterization of the identity of a core structure inside a SN on
a dynamic dimension. This should be supported by an explicitly formalized network model.
3.1.1 A Topological Dynamic Model
During an observation period divided into time points, the network connectivity and its
centralization or decentralization vary. Therefore, different modular configurations of
the network can be obtained in time. The SN imprints in time are thus considered in the
proposed model as a structure of groups at each time step. Hence, a temporal weighted
graph (TWG) is formalized by linking a sequence of these network imprints. This is an
evolutionary process model [12] in which the vertices are the cohesive groups resulting
from the network partitioning at each time point. The arcs of the model are created
locally to link the imprints of two successive time points. Each arc links exclusively
two groups A and B that belong respectively to two successive partitions
(P_{T_i}, P_{T_{i+1}}) and have a non-empty overlap. It is based on a kind of successive
temporal overlap, considered as a grouping of individuals that locally retains its
composition between two successive time points. Thereby, an arc is weighted by parameters
related to this local stable composition: its size and its centrality. Here, the
centrality of an overlap (a subgroup) is determined through the group centrality (GC),
chosen among the degree, closeness or betweenness centralities of this grouping of
individuals (Fig. 3).
Fig. 3 Layered architecture encapsulating deeply a characterized core identity inside a
modeled evolutionary process of a SN
\[
W(A, B) = |A \cap B| \cdot \frac{GC_{T_i}(A \cap B) + GC_{T_j}(A \cap B)}{2},
\quad j = i + 1, \; i = 1, \ldots, t - 1 \qquad (1)
\]
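The arc construction can be sketched as follows, reading Eq. (1) as the overlap size multiplied by the mean of the group centralities of the overlap at the two successive time points. The sketch assumes the degree variant of group centrality (the share of outside nodes adjacent to at least one member); the function names, the toy snapshots and the toy partitions are hypothetical and are not taken from the Enron data.

```python
from itertools import product


def adjacency(nodes, edges):
    """Undirected adjacency dict {node: set(neighbours)}."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def group_degree_centrality(adj, group):
    """Share of non-group nodes adjacent to at least one group member."""
    group = set(group)
    outside = set(adj) - group
    if not outside:
        return 0.0
    reached = {v for u in group for v in adj.get(u, ()) if v not in group}
    return len(reached) / len(outside)


def twg_arcs(snapshots, partitions):
    """Weighted arcs between overlapping groups of successive time points.

    snapshots[i]  : adjacency dict of the network at time point i
    partitions[i] : list of groups (sets of nodes) at time point i
    Returns {((i, a), (i + 1, b)): weight}; empty overlaps yield no arc.
    """
    arcs = {}
    for i in range(len(partitions) - 1):
        pairs = product(enumerate(partitions[i]), enumerate(partitions[i + 1]))
        for (a, group_a), (b, group_b) in pairs:
            overlap = group_a & group_b
            if not overlap:
                continue
            gc_i = group_degree_centrality(snapshots[i], overlap)
            gc_j = group_degree_centrality(snapshots[i + 1], overlap)
            arcs[((i, a), (i + 1, b))] = len(overlap) * (gc_i + gc_j) / 2
    return arcs


# Toy example with two time points (hypothetical data).
nodes = range(1, 7)
snapshots = [
    adjacency(nodes, [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]),
    adjacency(nodes, [(1, 2), (2, 3), (1, 3), (3, 4), (3, 5), (5, 6)]),
]
partitions = [
    [{1, 2, 3}, {4, 5, 6}],
    [{1, 2, 3, 4}, {5, 6}],
]
arcs = twg_arcs(snapshots, partitions)
```

Closeness or betweenness group centralities could be substituted for `group_degree_centrality` without changing the arc construction itself.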
The covering sequences in this model are targeted. The aim is to find the heaviest
sequence of groups (a critical path) covering the observation time points. It is a
narrower context in which the weights W(A, B) are generally maximized between each pair
of successive time points. In other words, it includes a succession of temporal overlaps
that locally maximize the combination of local stable composition and centrality: a
succession of larger and more central overlaps. In this sequence, persistence is not
ensured unless an overall stable composition is encapsulated inside. This configuration
is schematized in a layered architecture, in which the deepest level is a persistent
grouping of individuals in time. Accordingly, a core character is determined by this
persistent structure, whose particular identity is deeply derived from the higher layers,
that is, from the larger and central overlaps expressed on heavy arcs in the model. Thus,
the core region is clearly identified as an underlying structure, deeply determined by a
central and persistent stable grouping of individuals in time according to such an
architecture.
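One possible way to search for such a covering sequence is a dynamic program over the time layers, sketched below. It assumes that the weight of a sequence is the sum of its arc weights and that the persistent composition is the intersection of all groups along the selected sequence; both are interpretations made for the example, and the arc weights, partitions and function names are hypothetical.

```python
def heaviest_covering_path(arcs, n_steps):
    """Heaviest path from time point 0 to time point n_steps - 1.

    arcs: {((i, a), (i + 1, b)): weight}, nodes are (time index, group index).
    Returns (total_weight, path) or None when no covering path exists.
    """
    best = {}  # node -> (best accumulated weight, predecessor node)
    for i in range(n_steps - 1):
        for (src, dst), weight in arcs.items():
            if src[0] != i:
                continue
            if i == 0:
                base = 0.0
            elif src in best:
                base = best[src][0]
            else:
                continue  # src cannot be reached from time point 0
            candidate = base + weight
            if dst not in best or candidate > best[dst][0]:
                best[dst] = (candidate, src)
    finals = [node for node in best if node[0] == n_steps - 1]
    if not finals:
        return None
    end = max(finals, key=lambda node: best[node][0])
    path = [end]
    while path[-1][0] > 0:
        path.append(best[path[-1]][1])
    path.reverse()
    return best[end][0], path


def persistent_core(path, partitions):
    """Members present in every group of the path (may be empty)."""
    first_time, first_group = path[0]
    core = set(partitions[first_time][first_group])
    for time_index, group_index in path[1:]:
        core &= partitions[time_index][group_index]
    return core


# Toy example with three time points (hypothetical partitions and weights).
partitions = [
    [{1, 2, 3}, {4, 5, 6}],
    [{1, 2, 3, 4}, {5, 6}],
    [{1, 2}, {3, 4, 5, 6}],
]
arcs = {
    ((0, 0), (1, 0)): 1.5,
    ((0, 1), (1, 0)): 0.4,
    ((0, 1), (1, 1)): 0.5,
    ((1, 0), (2, 0)): 1.0,
    ((1, 0), (2, 1)): 0.8,
    ((1, 1), (2, 1)): 0.6,
}
total_weight, path = heaviest_covering_path(arcs, n_steps=3)
core = persistent_core(path, partitions)
```

An empty result from `persistent_core` corresponds to the case noted above, where the heaviest sequence does not encapsulate an overall stable composition.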
The resistant character and the strategic role played are used to draw a core identity
inside a topological representation of the behavior of collectivities in a SN evolving in
time. This is the infrastructure of a core region in time, whereas its corresponding
semantic character is not yet addressed. Semantics are generally related to the
informational richness arising from the actors animating the SN and from the context in
which they live. Although the temporal information on the SN dynamic is topologically
represented, it could be equally important for giving a semantic meaning to some
phenomena, particularly when it is well formalized in a temporal dynamic model (such as
the TWG). The semantic orientation of collectivities is addressed in the next phase. We
show how it can be implicitly inspired by the internal dynamic, or explicitly obtained
with a higher expressivity degree in a richer SN representation. This is beneficial for
semantically strengthening the significance of a core identity at a higher abstraction
level.
3.2 A Semantic Character Inspired Implicitly from the Dynamic
Behavior or Explicitly on a Richer Static Model
Generally, in a rigorous data representation, the semantic level processed depends on the
expressivity degree of the syntax by which the data are organized. Even if the same
concept is used in two different representations, it can be a semantic concept in the
first but not in the other. The collection of texts describing a given source in [28] is
a good example. This collection has been represented by a graph model, called semantic,
that connects each pair of texts (nodes) by an arc weighted by a similarity measure. This
measure has been considered as a binary semantic measure applied to the words of the
texts [28]. Accordingly, the measure is based on the overlap between texts, adopted in
this case as a semantic overlap. For each text having certain overlaps, a centrality
measure, equally considered as a semantic measure, is associated [28]. In contrast, our
temporal dynamic context, which models the evolution of social data through the previous
temporal graph (TWG), is a topological model. It is true that the textual data are
topologically represented by a graph model in [28], whence some analogous points can be
deduced with respect to our temporal model. These two representations are based on the
same topological concepts: nodes, arcs, weights and even overlaps. Nevertheless, the
graph of textual data has been considered as a semantic representation in [28].
Regardless of the temporal information expressed in the TWG model, the same topological
nature is manifested (Table 1) by these two models, which are both based on the overlap
concept. However, because the components used differ in nature (groups and social
entities on one side, texts and words on the other), the overlap concept acquires a
semantic character in the textual data. This illustration shows that a graph is not only
a topological model representing the data explicitly. Implicit (semantic) information can
be hidden behind it. In the case of a social graph model, semantic information may
equally arise, for instance through the causality of connections, implicit team formation
[22], etc. A good exploitation of such information requires an explicit enrichment of the
SN representation.

Table 1 Evolutionary process of a SN modeled by a TWG versus a semantic graph of texts

| | A topological dynamic model (TWG) | A semantic graph of texts |
| Nodes | Partition groups at each time point | Texts |
| Arcs | Connecting 2 groups belonging to 2 successive time points, conditioned by a non-empty overlap | Connecting 2 overlapping texts |
| Overlap | Between members of 2 groups | Between words of 2 texts (semantic overlap) |
| Weighting | Model weighting including overlap size (similarity measure) | Weighted by a semantic similarity measure based on semantic overlap |

Before moving to a richer model, a question is raised about the implicit semantics hidden
behind such a topological temporal dynamic model. We investigate the semantics behind the
temporal dynamic behavior of social entities, the behavior of collectivities and the
layered architecture of the model. The aim is to deeply strengthen the characterized
identity of a core region by understanding the semantics behind it. The durability
parameter cannot be explained only topologically through the persistence and stability
concepts. There may be other implicit arguments justifying the resistant behavior of such
a grouping of individuals. This may be clarified by another example [14], in which a
collectivity identity (e.g. in a political context) is determined on a semantic dimension
defined along two orientations. The first is a horizontal orientation based on the
feeling of belonging, expressed through the relations among group members and the
internal dynamic of the group [14]. Here, we speak of an implicit semantic side, which in
our proposition is indicated by the topological temporal dynamicity. The creation and
deletion of links have only topological effects, but they express a certain internal
orientation of a social entity to preserve its affiliation or not. Thus, a resistant
composition of a grouping of individuals follows from a shared feeling of belonging among
its members during the observation period. The members interact with each other (internal
dynamic) without this affecting their affiliation or the composition. However, the
semantic dimension is equally based on a vertical orientation [14]. It is more expressive
and is manifested by the degree of loyalty or solidarity (moral resources), such as a
common pride due to the winning of the football championship by a national team, a
subjective perception of cultural similarity, or even an emotional attachment [14]. In a
SN, this means more informational richness through the different relationships,
activities and orientations developed by social entities, which require an explicit
representation. Therefore, a richer graph model becomes more and more interesting for
expressing explicit semantic information in order to give a semantic character to a group
identity and possibly a semantic identity to a core grouping.
Fig. 4 Towards adopting a semantic character by a core grouping
3.2.1 A Semantic Static Model
Generally, a semantic core is characterized within complex systems that are defined in
high-level models (meta-models) based on ontologies, business rules, etc. The semantics
of a core structure are related to the SN studied and to the context in which it exists
(online and organizational SNs). They must be preceded by a background consisting of a
semantic SN model and of a way to exploit the expressed richness. The aim is not only to
enrich analytical studies (measures), but essentially to extract a semantic character for
a grouping of individuals and then a semantically significant identity of a core grouping
(Fig. 4). It is not straightforward to chain these phases. Considering the computational
complexities, it is recommended to proceed without the temporal dynamicity and to focus
on a semantic static model of a SN in this paper.
Firstly, when it comes to studying new SN traces (semantic traces), a networked
environment such as a collaborative learning environment is a simple and motivating
illustration for this contribution. The social aspect of learning in such an environment
is a new paradigm that has received little attention, particularly regarding semantic
models of collaborative social interactions. Recently, semantic web technologies have
appeared as powerful tools for expressing semantic information and even exploiting it in
OSNs. However, in this environment these technologies are limited to structuring
pedagogical resources or actors. Although this environment is not an explicit socializing
application, implicit SNs are generated from the collaborative interactions between
learners. Such a SN is therefore itself of an implicit nature. The semantic
representation of this collaborative SN is modeled by a typed RDF graph: a proposed
semantic graph based on a simple ontological model describing the social entity profile
and the diversity of the relations developed. One of the semantic aspects we propose and
extract is the influence of collaborative tools on socialization. Accordingly, the typing
of interactions is extracted from the collaborative tool used (synchronous or
asynchronous collaboration: CS/CA). This is related to another semantic aspect: the
social aspect of learning is equally affected by the collaborative tools. Within such an
environment, increasing the cognitive level of the learner is the primary objective,
while the positivity of an actor (a learner) towards collaborating (socializing), and
then its cognitive level, are influenced.
Fig. 5 Mapping from a semantic model (RDF graph) to a directed labeled graph preserving
the same expressivity
In a graph mining context, SNA studies usually face complexity problems: computing
path-based centralities and discovering communities and underlying structures are
already costly on topological representations. Analyzing the RDF social graph
directly would therefore probably increase the analysis complexity. In addition,
tools that process RDF graphs while meeting complex analysis requirements are
limited. Accordingly, we propose a mapping to an equivalent graph representation
(a directed labeled graph) that preserves the same semantic richness. The type of
relation (CS/CA) in the RDF graph is preserved by the labeling function on the arcs
of the target representation. Between two actors, the arc orientation distinguishes
the domain and the range (the trigger of collaboration) of the RDF property
describing the collaborative interaction. The aim of this mapping is to reduce the
complexity of subsequent studies (e.g. on the semantics of groups). We target a less
expensive processing that depends on the degree of expressed richness (Fig. 5). The
individual analysis measures can thus be parameterized, and different strategic
positions can be detected according to the relation type. A semantic detection of
groups becomes possible: each collectivity can be distinguished by an internal
connectivity based on a single link type. At the same time, individuals can be
affiliated to one, two or more groups (each with a different type of connectivity),
which creates overlapping zones. Such an intersection semantically reflects a kind
of core region that has not only an intermediary central role but also a semantic
positioning between various link types (bringing different communities closer).
Fig. 6 A grouping of individuals sharing the same tag that is semantically the most related to
other tags
On the other hand, the collectivity spirit cannot be based only on connectivity and
on the distinction of its type among a subset of individuals. It can be strengthened
more explicitly through the orientations expressed by the social entities. This
requires a higher degree of expressivity in the SN representation (semantic model).
A richer model relies on the available informational richness, namely these
orientations and activities. In OSN applications, the network is more explicit and
actors can announce their orientations as interests, by means of tags. This is the
social tagging phenomenon, in which a set of actors describes a set of objects with
a set of tags.
Tags can be semantically related. In this case, the semantic information is not
limited to the diversity of relationships between actors; it also concerns tag usage
(user-tag) and the links between tags. A more enriched semantic model of the SN is
then required to structure not only individual-individual relations but also the
resulting individual-tag and tag-tag links. Intuitively, an interest community can
be formed by actors sharing the same tag (Fig. 6). The collectivity identity acquires
a semantic character (a common interest), but it should not be topologically deprived
of its connectivity: internal cohesion among a subset of (densely linked) individuals
is essential to qualify it as a semantic group (sharing the same tag). The identity
of a core structure is determined by a grouping of individuals and semantically
inspired by such orientations. We propose that the internal collectivity spirit,
refined by a shared interest based on a common tag, be strengthened by a particular
semantic positioning. In other words, we believe that the semantics of a core
identity does not concern only an internal aspect. It can be modeled by a cohesive
subset of individuals sharing the tag that is the most related to other tags: a
semantic crossroad where different semantic regions of the network meet. Regardless
of the decomposition approach used and the related computing cost, this proposition
requires the fusion of structural analysis with richer semantic models of SN. This
preserves the core infrastructure based on the group concept and gives, internally
or externally, a semantic dimension to this collectivity identity in a static context.
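The idea of a semantic crossroad can be illustrated by a minimal Python sketch. The
toy data, the use of the tag-tag degree as the measure of how related a tag is, and
the density threshold min_density are all assumptions introduced for this example;
the chapter does not prescribe a particular measure.

```python
from itertools import combinations

# Hypothetical toy data: tag usage, tag-tag links and individual-individual links.
USER_TAGS = {
    "a": {"graphs"}, "b": {"graphs"}, "c": {"graphs", "rdf"},
    "d": {"rdf"}, "e": {"learning"},
}
TAG_LINKS = {("graphs", "rdf"), ("graphs", "learning")}
USER_LINKS = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}


def most_related_tag(tag_links):
    """Return the tag linked to the most other tags (ties broken alphabetically)."""
    degree = {}
    for t1, t2 in tag_links:
        degree[t1] = degree.get(t1, 0) + 1
        degree[t2] = degree.get(t2, 0) + 1
    return min(degree, key=lambda t: (-degree[t], t))


def density(members, user_links):
    """Fraction of the possible undirected links that are present among members."""
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 0.0
    present = sum(1 for u, v in pairs if (u, v) in user_links or (v, u) in user_links)
    return present / len(pairs)


def semantic_core_candidate(user_tags, tag_links, user_links, min_density=0.5):
    """Users sharing the most related tag, kept only if they are densely linked."""
    tag = most_related_tag(tag_links)
    members = {u for u, tags in user_tags.items() if tag in tags}
    if density(members, user_links) >= min_density:
        return tag, members
    return tag, set()


core_tag, core_members = semantic_core_candidate(USER_TAGS, TAG_LINKS, USER_LINKS)
```

The density check reflects the requirement that a shared tag alone is not enough:
the candidate grouping must also remain topologically cohesive.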
4 Experimental Results
The identity of a core region is studied through illustrative experimental results
on two different data sets of emergent SN. Beyond the models intuitively schematized
by classical (non-random) graphs, the multiple contexts in which SN live require an
adapted modeling that supports the amount of information to be expressed. Of the two
aspects, dynamic and semantic, the first network is represented by an evolutionary
process model and the second by a semantic static model, following the schemas
presented in the previous section.
First, the dynamic aspect of a SN is expressed through temporal information labeling
the dynamicity of the social entities and their interactions. We study a sample of
an implicit organizational communication network evolving in time. The network is
generated from e-mail communications within the Enron company (the Enron Energy
Corporation [26]). These data were targeted by the Federal Energy Regulatory
Commission during its investigation [15, 17] (between 1999 and 2002) following an
accounting scandal: fraudulent manipulations hid billions in debt at Enron and
caused its bankruptcy in 2001. We use a network sample formed by 112 Enron employees
linked by e-mail exchanges: an undirected edge is created when at least one e-mail
is sent between a pair of nodes during one year. By assumption, the set of nodes is
consistent over time. The network is modeled by a temporal weighted graph (TWG)
model processed with the Pajek tool [4–6] (version 3.08). This is an evolutionary
process model that successively links the network imprints. Each imprint is a
configuration of cohesive groups resulting from the partitioning of the network at
each time point (12 time points). The partitioning is driven by a modularity
function that ensures internal cohesion inside the collectivities (Fig. 7).
Following a layered architecture, a narrower context is identified through the
heaviest sequence of groups $(A_1, \ldots, A_{12})$, with total weight
$\sum_{i=1}^{11} W(A_i, A_{i+1}) = 1517.5$, covering the observation period. Between
each two successive time points, this sequence is generally formed by groups linked
by heavy arcs. Between 82 and 99 % of heaviest arcs are covered (Fig. 8).
Fig. 7 Generational view of SN under an evolutionary model (TWG) (VOSviewer [29])
In other words, it covers a succession of temporal overlaps among these groups
($A_i \cap A_j \subseteq A_i, A_j$), within which a persistent structure
$N \subseteq A_i \cap A_j$ ($j = i + 1$, $i = 1, \ldots, 11$) should be deeply
encapsulated. The succession contains heavily weighted arcs that maximize the
combination of parameters expressed locally in these weights over time. We found
that when a subset of individuals survives (persists) inside such a context (larger
and more central overlaps), it can imitate these characteristics with 95–97 % of
subordination. This is a larger stable composition (a deeper layer, Fig. 9) having a
central role, because the groups forming the sequence are generally the most central
structures at each time point and the centralities of their successive overlaps are
close to them. Consequently, the centrality of this persistent grouping is generally
higher over time (Fig. 9).
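The search for a heaviest sequence of groups across successive time points can be
sketched as a dynamic program over layers. In this sketch the arc weight between two
groups is assumed to be the size of their overlap, which is a simplification: the
weights used in this chapter combine several parameters. The three-layer toy data
are hypothetical.

```python
def heaviest_sequence(layers):
    """Dynamic programming over time layers.

    layers[t] is the list of groups (sets of node ids) found at time point t.
    Assumption: the weight of an arc between groups at successive time points
    is the size of their overlap.
    Returns (total_weight, [index of the chosen group at each time point]).
    """
    best = [0] * len(layers[0])  # best weight of a sequence ending at each group
    back = []                    # back[t - 1][j]: predecessor of group j of layer t
    for t in range(1, len(layers)):
        new_best, pointers = [], []
        for group in layers[t]:
            scores = [best[i] + len(prev & group)
                      for i, prev in enumerate(layers[t - 1])]
            i_max = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[i_max])
            pointers.append(i_max)
        best = new_best
        back.append(pointers)
    # Backtrack from the best group of the last layer.
    j = max(range(len(best)), key=best.__getitem__)
    total, path = best[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


def persistent_members(layers, path):
    """Nodes present in every group of the chosen sequence."""
    groups = [layers[t][j] for t, j in enumerate(path)]
    return set.intersection(*groups)


# Hypothetical partitions of six nodes at three time points.
LAYERS = [
    [{1, 2, 3, 4}, {5, 6}],
    [{1, 2, 3, 5}, {4, 6}],
    [{1, 2, 3, 6}, {4, 5}],
]
total, path = heaviest_sequence(LAYERS)
core = persistent_members(LAYERS, path)
```

The set returned by persistent_members plays the role of the persistent structure
N encapsulated in the successive overlaps.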
Therefore, beyond internal cohesion, combining the durability of a larger
collectivity with the strategic role it plays in the network communication flows
leads to an interesting identity. It deeply and significantly characterizes the
infrastructure of a possible underlying core region inside the topological temporal
dynamicity of the SN.
Fig. 8 Arc weights variation on the critical sequence compared to the heaviest arc between each
two successive time points
Fig. 9 Larger persistent grouping of individuals having a central role deeply imitated
Second, a higher degree of expressivity is adopted for a second dataset in order to
illustrate a semantic dimension without multiplying the analysis complexity. We
target a simple semantic character that feeds the collectivity spirit, and then a
possible semantization of a core identity inside a static picture of an emergent SN.
A collaborative learning environment is another new source of computer-mediated
social interactions, because it tends to foster a social, collaborative mentality
among learners: Computer Supported Collaborative Learning (CSCL) [1, 2]. Here,
increasing the cognitive level of learners is the common individual objective. The
collaborative social interactions within learning communities are therefore more
goal-oriented than other social relationships (in social platforms). There is a
deeper semantic aspect behind such interactions: the collaborative act is acquired
through, and constrained by, the social skills of the learner and its positivity
towards collaboration. These elements are also influenced by the nature of the
collaborative tools used, which means that the collaborative act is semantically
affected by these tools (social media) as well. Two types of interaction are
distinguished in this paper: synchronous and asynchronous collaboration.
Accordingly, a semantic model of a SN of collaborating learners is generated in the
form of a typed RDF graph linking 20 learners. This RDF model is based on a simple
ontological model describing the nature of the relationships (synchronous or
asynchronous collaboration: CS or CA).
An experimental prototype was built (in Java) for a less expensive semantization of
some analytical studies, given the expressed richness: the analysis is parameterized
on these new traces. The prototype first applies the proposed mapping schema from
the RDF data (collaborations of learners) to a directed labeled graph, which must
preserve and convey the same semantic information. The RDF relational data are
extracted (using the Jena API) and regenerated as nodes and labeled arcs. The nodes
and arcs are objects (using JUNG, the Java Universal Network/Graph framework) able
to capture the same expressivity (user profile, labels and orientation of arcs: CS
or CA); they form the target graph to analyze (Fig. 10). The analysis measures
(centrality measures and even global indicators: density, diameter, etc.) are thus
parameterized, which makes it possible to detect different individual strategic
positions according to the relation type (Fig. 11).
For example, after normalizing the individual centralities (e.g. betweenness), the
most central actor on the synchronous collaborative communication flows (learner 18)
is not the same as in the asynchronous case, where another actor (learner 14) plays
the most intermediary role. Moreover, different central positions (learner 12 or
learner 7) are identified when the nature of the interactions is not distinguished
(non-typed graph). Collaboration is a symmetric social interaction between two
nodes, but it is initiated by a collaboration request. This supplementary semantic
information, added to the semantics of the relation, is modeled by the arc
orientation, deduced from the asymmetric RDF properties (domain/range). Hence, the
initiator of the collaborative interaction between two nodes (the transmitter) can
be identified, as well as the receiver of the request. The analysis can thus be
further enriched by refined measures: we can compute the prestige of a node
depending on the collaboration requests it receives or sends.
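A parameterized centrality of this kind can be sketched with Brandes' algorithm
restricted to the arcs of one label. This is a standard-library Python illustration,
not the Java prototype; the four-node graph is hypothetical and the scores are raw,
not normalized.

```python
from collections import deque


def betweenness(nodes, arcs, label=None):
    """Brandes' algorithm on the directed graph restricted to arcs of one label.

    arcs is a list of (source, target, label) triples; label=None keeps every
    arc (the non-typed graph). Scores are raw betweenness values.
    """
    succ = {v: [] for v in nodes}
    for u, v, lab in arcs:
        if (label is None or lab == label) and v not in succ[u]:
            succ[u].append(v)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in succ[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse breadth-first order.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical typed arcs: (initiator, receiver, relation type).
NODES = [1, 2, 3, 4]
ARCS = [(1, 2, "CS"), (2, 3, "CS"), (1, 4, "CA"), (4, 3, "CA"), (3, 1, "CA")]

bc_cs = betweenness(NODES, ARCS, "CS")
bc_ca = betweenness(NODES, ARCS, "CA")
bc_all = betweenness(NODES, ARCS)
```

On this toy graph, node 2 is the only intermediary for CS arcs, whereas nodes 1 and
3 dominate once the types are merged, which mirrors the kind of shift in central
positions reported above.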
If the structural potential of a social entity varies depending on the nature of its
relations and orientations, the network connectivity is globally affected. The
affiliation to a possible grouping of individuals can be semantically determined
when the tendency of actors to regroup is marked by the same relation type within
the collectivity. The collectivity spirit, already based on internal cohesion, is
then strengthened by a semantic character, manifested by the common nature of the
interactions. Regardless of the decomposition method used, two different modular
structures (cohesive groups) are obtained in the network by distinguishing the
relation type (synchronous or asynchronous collaboration group).
Fig. 10 A directed labeled graph describing the nature of the relationships (from a semantic
social network)
Fig. 11 Betweenness centrality of actors (learners) according to their interaction type
Fig. 12 Semantic character of a group described by the same relation type linking its members,
and overlap with groups of a different type
Consequently, if this network is organized under different modular configurations,
an actor can be affiliated to different social groupings at the same time (Fig. 12).
An overlapping zone thus emerges between different collaboration regions. It holds a
potential location as a central region (a structural hole connecting the boundaries
of groups) that topologically describes a kind of core structure. In addition, a
semantic character is added to such a structure, since it semantically brings
together subsets of individuals, each having a single collectivity spirit, inside
this particular SN.
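The overlapping zone can be illustrated with a short Python sketch. Since the text
leaves the decomposition method open, connected components of each typed subgraph
are used here as a deliberately simple stand-in for cohesive groups; this choice and
the six-node example are assumptions of the sketch.

```python
def components(nodes, edges):
    """Connected components (of size >= 2) of an undirected graph, as sets."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        if len(comp) >= 2:  # isolated nodes do not form a group
            groups.append(comp)
    return groups


def overlapping_zone(nodes, typed_edges):
    """Nodes affiliated both to a CS group and to a CA group."""
    covered = {}
    for label in ("CS", "CA"):
        edges = [(u, v) for u, v, lab in typed_edges if lab == label]
        covered[label] = set().union(*components(nodes, edges))
    return covered["CS"] & covered["CA"]


# Hypothetical network: a synchronous triangle and an asynchronous triangle
# sharing learner 3; learner 6 takes part in no collaboration.
NODES = [1, 2, 3, 4, 5, 6]
TYPED_EDGES = [
    (1, 2, "CS"), (2, 3, "CS"), (1, 3, "CS"),
    (3, 4, "CA"), (4, 5, "CA"), (3, 5, "CA"),
]
zone = overlapping_zone(NODES, TYPED_EDGES)
```

Any community detection method could replace components here; only the intersection
step is specific to the overlapping zone.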
5 Discussion
Depending on the degree of centralization or decentralization of the network, a core
character should be acquired by a subset of individuals that first forms a cohesive
structure. The concept of a (densely connected) group is well suited to shaping a
core identity, whatever the informational richness modeled, in either a dynamic or a
semantic context. In our SN models, we considered additional information on the
temporal behavior of social entities, their relations, interests, preferences or
orientations, etc. The aim of this paper was to characterize a core identity
significantly. However, the availability of temporal or semantic information on
social data, and the way to capture it explicitly, are not straightforward, which
makes it harder to build models more realistic than static and structural
representations. Although such an identification of a core identity can provide
informative and meaningful answers, it multiplies the complexity of the related
analytical studies. The underlying nature of a core region is itself a first
complication, which increases when the region is studied (as a collectivity) on a
dynamic or semantic dimension and in larger networks. The complexity is further
worsened by longer observation periods in a dynamic model, or by a more expressive
semantic richness (and the way to exploit it) in a semantic model. At the same time,
an almost continuous dynamic model, or a massively rich semantic model at a very
high abstraction level, becomes infeasible anyway.
Nevertheless, we believe that a core region can be significantly characterized on a
temporal or semantic dimension of a SN (Fig. 13). We propose that this identity be
inspired either by the dynamic behavior of groupings and their successive temporal
overlaps, or by their semantic character and their static semantic overlaps. An
evolutionary process of collectivities is proposed to model a SN evolving in time by
capturing the durability and efficiency of groups. A core identity is significantly
acquired by a cohesive grouping of individuals deeply situated as a larger stable
composition playing a central role over time. In this case, several choices affect
the results: the resolution of the time windows, the selected group centrality and
how it is quantified, etc. Moreover, even if this persistent structure plays a
strategic role, the stability of that role, a sufficient balance between parameters
and the sensitivity of the SN to such a region must also be considered. We have seen
how an implicit semantic orientation can be deduced from a feeling of belonging that
causes the durability of this collectivity (an internal dynamic that does not affect
the composition). However, this reasoning was based on a topological representation
in which the semantics of actors and their relations are not explicitly expressed.
It therefore does not allow a sufficient investigation of a clearer internal
orientation and of a semantic character for such a region compared to other regions.
On the other hand, a core identity is determined on a semantic dimension (Fig. 13)
from the nature of the relationships between the social entities or from their
interests (e.g. expressed by tags). A semantic model of a SN (e.g. an RDF graph) is
therefore formed, with a degree of expressivity that depends on the adopted
abstraction level (e.g. the abstraction of the ontological model). The feeling of
belonging can be strengthened when individuals are involved in relations of the same
nature inside a collectivity, or when they share the same interest (tag). We propose
that the character of a core be semantically manifested by a region situated as an
intersection zone that brings together different semantic identities of groups
(e.g. between different relations or interests), such as a center of interests.
Here, the topological internal connectivity and the central positioning of such a
region are not preserved as an infrastructure. In contrast, regardless of the
semantic richness expressed, the way to exploit it and the related complexity, these
semantic models are static social representations that aggregate all network links
in a single representation, exactly as if they appeared at the same time. Temporal
information such as the time ordering [25] of links, and hence their lifetime, is
not considered. Accordingly, a misleading identification of a core region can be
produced, following over- or underestimated parameters of connectivity, collectivity
spirit, group centrality quantification, etc. (Fig. 13).
Fig. 13 Road to characterizing significantly an identity describing internally and externally a core
region through an elite grouping of individuals by bringing closer the structural dynamicity and the
static semantic richness of emergent SN models
6 Conclusion
Beyond the static and structural analysis framework, a possible core structure
inside a SN living in dynamic and richer contexts can be qualified as an elite
grouping of individuals. It should express a two-sided identity, significantly
characterized on new analysis dimensions. The internal identity is based on the
internal cohesion of a subset of individuals showing a particular dynamic behavior
of the collectivity (durability in time); it can be strengthened by a united
semantic orientation. The external side of this identity is determined topologically
by a strategic positioning over time, or semantically by the crossing of different
semantic regions in the network.
Such an identity appears informative for feeding business strategies and decisions
and homeland security, for example in studies on P2P networks [16], political
networks, social movements and epidemiology [20], and even in investigations of
illegal SN hiding fraudulent behaviors, crime, terrorism [20], etc.
In this work, the temporal and semantic aspects are modeled separately, through a
structural dynamic model and a semantic static model respectively. A larger, deeply
resistant composition playing a strategic role in the communication flows
distinguishes a significant identity for a core's infrastructure. It could be
further refined by considering other parameters (e.g. the stability of centrality)
over time, and it may even be informative for addressing SN fragility issues in a
dynamic context. Such a structure can also be determined in a richer static
representation, through a collectivity sharing the same semantics and situated
semantically as an overlapping zone between different regions. This is one direction
towards a semantic core of a SN. On the other hand, some signs of convergence
between the two models can be deduced. Durability reflects a particular internal
dynamic guided by a feeling of belonging, which illustrates some semantic character.
In addition, the external identity is derived from overlaps, either larger and
central successive temporal overlaps or semantic overlaps. Carrying out such an
analytical study on larger networks, over longer observation periods and at a higher
abstraction level will be another challenge, given the increased complexity.
Nonetheless, meta-models based on the fusion of semantic and dynamic aspects should
produce more expressive dynamic models. This can be a further step towards
characterizing a more significant identity of an underlying core structure inside
a SN.
References
1. Abel M-H, Leblanc A (2008) E-MEMORAe2.0: an e-learning environment as learners
communities support. Int J Comput Sci Appl (Special issue on new trends on AI techniques
for educational technologies) 5(1):108–123
2. Adeline L (2008) Environnement de collaboration et mémoire organisationnelle de formation
dans un contexte d'apprentissage. Université de Technologie de Compiègne, Thèse de Doctorat,
Informatique, dir. M.-H. Abel, J.P. Barthes, 03.12.2008
3. Ahn J, Taieb-Maimon M, Sopan A, Plaisant C, Shneiderman B (2011) Temporal visualization
of social network dynamics: prototypes for nation of neighbors. In: Proceedings of social
computing, behavioral-cultural modeling and prediction conference (November 2010).
HCIL-2010-28, pp 309–316
4. Batagelj V, Mrvar A (1998) Pajek: program for large network analysis. Connections
21(2):47–57
5. Batagelj V, Mrvar A (2008) Pajek workshop at XXVIII Sunbelt Conference. St. Pete Beach,
Florida, USA, Jan 22–27
6. Batagelj V, Mrvar A (2008) Pajek: analysis and visualization of large networks. In: Juenger
M, Mutzel P (eds) Graph drawing software. Mathematics and visualization, Springer, Berlin,
pp 77–103. ISBN: 3-540-00881-0
7. Berger-Wolf TY, Saia J (2006) A framework for analysis of dynamic social networks. In:
Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and
data mining. Philadelphia, pp 523–528
8. Borgatti SP, Everett MG (2000) Models of core/periphery structures. Soc Netw 21(4):375–395.
Elsevier
9. Erétéo G, Gandon F, Buffa M (2011) SemTagP: semantic community detection in folksonomies.
In: Proceedings of the 2011 IEEE/WIC/ACM international conferences on web intelligence
and intelligent agent technology, WI-IAT 11, vol 1. pp 324–331, ISBN: 978-0-7695-4513-4
10. Erétéo G, Gandon F, Buffa M, Corby O (2009) Semantic social network analysis. In:
Proceedings of the WebSci 09: society on-line, 18–20 Mar 2009, Athens, Greece
11. Everett MG, Borgatti SP (1999) The centrality of groups and classes. J Math Sociol
23(3):181–201
12. Hamadache B, Seridi-Bouchelaghem H, Farah N (2013) Toward characterizing a more
significant identity of core structure within dynamic social network. In: 2013 IEEE/ACM
international conference on advances in social network analysis and mining (ASONAM 2013),
Niagara Falls, Canada, 25–27 Aug 2013
13. Jamali M, Haffari G, Ester M (2011) Modeling the temporal dynamics of social rating networks
using bidirectional effects of social relations and rating patterns. In: International world wide
web conference committee (IW3C2), WWW 2011, Session: temporal dynamics. 28 Mar–1
Apr 2011, ACM, Hyderabad, India. ISBN: 978-1-4503-0632-4/11/03
14. Karolewski IP (2009) Citizenship and collective identity in Europe. Routledge advances in
European politics, Kindle Edition, pp 83–85. (Routledge, 24 August 2009, p 260)
15. Klimmt B, Yang Y (2004) Introducing the Enron corpus. In: CEAS conference
16. Lathia N, Hailes S, Capra L (2008) kNN CF: a temporal social network. In: Recsys 08:
proceedings of the 2008 ACM conference (23–25 October 2008, Lausanne, Switzerland), on
recommender systems. ASSOC Computing Machinery, pp 227–234
17. Leskovec J, Lang K, Dasgupta A, Mahoney M (2009) Community structure in large networks:
natural cluster sizes and the absence of large well-defined clusters. Int Math 6(1):29–123
18. McGlohon M, Faloutsos C (2008) Graph mining techniques for social media analysis. In:
International conference on weblogs and social media (ICWSM), Seattle
19. Meeder B, Karrer B, Sayedi A, Ravi R, Borgs C, Chayes J (2011) We know who you followed
last summer: inferring social link creation times in Twitter. In: International world wide web
conference committee (IW3C2), WWW 2011, Session: temporal dynamics. 28 March–1 April
2011, ACM, Hyderabad, India. ISBN: 978-1-4503-0632-4/11/03
20. Memon N, Alhajj R (2011) Introduction to the first issue of social network analysis and mining
journal, published online: 13 Nov 2010. In: SOCNET (2011), vol 1. Springer, New York, pp
1–2. doi:10.1007/s13278-010-0016-2
21. Memon N, Alhajj R (2011) Introduction to the second issue of social network analysis and
mining journal: scientific computing for social network analysis and dynamicity, published
online: 29 Mar 2011. Soc Netw Anal Min, vol 1. Springer, pp 73–74.
doi:10.1007/s13278-011-0022-z
22. Nettleton DF (2013) Data mining of social networks represented as graphs. Comput Sci Rev
7:1–34
23. Reda K, Tantipathananandh C, Berger-Wolf T, Leigh J, Johnson AE (2009) SocioScape:
a tool for interactive exploration of spatio-temporal group dynamics in social networks. In:
Proceedings of the IEEE information visualization conference (INFOVIS 09), 11–16 Oct 2009,
Atlantic City, New Jersey
24. Snijders TAB, Doreian R (2010) Introduction to dynamic social network analysis, introduction
to the special issue on network dynamics. J Soc Netw 32(1):1–3
25. Tang J, Musolesi M, Mascolo C, Latora V (2010) Characterising temporal distance and
reachability in mobile and online social networks. ACM SIGCOMM Comput Commun Rev 40(1):118
26. Tang J, Musolesi M, Mascolo C, Latora V, Nicosia V (2010) Analysing information flows and
key mediators through temporal centrality metrics. In: Proceedings of the 3rd workshop on
social network systems (SNS 10), 13 Apr 2010, ACM, Paris, France
27. Tantipathananandh C, Berger-Wolf T, Kempe D (2007) A framework for community
identification in dynamic social networks. In: Proceedings of the 13th ACM SIGKDD international
conference on knowledge discovery and data mining, KDD 07, 12–15 Aug 2007, New York,
pp 717–726
28. Traub MC, Lamers MH, Walter W (2010) A semantic centrality measure for finding the most
trustworthy account. In: Proceedings of the IADIS international conference informatics, July
2010, Freiburg, Germany, pp 117–125
29. van Eck NJ, Waltman L (2012) VOSviewer (Version 1.5.35 Dec 2012).
http://www.vosviewer.com/
30. Wang L, Hopcroft J, He J, Liang H, Suwajanakorn S (2013) Extraction the core structure of
social network using alpha beta community. Int Math 9(1):58–81. Published on 1 January 2013.
Taylor and Francis Groups
31. Zhou D, Councill I, Zha H, Lee Giles C (2007) Discovering temporal communities from
social network documents. In: IEEE international conference on data mining (ICDM 2007),
pp 745–750
The Power of Consensus: Random Graphs
Still Have No Communities
Romain Campigotto and Jean-Loup Guillaume
Abstract Communities are a powerful tool to describe the structure of complex
networks. Algorithms aiming at maximizing a quality function called modularity
have been shown to effectively compute the community structure. However, some
problems remain: in particular, it is possible to find high modularity partitions in
graphs without any community structure, notably random graphs. In this paper,
we study the notion of consensual communities, or community cores, and show that
they do not exist in random graphs. For that, we exhibit a phase transition based on
the strength of consensus: below a given threshold, all the nodes belong to the same
consensual community; above this threshold, each node is in its own consensual
community. We compare the results using different quality functions as well as
different models of random graphs, with or without communities.
Keywords Random graphs · Overlapping communities · Consensual community ·
Complex networks · Community core · Modularity
1 Introduction
Complex networks appear in various contexts such as computer science (networks
of Web pages, peer-to-peer exchanges), sociology (collaborative networks), biology
(protein–protein interaction networks, gene regulatory networks), etc. These
networks can generally be represented by graphs, where nodes represent entities and
edges indicate interactions between them. For example, a social network can be
represented by a graph whose nodes are individuals and edges represent a kind of
social relationship. Likewise, a protein–protein interaction network can be modeled
by a graph whose nodes are proteins and edges indicate known physical interactions
between proteins.
The work presented in this paper is an extension of [6].
R. Campigotto · J.-L. Guillaume (B)
Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, 75005 Paris, France
e-mail: jean-loup.guillaume@lip6.fr
R. Campigotto
e-mail: romain.campigotto@lip6.fr
R. Campigotto · J.-L. Guillaume
CNRS, UMR 7606, LIP6, 75005 Paris, France
DOI 10.1007/978-3-319-12188-8_7
An important feature of such networks is that they are generally composed of
highly interconnected sub-networks called communities [13, 30]. Communities can
be considered as groups of nodes which share common properties and/or play similar
roles within the graph. The automatic detection of such communities has attracted
much attention in recent years and many community detection algorithms have been
proposed (see [11] for a survey). Most of these algorithms are based on the maxi-
mization of a quality function known as modularity [25], which measures the internal
density of communities. Modularity maximization is an NP-hard problem [4] and
most algorithms use heuristics. However, even if the Newman-Girvan modularity
is predominant in the context of community detection, other quality functions have
been proposed over the years (see for example [20, 31, 35]) but they have been less
studied in this context.
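For reference, the Newman-Girvan modularity of a partition of an undirected,
unweighted graph can be computed as the sum, over communities, of the fraction of
internal edges minus the squared fraction of edge endpoints in the community. The
following Python sketch assumes an edge list without duplicates or self-loops; the
two-triangle example is hypothetical.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, without duplicates or self-loops.
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    degree, internal = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            label = community[u]
            internal[label] = internal.get(label, 0) + 1
    # Sum of the degrees of the nodes of each community.
    total_degree = {}
    for node, k in degree.items():
        label = community[node]
        total_degree[label] = total_degree.get(label, 0) + k
    return sum(internal.get(label, 0) / m - (d / (2 * m)) ** 2
               for label, d in total_degree.items())


# Two triangles joined by a single bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

q_two = modularity(EDGES, two_groups)
q_one = modularity(EDGES, one_group)
```

Placing every node in a single community always yields a modularity of zero, which
gives a baseline against which other partitions can be compared.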
In random graphs, however, links appear independently of each other, so a strong
inhomogeneity in the density of links on these graphs is not expected. Therefore,
random graphs should not have communities using the previous definition. As shown
in [15], due to fluctuations, it is possible to find partitions with significantly high
modularity in random networks. A good community detection algorithm should
therefore be able to find communities if it is relevant, but also to indicate the absence
of community structure.
1.1 Our Contribution
Here, we assume that, if multiple runs of a non-deterministic community detection
algorithm agree that a given set of nodes belongs to a community, then this set is
certainly more significant than a community found by a single run. In the following,
we will show that this definition of consensual communities, or community cores,
makes it possible to distinguish real graphs from random graphs in terms of
community structure. More precisely, we will prove that random graphs only contain
trivial consensual communities, i.e. consensual communities containing all the nodes
of the graph or consensual communities containing a single node, in so far as the
size of the graphs is finite. We will show that there is a phase transition between
these two states depending on a resolution parameter for the size of the consensual
communities. For finite graphs, this transition is abrupt but not instantaneous.
Consensual clustering has been introduced in [8, 9] and its application to networks
in [19, 29, 32]. We will also show that this observation is not directly related to
the Newman-Girvan modularity and that other quality functions exhibit the same
behavior. Finally, using a model of random graph with known communities [13],
we will show that depending on the strength of the communities we can go from a
situation where cores are clearly defined to a situation where the graph is random-like.
1.2 Organization of the Paper
We provide a general description of algorithms used for detecting consensual
communities in Sect. 2. We then present experimental results on artificial and real
networks in Sect. 3 and the proof of the absence of non-trivial consensual communities
in random graphs in Sect. 4. We finally conclude in Sect. 5.
2 Consensual Communities
Following the work of Diday [8, 9] on consensual clustering of vectors, different
studies have proposed to adapt this method to graphs and to combine different
partitions into consensual communities. The common features of these methods are to
(i) compute different partitions and (ii) combine these partitions to find similarities.
A consensual community is therefore a set of nodes which are frequently classified
in the same community across multiple computations. We will give a more formal
definition later on, mainly to specify the meaning of "frequently". The main reason
for using consensual communities rather than classical communities is that
most techniques used to compute communities can usually provide more than
one solution. This may come from the initial conditions of the algorithms, for instance
the random seed which is generally used in non-deterministic algorithms, or from
the fact that algorithms can depend on the numbering of the nodes, for instance if
they consider nodes in a given order. The landscape of the optimized function can
also be highly non-convex, leading to many local maxima. Given that many
local maxima can be very similar in quality even if they are structurally very
different, there is no reason to prefer one over another, since they all capture
the structure of the network equally well. In the absence of a good way to choose one
partition among all, finding a consensual partition therefore seems to be a good
compromise.
Consensual communities can also provide deeper insight into the structure of
the network, since they summarize many partitions and encode more information about
the structure. They can also smooth out the defects of each single partition. The classical
example consists of two cliques (complete graphs) C1 and C2 overlapping on some
nodes C = C1 ∩ C2. Any single run will classify the overlapping nodes of C either
with the nodes of C1 or with the nodes of C2, and neither of these choices is better than
the other. However, when combining multiple executions, the fact that the nodes
of C belong to both C1 and C2 will clearly appear. For this reason, consensual
communities have already been used in the context of overlapping communities,
for instance in [33]. It has also been shown that consensual communities are more
resilient to modifications of the networks [28] and could therefore be more suitable
for studying evolving communities in graphs.
Two main approaches are used to obtain different partitions. The first one consists
in perturbing a given network by rewiring a small fraction of links [17] or slightly
changing the weights on links [12, 27]. The second one, which we use
hereafter, consists in exploiting the non-determinism of some algorithms to obtain
different partitions. For instance, the Louvain method [3] (among others) can give different
results depending on the order in which nodes are considered by the algorithm. This
has been used in [19, 29] to compute consensual communities and in [32] to compute
overlapping ones. A generic version of Louvain is under development, in which
different quality functions can be plugged in [5].
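The second approach can be sketched as follows. This is only an illustration: it uses asynchronous label propagation as a stand-in for Louvain (it is not the algorithm used in this chapter), chosen because it is short and its result depends on the order in which nodes are visited. The names `label_propagation` and `run_many` are hypothetical.

```python
import random
from collections import Counter

def label_propagation(adj, rng, max_sweeps=100):
    """Asynchronous label propagation (stand-in for Louvain, assumption).

    The outcome depends on the node visiting order and on random tie-breaking.
    """
    labels = {v: v for v in adj}           # every node starts in its own community
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                 # source of non-determinism: node order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                   # an isolated node keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)   # ties are broken at random
                changed = True
        if not changed:
            break
    return labels

def run_many(adj, n_runs, seed=0):
    """Return n_runs partitions, each a dict mapping node -> community label."""
    rng = random.Random(seed)
    return [label_propagation(adj, rng) for _ in range(n_runs)]

# Toy graph: two triangles joined by the single link 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
partitions = run_many(adj, n_runs=20)
```

Each element of `partitions` is one run; the runs may disagree, which is exactly what the consensus step exploits.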
2.1 Definitions
Given a graph G = (V, E) with n = |V| nodes, we apply N times a non-deterministic
community detection algorithm A to G. At the end of each execution,
each pair of nodes (i, j) ∈ V × V is classified either in the same community or in
different communities. We keep track of this in a matrix of size n × n, which we
denote by $P^N = [p_{ij}]_{n \times n}$, where pij represents the fraction of the N executions in
which i and j were classified in the same community. Note that $P^N$ is a symmetric
matrix (pij = pji), and we set pii = 0. From $P^N$, we create a complete weighted
graph G′ = (V, V × V, W), where the weight of the link (i, j) is pij. Finally, given
a threshold α ∈ [0, 1], we remove all links having pij < α from G′ to obtain the
virtual graph with threshold, G′_α. The connected components of the virtual graph
G′_α obtained with a given α are called α-cores.
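A minimal sketch of this construction, assuming each run is stored as a dict mapping a node to its community label and writing the threshold as `alpha`. The function names and the toy partitions are illustrative only.

```python
from itertools import combinations

def coclassification(partitions, nodes):
    """p[i][j] = fraction of runs in which i and j share a community (p[i][i] = 0)."""
    n_runs = len(partitions)
    count = {i: {j: 0 for j in nodes} for i in nodes}
    for part in partitions:
        for i, j in combinations(nodes, 2):
            if part[i] == part[j]:
                count[i][j] += 1
                count[j][i] += 1
    return {i: {j: count[i][j] / n_runs for j in nodes} for i in nodes}

def alpha_cores(p, nodes, alpha):
    """Connected components of the virtual graph keeping pairs with p[i][j] >= alpha."""
    seen, cores = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, core = [start], []
        while stack:
            v = stack.pop()
            core.append(v)
            for u in nodes:
                if u != v and u not in seen and p[v][u] >= alpha:
                    seen.add(u)
                    stack.append(u)
        cores.append(sorted(core))
    return cores

# Three hypothetical runs on four nodes.
nodes = [0, 1, 2, 3]
partitions = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "a", 1: "a", 2: "a", 3: "b"},
    {0: "a", 1: "a", 2: "b", 3: "b"},
]
p = coclassification(partitions, nodes)
cores_high = alpha_cores(p, nodes, 1.0)   # [[0, 1], [2], [3]]
cores_mid = alpha_cores(p, nodes, 0.5)    # [[0, 1], [2, 3]]
cores_low = alpha_cores(p, nodes, 0.3)    # [[0, 1, 2, 3]]
```

In this toy example the cores obtained at a higher threshold are contained in those obtained at a lower one.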
We will suppose hereafter that N is large enough, so that $P^N \approx P$. This
hypothesis can be made since previous works have indeed shown a fast convergence
of the $P^N$ matrix when N grows [29, 32]. We will therefore concentrate on the α
parameter, which has a strong influence on the number and size of the consensual
communities, and furthermore makes it possible to obtain a hierarchical structure of
consensual communities. Indeed, α1-cores are included in α2-cores if α1 > α2, i.e.
α1-cores are sub-consensual communities of α2-cores.
2.2 Experiments
For our experiments and the proof hereafter, we will use three different quality
functions. First, the classical Newman-Girvan modularity function Q [25], which is
defined by
$$Q = \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] X_{ij}, \qquad (1)$$
where $A_{ij}$ represents the weight of the edge between i and j (0 if $ij \notin E$),
$k_i = \sum_{j \in V} A_{ij}$ is the sum of the weights of the edges attached to node i,
$X_{ij} = 1$ if i and j are in the same community and 0 otherwise, and
$m = \frac{1}{2} \sum_{i,j \in V} A_{ij}$.
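As an illustration, Eq. (1) can be evaluated directly from an adjacency matrix. The sketch below follows the equation as written, without the 1/(2m) prefactor that is often used to normalize modularity; the helper names and the toy graph are assumptions made for the example.

```python
def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected, unweighted graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A

def modularity(A, community):
    """Q = sum over same-community pairs (i, j) of A_ij - k_i k_j / (2m)."""
    n = len(A)
    k = [sum(row) for row in A]
    two_m = sum(k)
    return sum(A[i][j] - k[i] * k[j] / two_m
               for i in range(n) for j in range(n)
               if community[i] == community[j])

# Two triangles joined by the link 2-3, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = adjacency(6, edges)
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # 12 - 98/14 = 5
q_single = modularity(A, [0] * 6)             # 14 - 196/14 = 0
```

Dividing `q_split` by 2m = 14 gives the familiar normalized value 5/14.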
Then, the balanced modularity function B [7] which takes into account both the
links inside communities and the non-links between communities. It is defined as
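$$B = \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] X_{ij} + \sum_{i,j \in V} \left[ \bar{A}_{ij} - \frac{(n - k_i)(n - k_j)}{n^2 - 2m} \right] \bar{X}_{ij}, \qquad (2)$$
with $\bar{A}_{ij} = W - A_{ij}$ the non-link between nodes i and j (where $W = \max_{i,j \in V} A_{ij}$)
and $\bar{X}_{ij} = 1 - X_{ij}$.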
Finally, the deviation to indetermination function D [1, 16, 21], defined as
$$D = \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i}{n} - \frac{k_j}{n} + \frac{2m}{n^2} \right] X_{ij}. \qquad (3)$$
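The balanced modularity (2) and the deviation to indetermination (3) can be evaluated in the same direct way. This is a sketch for small graphs given as adjacency matrices, following the definitions term by term; the function names and the toy graph are assumptions made for the example.

```python
def balanced_modularity(A, community):
    """Eq. (2): links inside communities plus non-links between communities."""
    n = len(A)
    k = [sum(row) for row in A]
    two_m = sum(k)
    W = max(max(row) for row in A)        # largest weight, used for non-links
    total = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:
                total += A[i][j] - k[i] * k[j] / two_m
            else:
                total += (W - A[i][j]) - (n - k[i]) * (n - k[j]) / (n * n - two_m)
    return total

def deviation_to_indetermination(A, community):
    """Eq. (3): sum of A_ij - k_i/n - k_j/n + 2m/n^2 over same-community pairs."""
    n = len(A)
    k = [sum(row) for row in A]
    two_m = sum(k)
    return sum(A[i][j] - k[i] / n - k[j] / n + two_m / (n * n)
               for i in range(n) for j in range(n)
               if community[i] == community[j])

# Two triangles joined by the link 2-3, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
community = [0, 0, 0, 1, 1, 1]
b_value = balanced_modularity(A, community)            # 5 + (16 - 242/22) = 10
d_value = deviation_to_indetermination(A, community)   # 12 - 7 - 7 + 7 = 5
```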
The non-deterministic algorithm A we use here is a generic version of the Louvain
algorithm. The Louvain algorithm is a local search method which aims at maximizing
the value of the modularity function (for more details, see [3]), and its generic
version allows other quality functions to be used. The Louvain method is
currently the fastest algorithm to find communities on complex networks (it takes less
than five seconds on networks with more than one million nodes and edges)1: it
is therefore well suited to being run many times (typically with N = 100 or more),
which justifies our choice of this algorithm. We will write Louvain-Modularity (resp.
Louvain-Balanced, Louvain-Deviation) to indicate that we use the generic version
of Louvain with the Newman-Girvan modularity (resp. the balanced modularity, the
deviation to indetermination function).
Figure 1 shows the consensual communities identified by our algorithm on
Zachary's karate club [34] friendship network using Louvain-Modularity. We can
see in this example that different values of α give different non-trivial partitions of
the network. Similar non-trivial partitions are found on other real networks.
2.3 Properties of Consensual Communities
We computed consensual communities of complex networks of different sizes from
different domains, including a collaboration network [24], an email network [14]
and a snapshot of the Internet (created by M. Newman, unpublished). Table 1
summarizes the size of these networks. As Fig. 2 shows, a large threshold, e.g., α = 1,
leads to tiny consensual communities, most of which consist of a single
node. On the contrary, with a threshold equal to zero, we have a single consensual
community (if the original graph is connected), and with α < 0.5, we generally
have a giant consensual community containing the majority of nodes. When
the threshold increases, this giant consensual community splits into smaller
consensual communities. In the Internet and email networks, however, even with α equal to
1, we still have a large consensual community containing approximately 10 % of the
nodes (see Fig. 2). Moreover, the decrease after the splitting of the single consensual
community up to α = 1 is smooth.
1 Also, an execution takes less than one hour on a network with more than one billion nodes and
links.
Fig. 1 Consensual communities for Zachary's network using three different thresholds with
Louvain-Modularity. The shape of the nodes (circle/square) is the manual classification made
by W. Zachary. a α = 0.32. b α = 0.62. c α = 1.00
Fig. 2 Average (left) and maximal (right) size of consensual communities versus threshold.
a Using Louvain-Modularity. b Using Louvain-Balanced. c Using Louvain-Deviation
Table 1 Number of nodes and number of links of the four networks used in this paper
Network           Karate club   Email   Collaboration   Internet
Number of nodes   34            1,133   13,861          22,963
Number of links   78            5,451   44,619          48,436
This smooth decrease can also be understood through the study of the distribution
of the values inside the P matrix. Figure 3 shows the pij distributions for three
networks. We observe that, although most pairs are nearly always separated and a fair
number are always grouped together, there are also some pairs of nodes which are
sometimes together and sometimes separated. This explains why significant
consensual communities appear for a wide range of values of α.
Fig. 3 pij complementary cumulative distribution for three real networks using
Louvain-Modularity
These results show that the notion of consensual communities makes
sense and that they can be used to detect different levels of communities with different
quality functions. We will now show that they can also be used to show the absence
of a real community structure in random graphs.
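For reference, a complementary cumulative distribution of pij values such as the one plotted in Fig. 3 can be computed as sketched below. The convention used here (fraction of pairs with a value greater than or equal to v) and the sample values are assumptions made for the example, not data from the chapter.

```python
def ccdf(values):
    """Return (v, fraction of entries >= v) for each distinct value v, in increasing order."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for idx, v in enumerate(ordered):
        if idx == 0 or v != ordered[idx - 1]:
            points.append((v, (n - idx) / n))
    return points

# Hypothetical p_ij values for six pairs of nodes: three pairs never grouped,
# one grouped half of the time, two always grouped.
sample = [0.0, 0.0, 0.0, 0.5, 1.0, 1.0]
points = ccdf(sample)
```

A distribution of the kind described above shows up as a large drop right after zero, a plateau for intermediate values, and a nonzero tail at 1.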
3 Consensual Communities in Random Graphs
In random graphs, all pairs of nodes have the same probability of being connected.
Hence, there should be no preferential linking inducing specific and identifiable
groups of nodes. Therefore, we could conclude that there is no community structure
in random graphs. However, several studies show that it is possible to find partitions
with high modularity in random graphs [15, 26]. Indeed, the concentration of links
fluctuates in generated graphs, which means that subsets of nodes with a density
larger than the global density can appear. The phenomenon is even more pronounced in
regular or quasi-regular graphs, like trees, tori or grid graphs, in which community
detection algorithms can also find partitions with good modularity [23].
A good algorithm for community detection should indicate the presence or
absence of a community structure and recognize that, in random graphs, the
communities which are obtained are not real communities.
We will now show that random graphs do not exhibit any non-trivial consensual
community structure. For that, we will use two different random graph models:
the classical Erdős-Rényi model [10], which is used to mimic the number of nodes
and links only, and the configuration model [2, 22], which also respects the full
degree distribution. We will conclude this section with random graphs with known
community structure generated using the LFR benchmark [18].
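The two null models can be sketched in a few lines. This is an illustrative implementation, not the generator used for the experiments: the Erdős-Rényi variant fixes the number of links exactly, and the configuration model uses simple stub matching in which self-loops and repeated links are dropped, so the requested degrees are only upper bounds.

```python
import random

def erdos_renyi(n, m, rng):
    """Random graph with n nodes and exactly m links chosen uniformly at random."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return set(rng.sample(pairs, m))

def configuration_model(degrees, rng):
    """Stub matching: shuffle the half-links and pair them two by two.

    Self-loops and repeated links are discarded (simplifying assumption).
    """
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return edges

rng = random.Random(42)
er_edges = erdos_renyi(10, 15, rng)
requested = [3, 2, 2, 1, 1, 1]
cm_edges = configuration_model(requested, rng)
realized = [sum(1 for e in cm_edges if v in e) for v in range(len(requested))]
```

To mimic a real network, `n` and `m` (or the degree sequence) would be taken from that network.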
3.1 Values of pij in Random Graphs
First of all, Fig. 4 shows the distribution of pij values for Erdős-Rényi random graphs
with different values of the number of nodes and the average degree. We observe
a high concentration of pij around an average value (around 0.1 for large graphs using
realistic values of the average degree), which is very different from the distributions
observed on real graphs, where the maximum of the distribution is at zero
(see Fig. 3). We further observe in Fig. 4b that large values of pij appear. However,
the concentration of values increases both with the size of the network and with the
average degree, and these large values are therefore less and less frequent.
This concentration of values implies that even if partitions with a good modularity
can be found in random graphs, these partitions are very different from one another,
since most pairs are classified in the same community only once every ten runs.
Therefore, no real similarities can be found.
3.2 Comparison with Real Graphs
To compare real and random networks more precisely, we generated random graphs
from the Erdős-Rényi model (resp. configuration model) that have the same size and
the same average degree (resp. the same degree distribution) as two real networks.
In Fig. 5, the Erdős-Rényi model shows no pair of nodes with pij = 0, which means
that all pairs of nodes have been grouped together at least once during 1,000 runs
of the Louvain algorithm, regardless of their position in the network. The same is
observed for the configuration model.
Conversely, there is nearly no pair of nodes which are always grouped together,
except for the leaves (nodes of degree 1) of the network, which are always grouped
with their only neighbor. Nodes of degree 1 are very common with
the configuration model, since the degree distributions of the real networks are power-law
shaped and therefore contain many nodes of degree 1. The same is observed for the
Erdős-Rényi model, since the real average degree is small and nodes of degree 1 are
not so uncommon in generated graphs. This explains the small increase observed
for the pij values around 1.
Furthermore, as predicted by the experiments on Erdős-Rényi random networks
(Fig. 4), the maximum of the values is around 0.1.
There are two direct consequences of this distribution: (i) for very low values of the
threshold, there is a single consensual community comprising all nodes, since there
is no value close to zero and therefore the virtual graph contains all links, and (ii) for
large values of the threshold, the virtual graph contains almost no links and therefore
high-threshold consensual communities are reduced to single nodes. Interestingly, in
random networks, there is a sharp transition (see Fig. 6), at a threshold value around
0.4, between the situation where one single consensual community is present and the
intermediate threshold values where several consensual communities are present;
this transition is not present in real networks.
This phase transition cannot be directly deduced from the previous remarks, and
we give further arguments for its existence in Sect. 4.
Fig. 4 Distribution of the pij averaged over 100 random Erdős-Rényi graphs, for different average
degrees and numbers of nodes n. a Average degree 20 and different values of n. b n = 1,000 and
different values of the average degree
Fig. 5 pij distribution for two real networks together with Erdős-Rényi and configuration model
random graphs with the same size. a Email network. b Collaboration network
Fig. 6 Average size of consensual communities versus threshold for a real network and two
random networks generated with the Erdős-Rényi and the configuration models. a Using
Louvain-Modularity. b Using Louvain-Balanced. c Using Louvain-Deviation
3.3 Random Graphs with Communities
To observe the transition from a graph with clear communities to a random
graph, we used the four groups test, which is a random graph with 4 communities
of 32 nodes [13], generated using [18]. Each node has 16 − x links towards its
community and x links outside. For x = 0, the graph is composed of 4 independent
random graphs with high density. Then, when x grows, the communities are less and
less defined and, for x ≈ 11.7, the graph is purely random. Finally, above this value,
each node has fewer links towards its community than outside. Classical community
detection algorithms are very successful at identifying communities for small values
of x, up to 6 in general. Above 6, they start to fail to identify the groups.
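A graph of this kind can be sketched with independent link probabilities, so that each node has 16 − x internal and x external links on average rather than exactly. This is an assumption made to keep the example short; it is not the LFR generator [18] used in the experiments, and the function name is illustrative.

```python
import random

def four_groups(x, rng, groups=4, size=32, degree=16):
    """Planted-partition graph: expected (degree - x) internal and x external links per node."""
    n = groups * size
    p_in = (degree - x) / (size - 1)      # a node has size - 1 possible internal neighbors
    p_out = x / (n - size)                # and n - size possible external neighbors
    group = [v // size for v in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            prob = p_in if group[i] == group[j] else p_out
            if rng.random() < prob:
                edges.append((i, j))
    return edges, group

edges, group = four_groups(5, random.Random(0))
internal = sum(1 for i, j in edges if group[i] == group[j])
external = len(edges) - internal
```

With x = 0 the four groups are disconnected from one another, and increasing x progressively blurs them.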
Figure 7a shows the significance of consensual communities using
Louvain-Modularity. As we can see:
- for x = 5, 4 groups of nodes are clearly identified in the range [0.16, 0.87[ and a
  partition into 3 communities (one of 64 nodes and two of 32 nodes) is found in the
  range [0.02, 0.16[;
- for x = 6, a grouping into two communities (each containing 64 nodes) is obtained
  in [0.05, 0.3[, then one of them is split, giving three communities in [0.3, 0.55[, and
  four groups are obtained in [0.55, 0.6[;
- for x = 7, three communities are identified in [0.26, 0.33[ and four are identified
  in [0.33, 0.67[;
- for x = 8, two groups are found in [0.44, 0.45[, three in [0.45, 0.5[ and four in
  [0.5, 0.57[.
Note that these groups are not always the correct groups, since a few nodes can
be misclassified. We can see in Fig. 7b, c that these phenomena are similar for
Louvain-Balanced and Louvain-Deviation.
The main conclusion is that, as the graph becomes more and more random, the intervals
in which the communities (or merged communities) are found become narrower.
4 Existence of a Phase Transition
We recall that for a given threshold α, α-cores are defined as connected components
of the weighted graph G′ whose adjacency matrix is P, in which we have deleted
weighted links with a value less than this threshold α. In random graphs, we observe
that a small threshold gives one consensual community containing all the nodes of
the graph. Then, after a rapid phase transition (depending on the choice of α), we obtain
only trivial consensual communities, each containing a single node.
We now give arguments to show the existence of this phase transition.
Throughout the proof, we make extensive use of the fact that graphs are random and thus all
connections appear independently. The assumptions made in some cases may be related
to classical mean field assumptions in statistical physics.
Fig. 7 Average size of consensual communities for a random network with 4 communities of 32
nodes, 16 links per node on average and a variable number of links pointing out of the community.
a Using Louvain-Modularity. b Using Louvain-Balanced. c Using Louvain-Deviation
4.1 Values of pij for Two Connected Nodes Are Highly Concentrated Around a Mean Value
Since we are considering random graphs, we can suppose that nodes (and their
neighbors) in the input graph are similar. Thus, regardless of the results of the
community detection algorithm used, each node will be, in expectation, in the same
community as a proportion p of its neighbors. Moreover, the randomness of the graph
implies that this proportion p concerns neighbors which are chosen randomly and
independently for each run of the algorithm. Equivalently, all pij for connected
pairs are approximately equal to p.
Of course, this argument holds only if we assume that all elements in the graph
are random. Indeed, the existence of correlations or specific properties of nodes can
invalidate it. This is for instance the case for modularity applied to graphs with very low
average degree. In particular, a node of degree 1 is always placed in the community
of its unique neighbor, and the above argument cannot be applied. The
complete absence of correlations is therefore only a valid assumption for large networks
with a sufficiently large average degree.
Figure 8a shows an experiment on a 10,000-node Erdős-Rényi random graph
with different average degrees. We can observe that when the average degree
increases, the effects of low-degree nodes disappear and the distribution of pij is
much more concentrated.
4.2 Values of pij for Two Connected Nodes Are Higher than Those
of Two Non-connected Nodes
In Fig. 8a (bottom), we can see that the distribution is in fact composed of two distinct
modes. These two modes correspond respectively to connected pairs of nodes, i.e.
links, and non-connected pairs of nodes. Figure 8b shows the decomposition into these
two distributions. We can see that pij values are higher for connected nodes
than for non-connected nodes.
Two nodes i and j that are not connected and have a nonzero pij were necessarily
classified at least once in the same community. As communities are necessarily
connected subgraphs of the input graph, there exists a path connecting them
with nonzero puv for every pair of consecutive nodes u and v on the path. For instance,
i and j can have a common neighbor k such that pik and pjk are positive.
Let us assume, to simplify, that nodes i and j have a unique common neighbor
k. As the graph is purely random, we can suppose that the probability that i and k
are placed in the same community is pik = p, and that the probability that k and j are
in the same community is pkj = p. We also suppose that these events are independent,
because edges linking i, j and k can be inside as well as between different communities,
without any correlation. Thus, for i and j to be classified in the same community, these two
events must occur simultaneously. Therefore,
$$p_{ij} = p_{ik}\, p_{kj} = p^2.$$
Let us note that these calculations do not make sense in complex networks, since the
independence assumption is clearly unfounded, in particular because of the existence
of strong local correlation as measured by the clustering coefficient.
In the case where nodes i and j have no common neighbor but are connected by
a longer path in the input graph, the same reasoning gives
$$p_{ij} = \prod_{uv \in P} p_{uv} = p^t,$$
where P is a shortest path of length t linking i and j. This calculation holds if i and j
have only one common neighbor.
Fig. 8 pij distribution for a random graph with different average degree (5 and 100) and 10,000
nodes. The curve with all pairs is nearly completely overlapped by the two curves, except for
average degree 5. a Global distribution (all pairs of nodes). b Distinction between connected and
non-connected pairs of nodes
It is easy to compute pij in the case where the two nodes have z neighbors in common.
We obtain
$$p_{ij} = 1 - (1 - p^2)^z,$$
which corresponds to 1 minus the probability that i and j are not grouped together
through any of their common neighbors. However, if we assume large graphs with low
average degree, the probability of having more than one common neighbor (given that
we already have one) is very low.2
For these reasons, we can assume that values of pij are higher for connected pairs
than for non-connected pairs.
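As a small numerical illustration of these expressions, the sketch below evaluates them for p = 0.4, a value chosen only because it is close to the transition threshold observed in Sect. 3; the function names are hypothetical and the independence assumptions are those stated above.

```python
def p_via_path(p, t):
    """Non-connected pair joined by a single shortest path of length t: p ** t."""
    return p ** t

def p_via_common_neighbors(p, z):
    """Non-connected pair with z common neighbors: 1 - (1 - p^2) ** z."""
    return 1.0 - (1.0 - p * p) ** z

p = 0.4
one_neighbor = p_via_common_neighbors(p, 1)    # p^2 = 0.16
two_neighbors = p_via_common_neighbors(p, 2)   # 1 - 0.84^2 = 0.2944
three_hops = p_via_path(p, 3)                  # 0.4^3 = 0.064
```

For one or two common neighbors, the value stays below p, in line with the claim that non-connected pairs have lower pij than connected ones.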
4.3 Existence of a Phase Transition
If we suppose that all connected pairs (i, j) have pij = p, and that non-connected
nodes u and v have a lower puv, then, for a threshold
below p, pairs of connected nodes suffice to provide connectivity, and as all connected
pairs have nearly the same pij, we have only one consensual community containing
all the nodes of the input graph (for large enough values of the average degree,
the graph is connected; otherwise we have as many consensual communities as the
number of connected components).
Conversely, since the distribution of pij values for connected pairs is strongly
centered on the value p, any value of the threshold above p will destroy the
consensual communities very quickly and we obtain trivial consensual communities, each
containing only one node.
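The argument can be checked on an idealized example in which every connected pair has exactly pij = p and a pair at distance t has p ** t. This is a toy model under the mean-field assumptions above, with a ring graph and p = 0.4 chosen arbitrarily; because it ignores the spread of the real distribution, the transition it produces is perfectly sharp.

```python
from collections import deque

def distances_from(adj, source):
    """Breadth-first search distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def idealized_matrix(adj, p):
    """P[i][j] = p ** distance(i, j) for i != j, and 0 on the diagonal."""
    P = {}
    for i in adj:
        dist = distances_from(adj, i)
        P[i] = {j: (p ** dist[j] if j != i and j in dist else 0.0) for j in adj}
    return P

def count_cores(P, alpha):
    """Number of connected components when only pairs with P >= alpha are kept."""
    seen, cores = set(), 0
    for start in P:
        if start in seen:
            continue
        cores += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for u, w in P[v].items():
                if u != v and u not in seen and w >= alpha:
                    seen.add(u)
                    stack.append(u)
    return cores

n = 10
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
P = idealized_matrix(ring, 0.4)
below = count_cores(P, 0.39)   # links survive: a single core
above = count_cores(P, 0.41)   # no pair survives: n single-node cores
```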
4.4 The Proportion of Intra-community Links Is Equal to p
Finally, we can compute the value of this threshold. Let us assume that k % of links
are intra-community links. Then, this means that for each execution of the algorithm,
one node u will be put in expectation with k % of its neighbors, or equivalently each
neighbor will be with the given node u for k % of the executions. This value k is thus
the value of pij corresponding to the p that we have used so far.
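The quantity discussed here, the fraction of links that fall inside communities for one partition, can be computed as sketched below; averaging it over many runs would give an empirical estimate of p under the assumptions of this section. The function name and the toy graph are illustrative.

```python
def intra_fraction(edges, community):
    """Fraction of links whose two endpoints are in the same community."""
    inside = sum(1 for u, v in edges if community[u] == community[v])
    return inside / len(edges)

# Two triangles joined by the link 2-3, split into the two triangles:
# 6 of the 7 links are intra-community links.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
fraction = intra_fraction(edges, [0, 0, 0, 1, 1, 1])
```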
Computing exactly the value of p is an open problem that seems to be difficult [15].
However, numerical studies (see Fig. 9) show that it decreases with the graph density,
but the exact decrease pattern is quite complex.
2 Assumptions in classical mean field make extensive use of the fact that a random graph whose
size tends to infinity is locally a tree.
Fig. 9 Proportion of internal links for a random graph. a With 1,000 nodes. b With 10,000 nodes
5 Conclusion
We have shown here that consensual communities make it possible to distinguish graphs with
a real community structure from graphs where the community structure arises from
fluctuations. To do so, we have shown that consensual communities in random
graphs are trivial, containing either all the nodes of the graph or one node each.
These observations have been made using different quality functions optimized using
a generic version of the Louvain algorithm.
Some future work remains to further understand the absence of non-trivial
consensual communities in random graphs. First, it is necessary to compute the
exact value of the threshold as a function of the parameters (size and average degree)
of the Erdős-Rényi graph. For graphs generated from the configuration model, the
task is more difficult, since there are many degree-one nodes which the modularity
function requires to be placed in the community of their only neighbor. Such
local correlations are harder to take into account.
Another perspective would be to make a similar study on regular graphs, in which
we know that no community structure exists. In particular, for regular grids
and tori, previous studies have shown that a high modularity partition can be found,
but the regularity of such networks naturally allows many different partitions which
are simply translations of any given partition. Intuitively, this means that many high quality
partitions can be found and that non-trivial consensual communities should not exist.
Acknowledgments We would like to thank the anonymous referees for their insightful comments
and suggestions, which have helped to improve the presentation of this paper. This work is partially
supported by the DynGraph ANR-10-JCJC-0202 and CODDDE ANR-13-CORD-0017-01 projects
of the French National Research Agency.
References
1. Ah-Pine J, Marcotorchino JF (2007) Statistical, geometrical and logical independences between
categorical variables symposium. In: Proceedings of the international conference on applied
stochastic models and data analysis (ASMDA). Chania, Greece
2. Bender EA, Canfield ER (1978) The asymptotic number of labeled graphs with given degree
sequences. J Comb Theory A 24:296–307
3. Blondel V, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in
large networks. J Stat Mech: Theory Exp 2008:P10008
4. Brandes U, Delling D, Gaertler M, Görke R, Hoefer M, Nikoloski Z, Wagner D (2007) On
finding graph clusterings with maximum modularity. In: Graph-theoretic concepts in computer
science. Springer, Berlin, pp 121–132
5. Campigotto R, Conde-Céspedes P, Guillaume JL (2014) A generalized and adaptive method
for community detection. Technical report, Université Pierre et Marie Curie. arXiv:1406.2518
6. Campigotto R, Guillaume JL, Seifi M (2013) The power of consensus: random graphs have no
communities. In: Proceedings of the 5th IEEE/ACM international conference on advances in
social networks and mining (ASONAM). Niagara Falls, Canada, pp 272–276
7. Conde-Céspedes P, Marcotorchino JF (2013) Comparison of linear modularization criteria of
networks using relational metric. In: 45èmes Journées de Statistique, SFdS. Toulouse, France
8. Diday E (1973) The dynamic clusters method and optimization in non-hierarchical clustering.
Optim Tech, pp 241–258
9. Diday E (1973) The dynamic clusters method in non-hierarchical clustering. Int J Parallel Prog
2:61–88
10. Erdős P, Rényi A (1959) On random graphs. Publ Math 6:290–297
12. Gfeller D, Chappelier J, De Los Rios P (2005) Finding instabilities in the community structure
of complex networks. Phys Rev E 72(5):056135
Natl Acad Sci USA 99(12):7821–7826
14. Guimerà R, Danon L, Diaz-Guilera A, Giralt F, Arenas A (2003) Self-similar community
structure in a network of human interactions. Phys Rev E 68(6):065103
15. Guimerà R, Sales-Pardo M, Amaral LAN (2004) Modularity from fluctuations in random graphs
and complex networks. Phys Rev E 70(2):025101
16. Janson S, Vegelius J (1982) The J-index as a measure of association for nominal scale response
agreement. Appl Psychol Meas 6:111–121
17. Karrer B, Levina E, Newman M (2008) Robustness of community structure in networks. Phys
Rev E 77(4):046119
18. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community
detection algorithms. Phys Rev E 78(4):046110
19. Lancichinetti A, Fortunato S (2012) Consensus clustering in complex networks. Sci Rep 2(336)
20. Mancoridis S, Mitchell B, Rorres C (1998) Using automatic clustering to produce high-level
system organizations of source code. In: Proceedings of the 6th international workshop on
program comprehension, pp 45–53
21. Marcotorchino JF (2013) Optimal transport, spatial interaction models and related problems,
impacts on relational metrics, adaptation to large graphs and networks modularity
22. Molloy M, Reed B (1995) A critical point for random graphs with a given degree sequence.
Random Struct Algorithms 6(2–3):161–180
23. de Montgolfier F, Soto M, Viennot L (2011) Asymptotic modularity of some graph classes. In:
ISAAC, pp 435–444
24. Newman M (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci
98(2):404–409
25. Newman M, Girvan M (2004) Finding and evaluating community structure in networks. Phys
Rev E 69(2):026113
26. Reichardt J, Bornholdt S (2006) Statistical mechanics of community detection. Phys Rev E
74(1):016110
27. Rosvall M, Bergstrom C (2010) Mapping change in large networks. PLoS One 5(1):e8694
28. Seifi M, Guillaume JL (2012) Community cores in evolving networks. In: Proceedings of the
mining social network dynamic 2012 workshop (MSND). Lyon, France, pp 1173–1180
29. Seifi M, Guillaume JL, Junier I, Rouquier JB, Iskrov S (2012) Stable community cores in
complex networks. In: 3rd international workshop on complex networks. Melbourne, Florida
30. Seshadhri C, Kolda TG, Pinar A (2012) Community structure and scale-free collections of
Erdős-Rényi graphs. Phys Rev E 85:056109
31. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach
Intell 22:888–905
32. Wang Q, Fleury E (2010) Uncovering overlapping community structure. In: 2nd international
workshop on complex networks, pp 176–186
33. Wang Q, Fleury E (2009) Detecting overlapping communities in graphs. In: European
conference on complex systems (ECCS). Warwick
34. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthr
Res 33:452–473
35. Zahn CT (1964) Approximating symmetric relations by equivalence relations. SIAM J Appl
Math 12:840–847
Link Prediction in Heterogeneous
Collaboration Networks
Xi Wang and Gita Sukthankar
Abstract Traditional link prediction techniques primarily focus on the effect
of potential linkages on the local network neighborhood or the paths between nodes.
In this article, we study both supervised and unsupervised link prediction in networks
where instances can simultaneously belong to multiple communities, engendering
different types of collaborations. Links in these networks arise from heterogeneous
causes, limiting the performance of predictors that treat all links homogeneously. To
solve this problem, we introduce a new supervised link prediction framework, Link
Prediction using Social Features (LPSF), which incorporates a reweighting scheme
for the network based on nodes' features extracted from patterns of prominent
interactions across the network. Experiments on coauthorship networks demonstrate that
the choice for measuring link weights can be critical for the link prediction task.
Our proposed reweighting method in LPSF better expresses the intrinsic relation-
ship between nodes and improves prediction accuracy for supervised link predic-
tion techniques. We also compare the unsupervised performance of the individual
features used within LPSF with two new diffusion-based methods: Link Prediction
using Diffusion Process (LPDP) and Link Prediction using Diffusion Maps (LPDM).
Experiments demonstrate that LPDP is able to identify similar node pairs, even far
away ones, that are connected by weak ties in the coauthorship network using the
diffusion process; however, reweighting the network has little impact on prediction
performance.
Keywords Link prediction · Social features · Random walk · Collaborative networks · Heterogeneous ties
X. Wang · G. Sukthankar (B)
Department of EECS, University of Central Florida, 4000 Central Florida Blvd,
Orlando, FL 32816, USA
e-mail: gitars@eecs.ucf.edu
X. Wang
e-mail: xiwang@eecs.ucf.edu

DOI 10.1007/978-3-319-12188-8_8
1 Introduction
In many social media tools, link prediction is used to detect the existence of
unacknowledged linkages in order to relieve the users of the onerous chore of pop-
ulating their personal networks. The problem can be broadly formulated as follows:
given a disjoint node pair (x, y), predict if the node pair has a relationship, or in
the case of dynamic interactions, will form one in the near future [39]. Often, the
value of the participants' experience is proportional to the size of their personal network,
so bootstrapping the creation of social networks with link prediction can lead
to increased user adoption. Conversely, poor link prediction can irritate users and
detract from their initial formative experiences.
Although in some cases link predictors leverage external information from the
user's profile or other documents, the most popular link predictors focus on modeling
the network using features intrinsic to the network itself, and measure the likelihood
of connection by checking the proximity in the network [14, 30]. Generally, the
similarity between node pairs can be directly measured by neighborhood methods
such as the number of shared neighbors [24] or subtly measured by path methods [21].
One weakness with network-based link prediction techniques is that the links
are often treated as having a homogeneous semantic meaning, when in reality the
underlying relationship represented by a given link could have been engendered
by different causal factors. In some cases, these causal factors are easily deduced
using user-supplied meta-information such as tags or circles, but in other cases the
provenance of the link is not readily apparent. In particular, the meaning of links
created from overlapping communities is difficult to interpret, necessitating the
development of heterogeneous link prediction techniques.
In the familiar example of scientific collaboration networks, authors usually have
multiple research interests and seek to collaborate with different sets of co-authors
for specific research areas. For instance, Author A cooperates with author B on
publishing papers in machine learning conferences whereas his/her interaction with
author C is mainly due to shared work in parallel computation. The heterogeneity in
connection causality makes the problem of predicting whether a link exists between
authors B and C more complicated. Additionally, Author A might collaborate with
author D on data mining; since data mining is an academic discipline closely related
to machine learning, there is overlap between the two research communities which
indicates that the linkage between B and D is more likely than a connection between
B and C. In this article, we detect and leverage the structure of overlapping commu-
nities toward this problem of link prediction in networks with multiple distinct types
of relationships.
Community detection utilizes the notion of structural equivalence which refers
to the property that two actors are similar to one another if they participate in equiva-
lent relationships [25]. Inspired by the connection between structural equivalence and
community detection, Soundarajan and Hopcroft proposed a link prediction model
for non-overlapping communities; they showed that including community infor-
mation can improve the accuracy of similarity-based link prediction methods [32].
Since community information is not always readily available, community detection
techniques can be applied to partition the network into separate groups [2]. In this
article, we present a new link prediction framework for networks with overlapping
communities that accounts for the hidden community information embedded in a set
of heterogeneous connections.
When a person's true affiliations are unknown, our proposed method, LPSF [38],
models link heterogeneity by adding weights to the links to express the similarities
between node pairs based on their social features. These social features are calculated
from the network topology using edge clustering [34] and implicitly encode the
diversity of the nodes' involvement in potential affiliations. The weights calculated
from the social features provide valuable information about the true closeness of
connected people, and can also be leveraged to predict the existence of the unobserved
connections. In this article, different similarity-based prediction metrics were adapted
for use on a weighted network, and the corresponding prediction scores are used as
attributes for training a set of supervised link prediction classifiers. Experiments on
a real-world scientific collaboration dataset (DBLP) demonstrate that LPSF is able
to outperform homogeneous predictors in the unweighted network.
In Sect. 5, we further compare the performances of unsupervised link prediction
benchmarks used in LPSF with two proposed diffusion-based link predictors (LPDP
and LPDM). Recently, the use of random walk models for solving link prediction
problems in coauthorship networks has attracted interest due to the finding that
researchers are more interested in establishing long-range weak ties (collaborations)
rather than strengthening their well-founded interactions [3]. By capturing the
underlying proximities of distant node pairs, LPDP achieves superior link
prediction performance on the DBLP datasets.
2 Related Work
The link prediction problem has drawn increased attention over the past few years
[5, 29, 33]. A variety of techniques for addressing this problem have been explored
including graph theory, metric learning, statistical relational learning, matrix fac-
torization, and probabilistic graphical models [17, 18, 35, 39]. This chapter is an
extended version of our prior work on supervised link prediction models [38].
Most link prediction models assume that the links in the network are homogeneous.
In this work, we focus on predicting links in link-heterogeneous networks such as
coauthorship collaboration networks, which can be modeled as networks that contain
different types of collaboration links connecting authors. From a machine learning
point of view, link prediction models can be categorized as being supervised or unsu-
pervised. Hasan et al. studied the use of supervised learning for link prediction in
coauthorship networks [13]. They identify a set of link features that are key to the
performance of their supervised learner including (1) proximity features, such as
keywords in research papers, (2) aggregated features, obtained from an aggregation
operator, and (3) topological features. The combination of these features showed
effective prediction performance on two collaborative network datasets. Popescul
et al. introduced an alternate approach to generating features. First, they represent
the data in a relational format, generate candidate features through database queries,
select features using statistical model selection criteria, and finally perform logistic
regression using the selected features for classification [28]. Unlike these methods,
in this work, our proposed LPSF only utilizes network information and does not use
document properties; we believe that our proposed social features could be used in
conjunction with node features, when they are available, to improve classification
performance.
Unsupervised prediction methods, due to their simplicity, have remained popular
in the link prediction literature but have been shown to be very sensitive to underlying
network properties, such as imbalance in the size of network communities, and
experience difficulty adapting to dynamic interdependencies in the network [18].
Davis et al. proposed an unsupervised extension of the common Adamic/Adar
method to predict heterogeneous relationships in multi-relational networks [8].
Specifically, the proposed multi-relational link prediction (MRLP) method applies a
weighting scheme for different edge type combinations. The weights are determined
by counting the occurrence of each unique 3-node sub-structure in the network,
traditionally termed a triad census. Supervised link prediction is employed after
converting the heterogeneous network into a feature representation.
Sun et al. proposed a path-based relationship prediction model, PathPredict, to
study the coauthorship prediction problem in heterogeneous bibliographic
networks [33]. First, the meta path-based topological features are symmetrically
extracted from the network using measures such as path count and random walk,
around the given meta paths. The meta path captures the composition relation over
the heterogeneous networks. Logistic regression is then used to learn the weights
associated with different topological features that best predict co-author relation-
ships. Lee and Adorna proposed a random walk-based link prediction algorithm on a
modified heterogeneous bibliographic network where all edges across heterogeneous
objects in the network are weighted by using a combination of different importance
measures [16]. Different to their work, our main focus in this article is weighting the
heterogeneous collaboration links between authors.
Relatively few works focus on link prediction tasks in weighted networks. De Sá
and Prudêncio investigated the use of weights to improve the performance of supervised
link prediction [9]. In their work, they extend eight benchmark unsupervised
metrics for weighted networks, and adopt prediction scores as node pairs' attributes
for a supervised classification model. Murata et al. proposed a similar unsupervised
metric that makes use of the weights of the existing links [23]; this outperforms
traditional unsupervised methods especially when the target social networks are suf-
ficiently dense. Experiments conducted on two real-world datasets (Yahoo! Answers
and Windows Live QnA dataset) indicate that the accuracy of link prediction can be
improved by taking weights of links into consideration. In those datasets, the weights
of the links in the network are already available, in contrast to our work where we
calculate the link weights based on node pairs' social features extracted from an
unweighted network.
Recently, some researchers started applying random walk models to solve the link
prediction problem. For instance, Backstrom and Leskovec developed a supervised
random walk algorithm that combines the information from the network structure
with node and edge level attributes and evaluated their method on coauthorship net-
works extracted from arXiv. The edge weights are learned by a model that optimizes
the objective function such that more strength is assigned to new links that a random
walker is more likely to visit in the future [3]. However, they only focus on predicting
links to the nodes that are 2 hops from the seed node. Liu et al. proposed a similarity
metric for link prediction based on a type of local random walk, the Superposed
Random Walk (SRW) index [19]. By taking into account the fact that in most real networks
nodes tend to connect to nearby nodes rather than ones that are far away, SRW
continuously releases the walkers at the starting point, resulting in a higher similarity
between the target node and the nearby nodes. This assumption does not appear to hold
in DBLP and other scientific collaboration datasets. Similarly, Yin et al. estimated
link relevance using the random walk algorithm on an augmented social graph with
both attribute and structure information [41]. Their framework leverages both global
and local influences of the attributes. Unlike their model, our diffusion-based
techniques LPDP and LPDM rely only on network structural information, without
considering any of the nodes' local (intrinsic) features. Additionally, the experiments
described in [19] and [41] evaluated the problem of recognizing existent links in the
network rather than predicting future ones.
3 Link Prediction in Collaboration Networks
In this article, we aim to predict future collaborations between researchers by
observing the network at an earlier point of time $t$ as the training sample and predicting
the links to be added to the network during the time interval from time $t$
to a given future time $t'$. The network we consider consists of the following information:
(1) a set of $N$ individuals: $V = \{v_1, \ldots, v_N\}$. Each person in the network
can belong to $K$ ($K \geq 1$) different affiliations (communities). When $K = 1$, individuals
are partitioned into non-overlapping groups. (2) The connections between
actors are represented by the undirected network graph $G = \{V, E\}$, in which edge
$e = (v_i, v_j)$ denotes that $v_i$ shares certain relationships with $v_j$. We also assume that
the network is unweighted, which means $w(v_i, v_j) = 1$ for all connected node pairs
$(v_i, v_j)$. Given a new pair of nodes in the network, $\{v_m, v_n\}$, our task is to predict
whether there exists a relationship between them.
3.1 Problems of Heterogeneity
Unsupervised link prediction methods mainly fall into two categories: neighborhood
methods, such as Common Neighbors (CN) and Jaccard's Coefficient (JC), which
make predictions based on structural scores that are calculated from the connections
in the nodes' immediate neighborhoods, and path methods, such as PageRank, which
predict the links based on the paths between nodes [21]. Essentially, the prediction
score represents the similarity between the given pair of nodes: the higher the score,
the more likely that there exists a connection between them. Using the Common
Neighbors (CN) scoring method, two nodes with 10 common neighbors are more
likely to be linked than nodes with only a single common neighbor.
However, these neighborhood approaches intrinsically assume that the connections
in the network are homogeneous: each node's connections are the outcome of one
relationship. Directly applying homogeneous link predictors to overlapping commu-
nities can cause prediction errors. A simple example is shown in Fig. 1, where two
types of relationships co-exist within the same network. The solid line represents
the coauthorship of a paper in a data mining conference and the dashed line repre-
sents the activity of collaborating on a machine learning paper. Note that the link
types are hidden from the method; only the presence of a link is known. Author 1 is
associated with 2 affiliations since he/she participates in both activities. If all interac-
tions were considered homogeneously, the prediction score for linking authors 2 and
6, CN(2, 6), and that for authors 2 and 3, CN(2, 3), under the Common Neighbors
scoring method would be the same, since both node pairs share only one common
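neighbor. This is misleading, because a link between authors 2 and 6 would cross
affiliations, whereas a link between authors 2 and 3 would not. The question now becomes: how can we capture
type correlations between edges to avoid being misled by connection heterogeneity?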
In the next section, we describe how edges in the network can be analyzed using
edge clustering [34] to construct a social feature space that makes this possible.
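The tie described above can be reproduced on a small example. The following is a minimal sketch on a hypothetical seven-node graph (an assumption for illustration, not the exact graph of Fig. 1); the function names are likewise illustrative. It shows that Common Neighbors assigns the same score to a within-affiliation pair and a cross-affiliation pair when link types are ignored.

```python
# Hypothetical toy coauthorship graph (not the exact graph of Fig. 1).
# Author 1 collaborates with 2, 3, 4 on one topic and with 5, 6, 7 on another,
# but the link types are hidden from the predictor.
edges = [(1, 2), (1, 3), (1, 4), (2, 4), (1, 5), (1, 6), (1, 7)]


def build_neighbors(edge_list):
    """Return a dict mapping each node to the set of its neighbors."""
    nbrs = {}
    for u, v in edge_list:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def common_neighbors(nbrs, x, y):
    """Unweighted CN score: number of nodes adjacent to both x and y."""
    return len(nbrs[x] & nbrs[y])


nbrs = build_neighbors(edges)
cn_23 = common_neighbors(nbrs, 2, 3)  # pair inside one (hidden) affiliation
cn_26 = common_neighbors(nbrs, 2, 6)  # pair spanning two (hidden) affiliations
print(cn_23, cn_26)
```

Both pairs share only author 1, so a homogeneous predictor cannot rank one above the other.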
3.2 Edge-Based Feature Extraction
The idea of constructing edge-based social dimensions was initially used to address
the multi-label classification problem in networked data with multiple types of
links [34]. Connections in human networks are often the result of affiliation-driven
social processes; since each person usually has more than one connection, the involvement
of potential groups related to one person's edges can be utilized as a representation
for his/her true affiliations. Because this edge class information is not
always readily available in the social media application, an unsupervised clustering
algorithm can be applied to partition the edges into disjoint sets such that each set
represents one potential affiliation. The edges of actors who are involved in multiple
affiliations are likely to be separated into different sets.
In this article, we construct the nodes' social feature space using the scalable edge
clustering method proposed in [34]. However, instead of using the social feature
space to label nodes, in this article our aim is to leverage this information to reweight
links. First, each edge is represented in a feature-based format, where the indices of
the nodes that define the edges are used to create the features as shown in Fig. 1.
In this feature space, edges that share a common node are more similar than edges
that do not. Based on the features of each edge, k-means clustering is used to separate
the edges into groups using this similarity measure. Each edge cluster represents
a potential affiliation, and a node will be considered as possessing one affiliation as
long as any of its connections are assigned to that affiliation. Since the edge feature
data is very sparse, the clustering process can be significantly accelerated as follows.
In each iteration a small portion of relevant instances (edges) that share features with
cluster centroids are identified, and only the similarity of the centroids with their
relevant instances needs to be computed. By using this procedure, the clustering task
can be completed within minutes even for networks with millions of nodes.
Fig. 1 A simple example of a coauthorship network (a). The solid line represents coauthorship
of a paper in a data mining conference and the dashed line represents the activity of collaborating
on a machine learning paper. In edge-based social features (b), each edge is first represented by a
feature vector where nodes associated with the edge denote the features. For instance, here the edge
1-3 is represented as [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]. Then, the node's social feature (SF) is constructed
based on edge cluster IDs (c). Suppose in this example the edges are partitioned into two clusters
(represented by the solid lines and dashed lines respectively); then the SFs for nodes 1 and 2 become
[3, 3] and [0, 2] using the count aggregation operator. Employing social features enables us to score
2-6 (cross-affiliation link) lower than 2-3 even though they have the same number of common
neighbors
After clustering the edges, we can easily construct the nodes' social feature vectors
using aggregation operators such as count or proportion on edge cluster IDs. In
[34], these social dimensions are constructed based on the nodes' involvement in
different edge clusters. Although aggregation operators are simply different ways
of representing the same information (the histogram of edge cluster labels), alter-
nate representations have been shown to impact classification accuracy based on the
application domain [31].
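The two steps described in this section (edge feature vectors, then aggregation of edge cluster IDs per node) can be sketched as follows. This is an illustrative sketch only: the edge-to-cluster assignment is supplied by hand rather than produced by k-means, the seven-edge graph is hypothetical, and the function names are assumptions rather than part of the method in [34].

```python
def edge_feature_vector(edge, num_nodes):
    """Indicator vector with 1s at the (1-indexed) endpoints of the edge."""
    vec = [0] * num_nodes
    for node in edge:
        vec[node - 1] = 1
    return vec


def social_features(edge_clusters, num_clusters, operator="count"):
    """Aggregate edge cluster IDs into one social feature vector per node.

    edge_clusters maps each edge (u, v) to its cluster ID in [0, num_clusters).
    operator is "count" (histogram of cluster labels) or "proportion"
    (the same histogram normalized to sum to 1).
    """
    counts = {}
    for (u, v), cluster in edge_clusters.items():
        for node in (u, v):
            counts.setdefault(node, [0] * num_clusters)[cluster] += 1
    if operator == "proportion":
        return {n: [c / sum(vec) for c in vec] for n, vec in counts.items()}
    return counts


# Hypothetical clustering result: cluster 0 = "solid" edges, cluster 1 = "dashed" edges.
edge_clusters = {
    (1, 5): 0, (1, 6): 0, (1, 7): 0,
    (1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 4): 1,
}

vec_13 = edge_feature_vector((1, 3), num_nodes=10)
sf_count = social_features(edge_clusters, num_clusters=2)
sf_prop = social_features(edge_clusters, num_clusters=2, operator="proportion")
print(vec_13, sf_count[1], sf_count[2], sf_prop[2])
```

With this hand-made assignment, node 1 is involved in both clusters while node 2 is involved in only one, which is the kind of distinction the reweighting step relies on.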
4 Proposed LPSF Framework: Reweighting the Network + Supervised Learning Classifier
Most previous work in link prediction focuses on node-similarity metrics computed
for unweighted networks, where the strength of relationships is not taken into account.
However, proximities between nodes can be estimated better by using both graph
proximity measures and the weights of existing links [9, 23]. Much of this prior
work uses the number of encounters between users as the link weights. However, as
the structure of the network can be highly informative, social dimensions provide an
effective way of differentiating the nodes in collaborative networks [34, 37].
In this article, the weights of the links are evaluated based on the users' social
features extracted from the network topology under different similarity measures. For
our domain, we evaluated several commonly used metrics including inner product,
cosine similarity, and the Histogram Intersection Kernel (HIK), which is used to compare
color histograms in image classification tasks [4]. Since our social features can be
regarded as the histogram of a person's involvement in different potential groups, HIK
can also be adopted to measure the similarity between two people. Given the social
features of person $v_i$ and person $v_j$, $(SF_i, SF_j) \in X \times X$, the HIK is defined as
follows:
$$K_{HI}(v_i, v_j) = \sum_{k=1}^{m} \min\{SF_i^{(k)}, SF_j^{(k)}\}, \qquad (1)$$
where $m$ is the length of the feature vector and $SF_i^{(k)}$ denotes the $k$-th component of $SF_i$.

The closeness of users can also be evaluated by the total number of common link
clusters they associate with. We call this measure Common Link Clusters (CLC).
Section 4.4.1 compares classification performance of these similarity metrics.
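The four weighting measures named above can be written compactly. The sketch below is illustrative and assumes social features are plain lists of non-negative counts; the function names are hypothetical, and CLC is interpreted here as the number of clusters in which both nodes have at least one edge.

```python
import math


def inner_product(sf_a, sf_b):
    return sum(a * b for a, b in zip(sf_a, sf_b))


def cosine_similarity(sf_a, sf_b):
    norm = math.sqrt(inner_product(sf_a, sf_a)) * math.sqrt(inner_product(sf_b, sf_b))
    return inner_product(sf_a, sf_b) / norm if norm > 0 else 0.0


def histogram_intersection(sf_a, sf_b):
    """HIK of Eq. (1): sum of component-wise minima."""
    return sum(min(a, b) for a, b in zip(sf_a, sf_b))


def common_link_clusters(sf_a, sf_b):
    """CLC (as interpreted here): clusters in which both nodes are involved."""
    return sum(1 for a, b in zip(sf_a, sf_b) if a > 0 and b > 0)


# Social features of nodes 1 and 2 from the caption of Fig. 1.
sf_1, sf_2 = [3, 3], [0, 2]
w_inner = inner_product(sf_1, sf_2)
w_cos = cosine_similarity(sf_1, sf_2)
w_hik = histogram_intersection(sf_1, sf_2)
w_clc = common_link_clusters(sf_1, sf_2)
print(w_inner, round(w_cos, 4), w_hik, w_clc)
```

Any of these values could serve as the weight of the link between nodes 1 and 2 in the reweighted network.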
4.1 Unsupervised Proximity Metrics
In order to investigate the impact of link weights for link prediction in collaboration
networks, we compare the performances of eight benchmark unsupervised metrics
for unweighted networks and their extensions for weighted networks. The prediction
scores from these unsupervised metrics can further be used as the attributes for
learning supervised prediction models. We detail the unsupervised prediction metrics
for both unweighted and weighted networks in the following sections.
Let $N(x)$ be the set of neighbors of node $x$ in the social network and let $D_x$ be
the degree (the total number of neighbors) of node $x$. Obviously, in an unweighted
network, $D_x = |N(x)|$. Let $w(x, y)$ be the link weight between nodes $x$ and $y$ in a
weighted network. Note that in our generated weighted network, the weight matrix
$W$ is symmetric, i.e. $w(x, y) = w(y, x)$.
4.1.1 Number of Common Neighbors (CN)
The CN measure for unweighted networks is defined as the number of nodes with
direct connections to both of the given nodes $x$ and $y$:
$$\mathrm{CN}(x, y) = |N(x) \cap N(y)|. \qquad (2)$$
The CN measure is one of the most widespread metrics adopted in link prediction,
mainly due to its simplicity. Intuitively, the measure simply states that two nodes
that share a high number of common neighbors should be directly linked [24].
For weighted networks, the CN measure can be extended as:
$$\mathrm{CN}(x, y) = \sum_{z \in N(x) \cap N(y)} \bigl( w(x, z) + w(y, z) \bigr). \qquad (3)$$
4.1.2 Jaccard's Coefficient (JC)
The JC measure assumes that the node pairs that share a higher proportion of common
neighbors relative to their total number of neighbors are more likely to be linked.
From this point of view, JC can be regarded as a normalized variant of CN. For
unweighted networks, the JC measure is defined as:
$$\mathrm{JC}(x, y) = \frac{|N(x) \cap N(y)|}{|N(x) \cup N(y)|}. \qquad (4)$$
For weighted networks, the JC measure can be extended as:
$$\mathrm{JC}(x, y) = \frac{\sum_{z \in N(x) \cap N(y)} \bigl( w(x, z) + w(y, z) \bigr)}{\sum_{a \in N(x)} w(x, a) + \sum_{b \in N(y)} w(y, b)}. \qquad (5)$$
4.1.3 Preferential Attachment (PA)
The PA measure assumes that the probability that a new link is created from a node $x$
is proportional to the node degree $D_x$ (i.e., nodes that currently have a high number
of relationships tend to create more links in the future). Newman proposed that the
product of the numbers of neighbors of the two nodes should be used as a measure for the
probability of a future link between them [24]. The PA measure for an unweighted
network is defined by:
$$\mathrm{PA}(x, y) = |N(x)| \times |N(y)|. \qquad (6)$$
The PA measure extended for a weighted network can be defined as:
$$\mathrm{PA}(x, y) = \sum_{z_1 \in N(x)} w(x, z_1) \times \sum_{z_2 \in N(y)} w(y, z_2). \qquad (7)$$
4.1.4 Adamic/Adar Coefficient (AA)
The AA measure is related to Jaccard's coefficient with additional emphasis on the
importance of the common neighbors [1]. AA defines higher weights for the common
neighbors that have fewer neighbors. The AA measure for unweighted networks is
defined as:
$$\mathrm{AA}(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{1}{\log |N(z)|}. \qquad (8)$$
The AA measure extended for a weighted network can be defined as:
$$\mathrm{AA}(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{w(x, z) + w(y, z)}{\log \bigl( 1 + \sum_{c \in N(z)} w(z, c) \bigr)}. \qquad (9)$$
4.1.5 Resource Allocation Index (RA)
The Resource Allocation Index has a formula similar to that of the Adamic/Adar Coefficient,
but with a different underlying motivation. RA is based on physical processes of
resource allocation [26] and can be applied to networks formed by airports (for
example, flow of aircraft and passengers) or networks formed by electric power
stations, such as power distribution networks. The RA measure was first proposed in [42] and
for unweighted networks it is expressed as follows:
$$\mathrm{RA}(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{1}{|N(z)|}. \qquad (10)$$
The RA measure for weighted networks can be defined as:
$$\mathrm{RA}(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{w(x, z) + w(y, z)}{\sum_{c \in N(z)} w(z, c)}. \qquad (11)$$
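The five neighborhood measures defined so far (CN, JC, PA, AA, RA) share the same ingredients, so they can be computed together. The following is a minimal sketch, assuming the graph is stored as a symmetric dict of dicts with positive weights; the helper names and the four-node example are hypothetical.

```python
import math


def build_graph(weighted_edges):
    """Symmetric adjacency map: graph[x][y] = w(x, y) = w(y, x)."""
    graph = {}
    for x, y, w in weighted_edges:
        graph.setdefault(x, {})[y] = w
        graph.setdefault(y, {})[x] = w
    return graph


def strength(graph, x):
    """Sum of the weights of all links incident to x."""
    return sum(graph[x].values())


def neighborhood_scores(graph, x, y):
    """Return (unweighted, weighted) score dicts following Eqs. (2)-(11)."""
    common = set(graph[x]) & set(graph[y])
    union = set(graph[x]) | set(graph[y])
    # w(x, z) + w(y, z) for every common neighbor z.
    pair_w = {z: graph[x][z] + graph[y][z] for z in common}
    total_strength = strength(graph, x) + strength(graph, y)
    unweighted = {
        "CN": len(common),
        "JC": len(common) / len(union) if union else 0.0,
        "PA": len(graph[x]) * len(graph[y]),
        # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
        "AA": sum(1.0 / math.log(len(graph[z])) for z in common),
        "RA": sum(1.0 / len(graph[z]) for z in common),
    }
    weighted = {
        "CN": sum(pair_w.values()),
        "JC": sum(pair_w.values()) / total_strength if total_strength else 0.0,
        "PA": strength(graph, x) * strength(graph, y),
        "AA": sum(pw / math.log(1 + strength(graph, z)) for z, pw in pair_w.items()),
        "RA": sum(pw / strength(graph, z) for z, pw in pair_w.items()),
    }
    return unweighted, weighted


# Hypothetical weighted graph; the pair ("a", "b") is not yet linked.
g = build_graph([("a", "c", 2), ("b", "c", 1), ("a", "d", 1), ("c", "d", 3)])
unweighted, weighted = neighborhood_scores(g, "a", "b")
print(unweighted)
print(weighted)
```

In a supervised setting, each of these ten numbers would become one attribute of the node pair.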
4.1.6 Inverse Path Distance (IPD)
The Path Distance measure for unweighted networks simply counts the number of
nodes along the shortest path between $x$ and $y$ in the graph (excluding $x$ and $y$
themselves). Thus, when two non-adjacent nodes $x$ and $y$ share at least one common
neighbor, $\mathrm{PD}(x, y) = 1$. In this article, we
adopt the Inverse Path Distance to measure the proximity between two nodes, where
$\mathrm{IPD}(x, y) = 1/\mathrm{PD}(x, y)$.
IPD is based on the intuition that nearby nodes are likely to be connected. In a
weighted network, IPD is defined by the inverse of the shortest weighted distance
between two nodes. Since IPD quickly approaches 0 as path lengths increase, for
computational efficiency, we terminate the shortest path search once the distance
exceeds a threshold $L$ and approximate IPD for more distant node pairs as 0.
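For the unweighted case, the truncated search can be sketched with a breadth-first traversal. This is an illustrative sketch under two assumptions that are not stated in the text: the threshold is applied to the number of edges traversed (the hypothetical parameter `max_hops`), and directly linked pairs, which have no intermediate node, are given an infinite score.

```python
import math
from collections import deque


def inverse_path_distance(adj, x, y, max_hops=5):
    """IPD(x, y) = 1 / PD(x, y), with PD = number of intermediate nodes
    on a shortest path. Returns 0.0 if y is not reached within max_hops edges."""
    if x == y:
        raise ValueError("x and y must be distinct nodes")
    hops = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if hops[u] >= max_hops:
            continue  # do not expand beyond the search threshold
        for v in adj.get(u, ()):
            if v in hops:
                continue
            hops[v] = hops[u] + 1
            if v == y:
                intermediate = hops[v] - 1
                # Assumption: adjacent nodes are treated as infinitely close.
                return math.inf if intermediate == 0 else 1.0 / intermediate
            queue.append(v)
    return 0.0


# Hypothetical path graph 1 - 2 - 3 - 4 - 5.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
ipd_13 = inverse_path_distance(path, 1, 3)
ipd_14 = inverse_path_distance(path, 1, 4)
ipd_15_cut = inverse_path_distance(path, 1, 5, max_hops=2)
print(ipd_13, ipd_14, ipd_15_cut)
```

A weighted variant would replace the breadth-first traversal with a shortest-path search over link weights, truncated in the same way.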
4.1.7 PropFlow
PropFlow [18] is a new unsupervised link prediction method which calculates the
probability that a restricted random walk starting at x ends at y in L steps or fewer
using link weights as transition probabilities. The walk terminates when reaching
node y or revisiting any nodes including node x. By restricting its search within the
threshold L, PropFlow is a local measure that is insensitive to noise in network topol-
ogy far from the source node and can be computed quite efficiently. The algorithm
for unweighted networks is identical to that for weighted networks, except that all
link weights are set equal to 1.
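A breadth-first propagation in the spirit of PropFlow can be sketched as follows. This is a simplified illustration rather than the algorithm of [18] itself: details such as the exact number of propagation rounds may differ, and the function name, the `max_steps` parameter, and the dict-of-dicts graph format are assumptions.

```python
def propflow_scores(graph, source, max_steps=5):
    """Propagate a unit of flow outward from source for at most max_steps rounds.

    graph[u][v] is the weight of link (u, v). Each node splits the flow it holds
    among its neighbors in proportion to link weights, and a node is expanded
    only the first time it is discovered.
    """
    scores = {source: 1.0}
    found = {source}
    frontier = [source]
    for _ in range(max_steps):
        next_frontier = []
        for u in frontier:
            total_out = sum(graph[u].values())
            if total_out == 0:
                continue
            node_input = scores[u]
            for v, w in graph[u].items():
                scores[v] = scores.get(v, 0.0) + node_input * w / total_out
                if v not in found:
                    found.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return scores


# Hypothetical unweighted path a - b - c (all link weights set to 1).
g = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1}}
score_ac = propflow_scores(g, "a", max_steps=2).get("c", 0.0)
score_ac_short = propflow_scores(g, "a", max_steps=1).get("c", 0.0)
print(score_ac, score_ac_short)
```

Node c receives half of the flow leaving b once two rounds are allowed, and nothing when the search is limited to a single round, which illustrates the locality imposed by the threshold.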
4.1.8 PageRank
The PageRank (PR) algorithm of Google fame was first introduced in [6]; it aims
to represent the significance of a node in a network based on the significance of
other nodes that link to it. Inspired by the same assumption as made by Preferential
Attachment, we assume that the links between nodes are driven by the importance of
the node, hence the PageRank score of the target node represents a useful statistic.
Essentially, PageRank outputs the ranking scores (or probability) of visiting the target
node during a random walk from a source. A parameter $\alpha$, the probability of surfing
to a random node, is considered in the implementation. In our experiment, we set
$\alpha = 0.85$ and perform an unoptimized PageRank calculation iteratively until the
vector that represents PageRank scores converges.
For weighted networks, we adopted the weighted PageRank algorithm proposed
in [10]:
$$PR_w(x) = \alpha \sum_{k \in N(x)} \frac{PR_w(k)}{L(k)} + (1 - \alpha) \frac{w(x)}{\sum_{y=1}^{N} w(y)}, \qquad (12)$$
where $L(x)$ is the sum of outgoing link weights from node $x$, and $\sum_{y=1}^{N} w(y)$ is the
total weight across the whole network.
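An iterative computation of this kind can be sketched with a simple power iteration. The sketch below rests on assumptions that go beyond the text: the node weight w(x) is taken to be the sum of the weights of the links incident to x, and each transition from k to x is scaled by w(k, x)/L(k), a common weighted-PageRank convention that coincides with the first term of Eq. (12) when all weights equal 1. It is not claimed to match the formulation of [10] exactly, and all names are illustrative.

```python
def weighted_pagerank(graph, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on a symmetric dict-of-dicts graph with positive weights.

    Every node must have at least one link. Iteration stops when the L1 change
    between successive score vectors drops below tol.
    """
    nodes = list(graph)
    strength = {x: sum(graph[x].values()) for x in nodes}  # plays the role of L(x)
    total = sum(strength.values())
    # Assumption: teleport probability proportional to node strength, w(x) / sum_y w(y).
    teleport = {x: strength[x] / total for x in nodes}
    pr = {x: 1.0 / len(nodes) for x in nodes}
    for _ in range(max_iter):
        new_pr = {}
        for x in nodes:
            inflow = sum(pr[k] * graph[k][x] / strength[k] for k in graph[x])
            new_pr[x] = alpha * inflow + (1 - alpha) * teleport[x]
        change = sum(abs(new_pr[x] - pr[x]) for x in nodes)
        pr = new_pr
        if change < tol:
            break
    return pr


# Hypothetical weighted triangle.
g = {
    "a": {"b": 2.0, "c": 1.0},
    "b": {"a": 2.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
}
pr = weighted_pagerank(g)
print({node: round(score, 4) for node, score in pr.items()})
```

Under these particular assumptions, and on an undirected graph, the fixed point is each node's share of the total strength, which gives a convenient sanity check for the sketch.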
4.2 Supervised Link Predictor
As mentioned in [23], unsupervised link prediction methods exhibit several
drawbacks. First, they can only perform well if the network link topology conforms
to the scoring function a priori. In other words, the assumption is that both the links
in the existing network and the predicted links score highly on the given measure.
Second, the ranking of node pairs is performed using only a single metric, and hence
the strategy may fail to capture the different structural patterns contained in the
network. By contrast, supervised link prediction schemes can integrate information
from multiple measures and can usually better model real-world networks. Most
importantly, unlike in other domains where supervised algorithms require access to
appropriate quantities of labeled data, in link prediction we can use the existing links
in the network as the source of supervision. For these reasons, supervised approaches
to link prediction are drawing increased attention in the community [13, 18, 28].
In this article, we follow a standard approach: we treat the prediction scores from
the unsupervised measures as features for the supervised link predictor. We compare
the accuracy of different classifiers on both unweighted and weighted collaboration
networks.
4.3 Experimental Setup
4.3.1 Multi-relational Dataset
Our proposed method is evaluated on two real-world multi-relational collaboration
networks extracted from the DBLP dataset (footnote 1). The DBLP dataset (Table 1) provides
bibliographic information for millions of computer science references. In this article
we only consider authors who have published papers between 2006 and 2008, and
extract their publication history from 2000 to 2008. In the constructed network,
authors correspond to nodes, and two authors are linked if they have collaborated
at least once. The link prediction methods are tested on the new co-author links in
the subsequent time period [2009, 2010]. For the weighted variant, the number of
coauthored publications is used as the weight on each link. Link heterogeneity is
induced by the broad research topic of the collaborative work.
DBLP-A: In the first DBLP dataset, we select 15 representative conferences in
6 computer science research areas (Databases, Data Mining, Artificial Intelligence,
Information Retrieval, Computer Vision and Machine Learning), and each paper is
associated with a research area if it appeared in any conferences listed under that area.
The collaboration network is constructed only for authors who have publications in
those areas.
DBLP-B: In the second DBLP dataset, we select 6 different computer science
research areas (Algorithms & Theory, Natural Language Processing, Bioinformatics,
Networking, Operating Systems and Distributed & Parallel Computing), and choose
16 representative conferences in these areas.
1 http://www.informatik.uni-trier.de/~ley/db/.
Table 1 Data statistics
Data                  DBLP-A          DBLP-B
Categories            6               6
# of nodes            10,708          6,251
# of new links        12,741          5,592
# of existing links   49,754          30,130
Network density       9.78 × 10^-4    1.7 × 10^-3
Maximum degree        115             72
Average degree        5.2             5.3
Similar DBLP datasets have previously been employed by Kong et al. to evaluate
collective classification in multi-relational networks [15]. In this article, we aim to
predict the missing links (coauthorship) in the future based on the existing connection
patterns in the network.
4.3.2 Evaluation Framework
In this article, the supervised link prediction models are learned from training links
(all existing links) in the DBLP dataset extracted between 2000 and 2008, and the
performance of the model is evaluated on the testing links, new co-author links
generated between 2009 and 2010. Link prediction using a supervised learning model
can be regarded as a binary classification task, where the class label (0 or 1) represents
the link existence of the node pair. When performing the supervised classification,
we sample the same number of non-connected node pairs as that of the existing links
to use as negative instances for training the supervised classifier.
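The balanced sampling of negative instances can be sketched as follows. This is a minimal illustration assuming uniform random sampling over non-connected pairs; the function names, the fixed seed, and the toy graph are hypothetical, and the text does not specify the sampling procedure in more detail.

```python
import random


def sample_negative_pairs(nodes, positive_edges, seed=0):
    """Sample as many non-connected node pairs as there are existing links."""
    rng = random.Random(seed)
    nodes = list(nodes)
    positives = {frozenset(edge) for edge in positive_edges}
    available = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    if available < len(positives):
        raise ValueError("not enough non-connected pairs to balance the classes")
    negatives = set()
    while len(negatives) < len(positives):
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


def build_labeled_pairs(positive_edges, negative_pairs):
    """Label existing links 1 and sampled non-links 0."""
    labeled = [(tuple(sorted(edge)), 1) for edge in positive_edges]
    labeled += [(pair, 0) for pair in negative_pairs]
    return labeled


# Hypothetical training network with 7 nodes and 7 existing links.
existing = [(1, 2), (1, 3), (1, 4), (2, 4), (1, 5), (1, 6), (1, 7)]
negatives = sample_negative_pairs(range(1, 8), existing)
training_pairs = build_labeled_pairs(existing, negatives)
print(len(existing), len(negatives), len(training_pairs))
```

Each labeled pair would then be described by its unsupervised prediction scores before being passed to a classifier.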
In our proposed LPSF model, the edge clustering method is adopted to construct
the initial social dimensions. When conducting the link prediction experiment, we use
cosine similarity while clustering the links in the training set. The edge-based social
dimension in our proposed method, LPSF, is constructed based on the edge cluster
IDs using the count aggregation operator, and varying numbers of edge clusters are
tested in order to provide the best performance of LPSF. The weighted network is
then constructed according to the similarity score of connected nodes' social features
under the weight measure selected from Sect. 4. The search distance L for
unsupervised metrics Inverse Path Distance and PropFlow is set to 5. We evaluate
the performance of four supervised learning models in this article, which are Naive
Bayes (NB), Logistic Regression (LR), Neural Network (NN) and Random Forest
(RF). All algorithms have been implemented in WEKA [12], and the performance
of each classifier is tested using its default parameter setting.
In the DBLP dataset, the number of positive link examples for testing is very
small compared to negative ones. In this article, we sample an equivalent number
of non-connected node pairs as links from the 2009 and 2010 period to use as the
negative instances in the testing set. The evaluation measures for supervised link
prediction performance used in this article are precision, recall and F-Measure.
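For reference, the three evaluation measures can be computed from binary labels as in the sketch below. It assumes that F-Measure means the balanced F1 score (the harmonic mean of precision and recall); the function name and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = link, 0 = no link)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Balanced toy test set: 4 true links and 4 sampled non-links (hypothetical).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here three of the four predicted links are correct and three of the four true links are found, so all three measures equal 0.75.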
4.4 Results
This section describes several experiments to study the benefits of augmenting link
prediction methods using LPSF. First, we compare the performance of different
weighting metrics used in LPSF. Second, we evaluate how the number of social
features affects the performance of LPSF. Finally, we examine how several super-
vised link prediction models perform on unweighted and weighted networks, and
the degree to which LPSF improves classification performance under different eval-
uation measures.
4.4.1 Effect of Similarity Measure
A critical procedure in LPSF is reweighting the original networks according to the
similarity of the node pairs' social features. Figure 2 shows the F-Measure performance
of LPSF using different weighting metrics on the DBLP datasets. Here the number
of edge clusters is set to 1,000 for all conditions, and different classifiers have been
adopted for the purpose of comparison. We observe that in the DBLP-A dataset, even
though the performance of each weighting metric is mainly dominated by the choice
of classifier, Histogram Intersection Kernel (HIK) and Inner Product perform better
than CLC and Cosine in most cases. HIK outperforms Cosine with Naive Bayes by
about 20 % and Inner Product with Logistic Regression by about 7 %. The Cosine
measure performs almost equally well for all classifiers, but with relatively low
accuracy.
In the DBLP-B dataset, while Inner Product performs well on Random Forest,
HIK outperforms other weighting metrics using the other classifiers. Accordingly,
we select HIK as our default weighting metric in LPSF for the remainder of the
experiments.
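To make the compared weighting metrics concrete, the sketch below evaluates three of them on a pair of count-based social feature vectors. It uses the standard textbook definitions of the inner product, cosine similarity and histogram intersection kernel; the exact forms used by LPSF are given in Sect. 4 and may include normalization not shown here. CLC is omitted because its definition is not restated in this section. The vectors are hypothetical.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product normalized by the vector lengths."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0

def histogram_intersection(a, b):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(x, y) for x, y in zip(a, b))

# Entry k counts how many of a node's edges fall into edge cluster k
# (count aggregation); the values are hypothetical.
u = [3, 0, 1]
v = [1, 2, 1]

weights = {
    "inner": inner_product(u, v),
    "cosine": cosine(u, v),
    "hik": histogram_intersection(u, v),
}
```

For these vectors the inner product is 4, HIK is 2, and the cosine is 4 / sqrt(60), which shows how differently the metrics scale the same pair of feature vectors.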
Fig. 2 Classification performance of LPSF on the DBLP Dataset using different similarity measures
on nodes' social features. The number of edge clusters is set to 1,000, and Histogram Intersection
Kernel (HIK) performs the best in both datasets. a DBLP-A dataset. b DBLP-B dataset
Fig. 3 Classification performance of LPSF using HIK on the DBLP Dataset with varying number
of social features, using different supervised classifiers. a DBLP-A dataset. b DBLP-B dataset
4.4.2 Varying the Number of Social Features
Here, we evaluate how the number of social features (edge clusters) affects the link
prediction performance of LPSF, and Fig. 3 shows the corresponding classification
accuracy under the F-Measure metric. In the DBLP-A dataset, Naive Bayes and
Random Forest are relatively robust to the number of social features while Logis-
tic Regression and Neural Network perform better with a smaller number of social
features (less than 500). Similarly in the DBLP-B dataset, LPSF demonstrates bet-
ter performance with fewer social features. Therefore we set the number of social
features to 300 and 500 for the DBLP-A and DBLP-B datasets respectively.
4.4.3 Supervised Link Prediction: LPSF Reweighting
Figures 4 and 5 display the comparisons between LPSF and the baseline methods
on the DBLP datasets using a variety of supervised link classification techniques,
against both the unweighted and weighted supervised baselines. The same features
are used by all methods, with the only difference being the weights on the network
links. In this article, we compare the proposed method LPSF with alternate weighting
schemes, such as the number of co-authored papers, as suggested in [9]. We see that in
both DBLP datasets, Unweighted, Weighted and LPSF perform almost equally under
Precision, though LPSF performs somewhat worse for some classifiers (Random
Forest and Naive Bayes). When considering the number of collaborations between
author pairs, the Weighted method slightly improves upon the performance of the
Unweighted method.
The proposed reweighting (LPSF) offers substantial improvement over both the
Unweighted and Weighted schemes on Recall and F-Measure in both datasets. In
the DBLP-A dataset, LPSF outperforms the unweighted baseline most dramatically
with Logistic Regression, improving Recall by about 23 % and F-Measure by about
40 %. In the DBLP-B dataset, LPSF shows the best performance using Neural Network,
improving over the baselines by 13 % on Recall and 30 % on F-Measure.
Fig. 4 Comparing the classification performance of supervised link prediction models on
unweighted and weighted DBLP-A networks using Precision, Recall and F-Measure. The proposed
method (LPSF) is implemented using 300 edge clusters and the HIK reweighting scheme. Results
show that LPSF significantly improves over both unweighted and weighted baselines, especially
under Recall and F-Measure
Fig. 5 Comparing the classification performance of supervised link prediction models on
unweighted and weighted DBLP-B networks using Precision, Recall and F-Measure. The proposed
method (LPSF) is implemented using 500 edge clusters and the HIK reweighting scheme. Results
show that LPSF significantly improves over both unweighted and weighted baselines, especially
under Recall and F-Measure
LPSF calculates the closeness between connected nodes according to their social
dimensions, which captures the nodes' prominent interaction patterns embedded in
the network and better addresses heterogeneity in link formation. By differentiating
different types of links, LPSF is able to discover the possible link patterns between
disconnected node pairs that may not be determined by the Unweighted and simple
Weighted method, and hence exhibits great improvement on Recall and F-Measure.
Since LPSF can be directly applied on the unweighted network, without considering
any additional node information, it is thus broadly applicable to a variety of link
prediction domains.
4.4.4 Supervised Link Prediction: Choice of Classifier
Figures 4 and 5 compare the performance of different supervised classifiers for link
prediction. We found that the performance of the classifiers varies between datasets.
Logistic Regression, Naive Bayes and Neural Network exhibit comparable perfor-
mance. Somewhat surprisingly, Random Forest does not perform well with LPSF.
We also observe that LPSF using Naive Bayes will boost the Recall performance
over baseline methods at the cost of lower Precision. Therefore Logistic Regression
and Neural Network are a better choice for LPSF in that they improve the Recall
performance without decreasing the Precision. Using the traditional weighted fea-
tures [9] does not help supervised classifiers for link prediction to a great extent.
As discussed above, reweighting the unweighted collaboration network using our
proposed technique, LPSF, performs the best.
5 Unsupervised Diffusion-Based Link Prediction Models
Traditional unsupervised link prediction methods aim to measure the similarity for
a node pair and use the affinity value to predict the existence of a link between
them. The performance of a link predictor is consequently highly dependent on the
choice of pairwise similarity metric. Most widely used unsupervised link predictors
focus on the underlying local structural information of the data, which is usually
extracted from the neighboring nodes within a short distance (usually 1-hop away)
from the source. For instance, methods such as Common Neighbors and Jaccard's
Coefficient calculate the prediction scores based on the number of directly shared
neighbors between the given node pair. However, a recent study of coauthorship
networks by Backstrom and Leskovec shows that researchers are more interested
in establishing long-range weak ties (collaborations) rather than strengthening their
well-founded interactions [3]. Figure 6 shows the distance distribution of newly col-
laborating authors between 2009 and 2010 in the DBLP datasets. We discover that in
both datasets the majority of new links are generated by a node pair with a minimal
distance equal to or greater than two. This poses a problem for local link predictors
which ignore information from the intermediate nodes along the path between the
node pair.
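The limitation of neighborhood-based predictors can be seen in a small sketch. Using the standard definitions of Common Neighbors and Jaccard's Coefficient on a hypothetical path graph, any node pair more than two hops apart shares no neighbors and therefore receives a score of zero, so such pairs cannot be ranked against each other.

```python
def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# Hypothetical path graph a - b - c - d, stored as neighbor sets.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}

two_hops = (common_neighbors(adj, "a", "c"), jaccard(adj, "a", "c"))
three_hops = (common_neighbors(adj, "a", "d"), jaccard(adj, "a", "d"))
```

The pair (a, c) at distance two scores 1 and 0.5, while the pair (a, d) at distance three scores 0 on both measures.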
In the past few years, the diffusion process (DP) model has attracted an increasing
amount of interest for solving information retrieval problems in different domains
[11, 36, 40]. DP aims to capture the geometry of the underlying manifold in a
weighted graph that represents the proximity of the instances. First, the data are rep-
resented as a weighted graph, where each node represents an instance and edges are
weighted according to their pairwise similarity values. Then the pairwise affinities are
re-evaluated in the context of all connected instances, by diffusing the similarity val-
ues through the graph. The most common diffusion processes are based on random
walks, where a transition matrix defines probabilities for walking from one node
to a neighboring one, that are proportional to the provided affinities. By repeatedly
making random walk steps on the graph, affinities are spread on the manifold, which
in turn improves the obtainable retrieval scores. In the context of social network
data, the data structure naturally leads to graph modeling, and graph-based methods
have been proven to perform extremely well when combined with Markov chain
techniques. In the following sections, we will explore the effectiveness of diffusion-based
methods on solving link prediction problems. The next section introduces the
diffusion process model (DP) and an embedding method based on diffusion processes,
diffusion maps (DM). Our proposed diffusion-based link prediction models (LPDP
and LPDM) are discussed in Sects. 5.1 and 5.2.
Fig. 6 Probability distribution of the shortest distance between node pairs in future links (between
2009 and 2010) in the DBLP datasets. Distances marked as 0 are used to indicate that no path
can be found that connects the given node pair. a DBLP-A dataset. b DBLP-B dataset
5.1 Diffusion Process
We begin with the definition of a random walk on a graph $G = (V, E)$, which
contains $N$ nodes $v_i \in V$ and edges $e_{ij} \in E$ that link nodes to each other. The
entries in the $N \times N$ affinity matrix $A$ provide the edge weights between node pairs.
The random walk transition matrix $P$ can be defined as

$$P = D^{-1} A \tag{13}$$

where $D$ is an $N \times N$ diagonal matrix defined as

$$d_{ij} = \begin{cases} \deg(i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

and $\deg(i)$ is the degree of node $i$ (i.e., the sum over its edge weights). The
transition probability matrix $P$ is a row-normalized matrix, where each row sums
up to 1. Given $f_0$, a $1 \times N$ vector holding the initial distribution for a
specific node, a single step of the diffusion process can be defined by the simple
update rule:

$$f_{t+1} = f_t P \tag{15}$$

Therefore, it is possible to calculate the probability vector $f_t$ after $t$ steps of random
walks as

$$f_t = f_0 P^t \tag{16}$$

where $P^t$ is the $t$-th power of the matrix $P$. The entry $f^{j}_{t}$ in $f_t$ measures the probability
of going from the source node to node $j$ in $t$ time steps.
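Equations 13 to 16 can be illustrated with a minimal pure-Python sketch. The helper names (`transition_matrix`, `step`, `walk`) and the three-node star graph are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def transition_matrix(A):
    """Row-normalize the affinity matrix: P = D^{-1} A (Eq. 13)."""
    P = []
    for row in A:
        deg = sum(row)                   # deg(i): sum over the edge weights of node i
        if deg == 0:
            raise ValueError("isolated node: row cannot be normalized")
        P.append([w / deg for w in row])
    return P

def step(f, P):
    """One diffusion step f_{t+1} = f_t P (Eq. 15), with f a row vector."""
    n = len(P)
    return [sum(f[i] * P[i][j] for i in range(n)) for j in range(n)]

def walk(f0, P, t):
    """Distribution after t steps, f_t = f_0 P^t (Eq. 16), by repeated stepping."""
    f = list(f0)
    for _ in range(t):
        f = step(f, P)
    return f

# Star graph: node 0 is linked to nodes 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
P = transition_matrix(A)
f0 = [1.0, 0.0, 0.0]                     # the walker starts at node 0
f1 = walk(f0, P, 1)
f2 = walk(f0, P, 2)
```

After one step the walker is on node 1 or node 2 with probability 0.5 each; after two steps it is back on node 0 with probability 1, and each `f` remains a probability distribution.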
The PageRank algorithm described in Sect. 4.1 is one of the most successful
webpage ranking methods and is constructed using a random walk model on the
underlying hyperlink structures. In PageRank, the standard random walk is modified:
at each time step $t$ a node can walk to its outgoing neighbors with probability $\alpha$ or will
jump to a random node with probability $(1 - \alpha)$. The update strategy is as follows:

$$f_{t+1} = \alpha f_t P_t + (1 - \alpha) y \tag{17}$$

where $y$ defines the probabilities of randomly jumping to the corresponding nodes.
The PageRank algorithm iteratively updates the webpages' ranking distribution ($f$)
until it converges. One extension of the PageRank algorithm is random walk with
restart (RWR) [27], which considers a random walker starting from node $i$, who will
iteratively move to a random neighbor with probability $\alpha$ and return to itself with
probability $1 - \alpha$. In the RWR update, $y$ in Eq. 17 is simply a $1 \times N$ vector with the
$i$th element equal to 1 and the others equal to 0.
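A random walk with restart can be sketched by iterating the update of Eq. 17 until the scores stop changing. Several choices here are assumptions rather than details from the text: the damping factor is written as `alpha`, the transition matrix is kept fixed across steps, convergence is tested with a simple tolerance, and the function name `rwr` and the path graph are hypothetical.

```python
def rwr(A, source, alpha=0.9, tol=1e-10, max_iter=1000):
    """Random walk with restart from `source` on the weighted graph A."""
    n = len(A)
    P = [[w / sum(row) for w in row] for row in A]      # row-normalized transitions
    y = [1.0 if j == source else 0.0 for j in range(n)]  # restart vector
    f = y[:]
    for _ in range(max_iter):
        nxt = [alpha * sum(f[i] * P[i][j] for i in range(n)) + (1 - alpha) * y[j]
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, f)) < tol:
            return nxt
        f = nxt
    return f

# Path graph 0 - 1 - 2 - 3 (hypothetical data); walker restarts at node 0.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
scores = rwr(A, source=0, alpha=0.5)
```

On this path the converged scores decrease with distance from the source, which is the behavior one expects from a proximity measure.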
The diffusion process can further be extended to different independent instances
by updating the probability matrix as follows:

$$W_{t+1} = \alpha W_t P_t + (1 - \alpha) Y \tag{18}$$

where $W$ is an $N \times N$ matrix that represents the local relationships (weights) between
different instances. For networked data, the adjacency matrix $A$ can be directly used
as $W$, and $P$ can be formed by normalizing matrix $W$ such that its rows add up to 1.
Similarly, the $N \times N$ matrix $Y$ consists of $N$ personalized row vectors $y$.
In the literature, a number of diffusion models have been proposed by tuning
the functions for $W$ for different application domains [11, 27, 36]. Our studies also
reveal that the choice of diffusion scheme has a substantial impact on the link prediction
accuracy. In this article, we adopt the updating scheme used for Random Walk with
Restart in Eq. 18. To apply the diffusion model to the link prediction problem, we
calculate the prediction score for a given node pair $(i, j)$ based on the corresponding
entries in the final diffusion matrix:

$$\mathrm{LPDP}(i, j) = W^{(t)}_{ij} \times W^{(t)}_{ji} \tag{19}$$

where $W^{(t)}_{ij}$ is the corresponding $(i, j)$ entry in $W_t$. Note that $W_t$ is not necessarily
a symmetric matrix, meaning that in general $W^{(t)}_{ij} \neq W^{(t)}_{ji}$.
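The LPDP scoring step can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the damping factor is written as `alpha`, the transition matrix is fixed, the diffusion starts from the adjacency matrix, the restart matrix Y of Eq. 18 is taken to be the identity (one restart vector per node, as in RWR), and the score of Eq. 19 is taken to be the product of the two directed entries. The function names and the toy graph are hypothetical.

```python
def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def lpdp_scores(A, alpha=0.9, t=15):
    """Diffuse for t steps (Eq. 18 with Y = I, assumed), then score pairs as in Eq. 19."""
    n = len(A)
    P = [[w / sum(row) for w in row] for row in A]   # row-normalized transitions
    W = [[float(w) for w in row] for row in A]       # start from the adjacency matrix
    for _ in range(t):
        WP = matmul(W, P)
        W = [[alpha * WP[i][j] + (1 - alpha) * (1.0 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    # Assumed reading of Eq. 19: product of the two directed entries.
    return [[W[i][j] * W[j][i] for j in range(n)] for i in range(n)]

# Two components: a path 0 - 1 - 2 and a single edge 3 - 4 (hypothetical data).
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
scores = lpdp_scores(A)
```

The unlinked pair (0, 2), which shares the intermediate node 1, receives a positive score, while the pair (0, 3) lies in different components and scores exactly zero. Combining both directions makes the score symmetric even though the diffusion matrix itself need not be.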
5.2 Diffusion Maps
The diffusion maps technique (DM), first introduced by Coifman and Lafon, applies
the diffusion process model toward the problem of dimensionality reduction; it
aims to embed the data manifold into a lower-dimensional space while preserving
the intrinsic local geometric data structure [7]. Different from other dimension-
ality reduction methods such as principal component analysis (PCA) and multi-
dimensional scaling (MDS), DM is a non-linear method that focuses on discovering
the underlying manifold generating the sampled data. It has been successfully used
on problems outside of social media analysis, including learning semantic visual
features for action recognition [20].
As discussed in the previous section, in diffusion models, each entry $W^{(t)}_{ij}$ indicates
the probability of walking from $i$ to $j$ in $t$ time steps. When we increase $t$, the diffusion
process moves forward, and the local connectivity is integrated to reveal the global
connectivity of the network. Increasing the value of $t$ raises the likelihood that edge
weights diffuse to nodes that are further away in the original graph. From this point
of view, the $W_t$ in the diffusion process reflects the intrinsic connectivity of the
network, and the diffusion time $t$ plays the role of a scaling factor for data analysis.
Subsequently, the diffusion distance $D$ is defined using the random walk forward
probabilities $p^{(t)}_{ij}$ to relate the spectral properties of a Markov chain (its matrix,
eigenvalues, and eigenvectors) to the geometry of the data. The diffusion distance aims
to measure the similarity of two points ($N_i$ and $N_j$) using the diffusion matrix $W_t$,
and takes the form:

$$[D^{(t)}(N_i, N_j)]^2 = \sum_{q} \frac{\left(W^{(t)}_{iq} - W^{(t)}_{jq}\right)^2}{\varphi(N_q)^{(0)}} \tag{20}$$

where $\varphi(N_q)^{(0)}$ is the unique stationary distribution which measures the density of
the data points.
Since calculating the diffusion distance is usually computationally expensive,
spectral theory can be adopted to map the data points into a lower-dimensional space
such that the diffusion distance in the original data space becomes the Euclidean
distance in the new space. The diffusion distance can then be approximated with
relative precision $\delta$ using the first $k$ nontrivial eigenvectors and eigenvalues of $W_t$
according to

$$[D^{(t)}(N_i, N_j)]^2 \simeq \sum_{s=1}^{k} (\lambda_s^t)^2 \left(v_s(N_i) - v_s(N_j)\right)^2 \tag{21}$$

where $\lambda_k^t > \delta \lambda_1^t$. If we use the eigenvectors weighted with $\lambda$ as coordinates on the
data, $D^{(t)}$ can be interpreted as the Euclidean distance in the low-dimensional space.
Hence, the diffusion map embedding and the low-dimensional representation are
given by

$$\Psi_t : N_i \mapsto \{\lambda_1^t v_1(N_i), \lambda_2^t v_2(N_i), \ldots, \lambda_k^t v_k(N_i)\}^T \tag{22}$$

The diffusion map $\Psi_t$ embeds the data into a Euclidean space in which the distance
is approximately the diffusion distance:

$$[D^{(t)}(N_i, N_j)]^2 \simeq \|\Psi_t(N_i) - \Psi_t(N_j)\|^2 \tag{23}$$

Table 2 Algorithm: diffusion maps on unweighted networked data

Objective: given a weighted graph $W$ with $N$ nodes, embed all nodes into a $k$-dimensional space
1. Create Markov transition matrix $P$ by normalizing matrix $W$ such that each row sums to 1
2. Compute diffusion matrix $W_t$ at diffusion time $t$ using Eq. 18
3. Perform eigen-decomposition on $W_t$, and obtain eigenvalues $\lambda_s$ and eigenvectors $v_s$, such
   that $W_t v_s = \lambda_s v_s$
4. Embed data by DM using Eq. 22

The diffusion maps framework for the proposed method Link Prediction using
Diffusion Maps (LPDM) is summarized in Table 2. LPDM defines the link prediction
score for a given node pair $(N_i, N_j)$ by the diffusion distance, $D^{(t)}(N_i, N_j)$,
between them.
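The embedding idea behind Table 2 can be sketched with NumPy. For simplicity, and as an assumption that departs from step 2 of Table 2, this sketch diffuses with the plain t-step random walk matrix rather than the damped update of Eq. 18. It also assumes an undirected, connected graph, so that the transition matrix is similar to a symmetric matrix and has real eigenvalues. The function name `diffusion_map`, the scaling of the eigenvectors, and the toy graph are choices made here for illustration.

```python
import numpy as np

def diffusion_map(A, t, k):
    """Embed the nodes of an undirected, connected graph into k dimensions.

    Sketch under stated assumptions: uses the eigen-decomposition of the plain
    random walk matrix P = D^{-1} A, with coordinates lambda_s^t * v_s(N_i).
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    vol = deg.sum()
    # Symmetric matrix similar to P, so eigh yields real eigenpairs.
    S = A / np.sqrt(np.outer(deg, deg))
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    lam, U = lam[:-1], U[:, :-1]             # drop the trivial eigenvalue 1
    order = np.argsort(-np.abs(lam))[:k]     # keep the k largest in magnitude
    lam, U = lam[order], U[:, order]
    # Right eigenvectors of P, scaled so that Euclidean distances in the full
    # embedding match the diffusion distance weighted by deg / vol.
    V = np.sqrt(vol) * U / np.sqrt(deg)[:, None]
    return (lam ** t) * V

# Triangle 0-1-2 with a pendant node 3 attached to node 2 (hypothetical data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
embedding = diffusion_map(A, t=3, k=3)

# Squared Euclidean distance between nodes 0 and 3 in the embedded space.
dist_03 = float(np.sum((embedding[0] - embedding[3]) ** 2))
```

With all nontrivial eigenvectors kept (k = N - 1), the squared Euclidean distances in the embedding reproduce the squared diffusion distances of the t-step walk; choosing a smaller k gives the truncated approximation. Since LPDM scores a pair by its diffusion distance, a smaller distance would presumably indicate a more likely link.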
5.3 Evaluation Framework
In this article, we evaluate the performance of our proposed diffusion-based link
prediction models (LPDP and LPDM) on the same DBLP datasets mentioned in
Sect. 4.3.1, and compare them with the eight unsupervised baselines listed in Sect. 4.1.
Similar to the LPSF model, LPDP and LPDM can be applied to the weighted networks
constructed with the edge clustering method. In a later section, we compare
the performance of LPDP and LPDM on both unweighted and weighted DBLP networks.
We use cosine similarity while clustering the links in the training set. Then
the edge-based social dimension is constructed based on the edge cluster IDs using
the count aggregation operator. We tested the algorithms with various numbers of
edge clusters, and report the one offering the best performance of LPDP and LPDM.
The similarity scores of the connected nodes' social features are measured using
the Histogram Intersection Kernel, and are then used to construct the weighted
network. The search distances L for unsupervised metrics Inverse Path Distance and
PropFlow are set to 7 and 11 for the DBLP-A and DBLP-B datasets respectively.
We sample the same number of non-connected node pairs as that of the existing
future links to be used as the negative training instances. The Area Under the Receiver
Operating Characteristic curve (AUROC) is a standard measure of accuracy that
relates the sensitivity (true positive rate) and specificity (true negative rate) of a
classifier. In this article, we report the performance of all unsupervised link prediction
methods using AUROC.
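For an unsupervised predictor that only outputs scores, AUROC can be computed directly from the scores of the positive and negative test pairs. The sketch below uses the standard rank-based interpretation: the probability that a randomly chosen true link receives a higher score than a randomly chosen non-link, with ties counted as one half. The function name and the scores are hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

# Scores for future links (positives) and sampled non-links (negatives).
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.4]
value = auroc(pos, neg)

# A predictor that assigns the same score to every pair cannot rank at all.
all_tied = auroc([0.0, 0.0], [0.0, 0.0])
```

Here 7.5 of the 9 pairs are ranked correctly, giving an AUROC of about 0.83. A predictor whose scores are all tied yields exactly 0.5, the level of random guessing.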
5.4 Results
We conduct several experiments for evaluating the performance of the diffusion-based
link predictors. First, we evaluate the link prediction performance of LPDP and
LPDM on the unweighted DBLP datasets under different model parameter settings,
such as the damping factor α and diffusion time t. For LPDM, we also examine how
different sizes of the embedded diffusion spaces affect its link prediction performance.
Additionally, we compare the diffusion-based link prediction models with
other unsupervised benchmarks on both unweighted and weighted networks.
5.4.1 Effects of Diffusion Time on LPDP
As mentioned before, in diffusion processes, the diffusion time t controls how far
the link weights diffuse between distant node pairs. The higher the
value of t, the more likely the link weights are to diffuse to the nodes that are
further away. Figure 7 shows the effect of varying the diffusion time on the LPDP link
prediction accuracy for the DBLP datasets. In this experiment, we fix the value of α
to 0.9, which offers LPDP the best performance. We discover that setting t to a higher
value does not guarantee higher link prediction accuracy. LPDP performs best when
t = 15, yielding an AUROC of 85.49 % and 84.61 % on the DBLP-A and DBLP-B
datasets respectively (see Tables 3 and 4).
Fig. 7 Link prediction performance (AUROC) of LPDP with fixed damping factor α = 0.9 and
varying diffusion time (t) on unweighted DBLP-A and DBLP-B datasets. LPDP performs best on
both datasets when t = 15
Fig. 8 AUROC accuracy of LPDM on DBLP datasets with varying damping factor and embedded
space size. The diffusion time t for LPDM is set to 100 and 60 for DBLP-A and DBLP-B dataset
respectively. a DBLP-A dataset. b DBLP-B dataset
5.4.2 Effects of Damping Factor and Embedded Space Size on LPDM
Here, we evaluate how the size of the embedded space and the value of the damping
factor α affect the link prediction performance of LPDM. Figure 8 shows the corresponding
classification accuracy measured by AUROC. The diffusion time t has an
insignificant effect on the performance of LPDM, and the results we report here are
based on setting t to 100 and 60 for DBLP-A and DBLP-B respectively. In both
datasets, a lower damping factor yields higher accuracy, and LPDM demonstrates
the best performance when α equals 0.55 and 0.65 on DBLP-A and DBLP-B respectively.
Note that in Eq. 18, a lower α results in a reduced probability of exchanges
between a node and its connected neighbors. Our results reveal that the size of the
embedded diffusion space greatly affects the performance of LPDM. Here we report
experimental results for embedded diffusion space dimensions ranging from 1 to
100. As shown in Fig. 8, the diffusion maps technique is able to identify semantically
similar nodes by measuring distance on an embedded space with a much smaller
dimensionality. LPDM exhibits the best performance (79.61 and 79.08 %) when the
size of the embedded space equals 25 and 15 on DBLP-A and DBLP-B respectively.
5.4.3 Comparing Unsupervised Link Prediction Methods
In Sect. 4.4.3, we evaluated our supervised link classifier LPSF, which employs an
ensemble of unsupervised measures as features. These unsupervised measures can
themselves be used for classification, although we do not expect an individual feature
to be competitive with the supervised combination. Here, we compare these
unsupervised measures with our proposed diffusion-based measures LPDP and
LPDM on unweighted and reweighted graphs. Tables 3 and 4 summarize the link
prediction performance (AUROC) of individual unsupervised features on DBLP. We
make several interesting observations.
First, we note that among the individual features, PA is the best performer on the
unweighted networks. This is because PA's model for link generation is a particularly
good fit to the DBLP network structure and real-world academic publishing. Highly
published authors generate many more publications than their less prolific peers and
also seek to collaborate with other highly influential (high degree) authors in the
future. Hence the "rich get richer" phenomenon clearly exists in coauthorship
networks. Since the preferential attachment model is already a good match for the
academic publishing domain, reweighting the links does not improve link prediction
performance; in fact, performance drops slightly. This highlights the sensitivity of
unsupervised classifiers to the link prediction domain.
Second, we observe that methods that rely on information gathered from the node
pairs' directly connected neighbors, such as CN, JC, AA and RA, perform poorly,
with accuracies only slightly above 50 %. This result is not unexpected, given that
the distance distribution of new coauthorships shown in Fig. 6 reveals that DBLP
authors are more likely to form future collaborations with authors with whom they
share longer range ties. By collecting structural information from all nodes in the
path, IPD, PropFlow, PR, LPDP and LPDM significantly improve the link prediction
performance. Furthermore, in both the DBLP-A and DBLP-B datasets, the models that
incorporate the random walk technique (PR, LPDP and LPDM) outperform the other two
methods (IPD and PropFlow). LPDP performs the best among the three with an AUROC
accuracy of 85.49 and 84.61 % on DBLP-A and DBLP-B datasets respectively.
Unfortunately the diffusion maps in LPDM are not able to capture the semantically
similar nodes after the diffusion process, which results in inferior performance to LPDP.
LPDM's performance is worse than LPDP's by around 5 %, while still being
better than IPD and PropFlow. This might be because the diffusion process after
t diffusion time steps is good enough to capture the underlying similarity between
nodes at farther distances using the node similarity extracted from the final diffusion
matrix.

Table 3 Link prediction accuracy of individual (unsupervised) classifiers on the DBLP-A dataset

AUROC (%)    PA     AA     CN     JC     RA     IPD    PropFlow  PageRank  LPDP   LPDM
Unweighted   86.68  50.95  50.95  50.95  50.20  77.46  77.52     82.54     85.49  79.61
Weighted     85.16  50.95  50.95  50.95  50.20  80.06  79.71     85.61     83.08  80.43

Performance is evaluated on both unweighted networks and weighted networks constructed using
social context features. Note that the reweighting scheme does not always improve accuracy at the
individual feature level

Table 4 Link prediction accuracy of individual (unsupervised) classifiers on the DBLP-B dataset

AUROC (%)    PA     AA     CN     JC     RA     IPD    PropFlow  PageRank  LPDP   LPDM
Unweighted   87.97  52.15  52.15  52.14  50.66  77.09  76.98     83.60     84.61  79.08
Weighted     87.11  52.15  52.15  52.15  50.66  76.23  76.66     87.14     80.11  80.09

Performance is evaluated on both unweighted networks and weighted networks constructed using
social context features. Note that the reweighting scheme does not always improve accuracy at the
individual feature level

Third, Tables 3 and 4 also include the comparison results of different unsupervised
link predictors on weighted DBLP networks constructed using edge cluster infor-
mation. On one hand, we found that in methods such as CN, JC, AA and RA, the
weighting scheme does not affect the corresponding link prediction accuracy much.
On the other hand, the weighting scheme helps to improve the performance of IPD,
PropFlow, PageRank as well as LPDM by around 2 to 3 %. On both weighted datasets,
PageRank performs best among all unsupervised features. It is also surprising that
LPDP performs poorly on the weighted network, reducing the accuracy by 2 % on
the DBLP-A dataset and 4 % on the DBLP-B dataset.
In summary, we observe that the reweighting scheme yields dramatic
improvements in LPSF which integrates the first eight features listed in Table 3 in a
supervised setting; however, it fails to boost the unsupervised performance of individ-
ual features. As mentioned in [22], the utility of using weights in link prediction is a
somewhat controversial issue. Some case studies have shown that prediction accuracy
can be significantly harmed when weights in the relationships were considered [22].
Our experiments reveal a more nuanced picture: although link weights (using the
proposed approach) may not generate a large improvement for some individual
unsupervised feature-level techniques, employing an appropriate choice of link
weights (e.g., using LPSF) in conjunction with a supervised classifier enables us
to achieve more accurate classification results on the DBLP datasets.
Weights based on node pairs social features extracted from an unweighted
network. higher similarity between the target node and the nearby nodes. Appar-
ently this assumption is invalid in DBLP and other scientific collaboration datasets.
Similarly, Yin et al. estimated link relevance using the random walk algorithm on
an augmented social graph with both attribute and structure information [41]. Their
framework leverages both global and local influences of the attributes. Unlike
their model, our diffusion-based techniques LPDP and LPDM rely only on the network's
structural information, without considering any of the nodes' local (intrinsic) features.
Additionally, the experiments in [19, 41] evaluate the existing links
in the network rather than predicting future links.
6 Conclusion
In this article, we investigate the link prediction problem in collaboration networks
with heterogeneous links. Most commonly-used link prediction methods assume
that the network is in unweighted form, and treat each link equally. In this article,
we proposed a new link prediction framework, LPSF, that captures nodes' intrinsic
interaction patterns from the network topology and embeds the similarities between
connected nodes as link weights. The nodes' similarity is calculated based on social
features extracted using edge clustering to detect overlapping communities in the network.
Experiments on the DBLP collaboration network demonstrate that a judicious
choice of weight measure in conjunction with supervised link prediction enables us
to significantly outperform existing methods. LPSF is better able to capture the true
proximity between node pairs based on link group information and improves the
performance of supervised link prediction methods.
However, the social features utilized effectively by the supervised version of LPSF
are less useful in an unsupervised setting both with the raw proximity metrics and our
two new diffusion-based methods (LPDP and LPDM). We observe that in the DBLP
dataset researchers are more likely to collaborate with other highly published authors
with whom they share weak ties which causes the random-walk based methods (PR,
LPDP and LPDM) to generally outperform other benchmarks. Even though the
reweighting scheme greatly boosts the performance of LPSF, it does not always
have significant impact on its corresponding unsupervised features. In conclusion
we note that any weighting strategy should be applied with caution when tackling
the link prediction problem.
Acknowledgments This research was supported in part by NSF IIS-08451.
References
1. Adamic L, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3):211–230
2. Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multi-scale complexity in
networks. Nature 466:761–764
3. Backstrom L, Leskovec J (2011) Supervised random walks: predicting and recommending
links in social networks. In: Proceedings of the fourth ACM international conference on web
search and data mining, pp 635–644
4. Barla A, Odone F, Verri A (2003) Histogram intersection kernel for image classification. In:
Proceedings 2003 international conference on image processing, vol 3, III-513-16
5. Benchettara N, Kanawati R, Rouveirol C (2010) Supervised machine learning applied to link
prediction in bipartite social networks. In: Proceedings of the international conference on
advances in social network analysis and mining, pp 326–330
6. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput
Netw ISDN Syst 30(1–7):107–117
7. Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21(1):5–30
8. Davis D, Lichtenwalter R, Chawla NV (2012) Supervised methods for multi-relational link
prediction. Social network analysis and mining, pp 1–15
9. de Sá HR, Prudêncio RBC (2011) Supervised link prediction in weighted networks. In:
International joint conference on neural networks (IJCNN), pp 2281–2288
10. Ding Y (2011) Applying weighted pagerank to author citation networks. CoRR abs/1102.1760
11. Donoser M, Bischof H (2013) Diffusion processes for retrieval revisited. In: Proceedings of
IEEE conference on computer vision and pattern recognition (CVPR), pp 1320–1327
12. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data
mining software: an update. SIGKDD Explor Newsl 11(1):10–18
13. Hasan MA, Chaoji V, Salem S, Zaki M (2006) Link prediction using supervised learning. In:
Proceedings of the SDM workshop on link analysis, counterterrorism and security
14. Jin EM, Girvan M, Newman MEJ (2001) The structure of growing social networks. Phys Rev
E 64:046132
15. Kong X, Shi X, Yu PS (2011) Multi-label collective classification. In: SIAM international
conference on data mining (SDM), pp 618–629
16. Lee JB, Adorna H (2012) Link prediction in a modified heterogeneous bibliographic network.
In: Proceedings of international conference on advances in social networks analysis and mining
(ASONAM), pp 442–449
17. Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am
Soc Inf Sci Technol 58(7):1019–1031
18. Lichtenwalter RN, Lussier JT, Chawla NV (2010) New perspectives and methods in link
prediction. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery
and data mining, pp 243–252
19. Liu W, Lü L (2010) Link prediction based on local random walk. EPL (Europhys Lett) 85(5)
20. Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance.
In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp
461–468
21. Lü L, Zhou T (2011) Link prediction in complex networks: a survey. Phys A 390(6):1150–1170
22. Lü L, Zhou T (2009) Role of weak ties in link prediction of complex networks. In: Proceedings
of the ACM international workshop on complex networks meet information and knowledge
management, pp 55–58
23. Murata T, Moriyasu S (2007) Link prediction of social networks based on weighted proximity
measures. In: Web intelligence, pp 85–88
24. Newman M (2001) Clustering and preferential attachment in growing networks. Phys Rev E
64(2):025102
25. Newman MEJ (2004) Detecting community structure in networks. Eur Phys J B - Condens
Matter Complex Syst 38(2):321–330
26. Ou Q, Jin YD, Zhou T, Wang BH, Yin BQ (2007) Power-law strength-degree correlation from
resource-allocation dynamics on weighted networks. Phys Rev E 75:021102
27. Pan JY, Yang HJ, Faloutsos C, Duygulu P (2004) Automatic multimedia cross-modal correlation
discovery. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge
discovery and data mining, pp 653658
28. Popescul A, Popescul R, Ungar LH (2003) Statistical relational learning for link prediction.
In: IJCAI workshop on learning statistical models from relational data
29. Pujari M, Kanawati R (2012) Tag recommendation by link prediction based on supervised
machine learning. In: Proceedings of the international conference on weblogs and social media
30. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc,
New York
31. Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T (2008) Collective classifi-
cation in network data. AI Mag 29:93106
32. Soundarajan S, Hopcroft J (2012) Using community information to improve the precision of
link prediction methods. In: Proceedings of the international conference on the world wide
web, pp 607608
33. Sun Y, Barber R, Gupta M, Aggarwal CC, Han J (2011) Co-author relationship prediction
in heterogeneous bibliographic networks. In: Proceedings of the international conference on
advances in social networks analysis and mining, pp 121128
34. Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimen-
sions. In: Proceedings of international conference on information and knowledge management
(CIKM)
35. Taskar B, Wong MF, Abbeel P, Koller D (2003) Link prediction in relational data. In: Neural
information processing systems
36. Wanga J, Lia Y, Baib X, Zhanga Y, Wangc C, Tang N (2011) Learning context-sensitive
similarity by shortest path propagation. Pattern Recognit 44(1011):23672374
37. Wang X, Sukthankar G (2011) Extracting social dimensions using Fiedler embedding. In:
Proceedings of IEEE international conference on social computing, pp 824829
38. Wang X, Sukthankar G (2013) Link prediction in multi-relational collaboration networks.

In: Proceedings of the IEEE/ACM International conference on advances in social networks
analysis and mining. Niagara Falls, Canada, pp 14451447
39. Xiang EW (2008) A survey on link prediction models for social network data. Sci technol
40. Yang X, Koknar-Tezel S, Latecki LJ (2009) Locally constrained diffusion process on locally
densified distance spaces with applications to shape retrieval. In: Proceedings of IEEE confer-
ence on computer vision and pattern recognition (CVPR)
41. Yin Z, Gupta M, Weninger T, Han J (2010) A unified framework for link recommendation
using random walks. In: 2010 international conference on advances in social networks analysis
and mining (ASONAM), pp 152159
42. Zhou T, L L, Zhang YC (2009) Predicting missing links via local information. Eur Phys J B
- Condens Matter Complex Syst 71(4):623630
Characterization of User Online Dating
Behavior and Preference on a Large Online
Dating Site
Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang, Xiaodong Wang,
Cindy Chen, Benyuan Liu and Don Towsley
Abstract Online dating sites have become popular platforms for people to look for
romantic partners, providing an unprecedented level of access to potential dates that is
otherwise not available through traditional means. Characterization of the user online
dating behavior helps us to obtain a deep understanding of their dating preference
and make better recommendations on potential dates. In this paper we study the user
online dating behavior and preference using a large real-world dataset from a major
online dating site in China. In particular, we characterize the temporal behavior,
message send and reply behavior of users, study how users' online dating behaviors
correlate with various user attributes, and investigate how users' actual online dating
behaviors deviate from their stated preferences. Our results show that on average
a male sends out more messages but receives fewer messages than a female. A
female is more likely to be contacted but less likely to reply to a message than a
male. The number of messages that a user sends out and receives per week quickly
P. Xia (B) · C. Chen · B. Liu
Department of Computer Science, University of Massachusetts Lowell,
1 University Ave, Lowell, MA 01854, USA
e-mail: pxia@cs.uml.edu
C. Chen
e-mail: cchen@cs.uml.edu
B. Liu
e-mail: bliu@cs.uml.edu
K. Tu · B. Ribeiro · D. Towsley
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA, USA
e-mail: kuntu@cs.umass.edu
B. Ribeiro
e-mail: ribeiro@cs.umass.edu
D. Towsley
e-mail: towsley@cs.umass.edu
H. Jiang · X. Wang
Product Division, Baihe.com, Beijing, China
e-mail: jianghua@baihe.com
X. Wang
e-mail: xdwang@baihe.com
DOI 10.1007/978-3-319-12188-8_9
194 P. Xia et al.
decreases with time, especially for female users. Most messages are replied to within
a short time frame with a median delay of around 9 h. Many of the user messaging
behaviors align with notions in social and evolutionary psychology: males tend to
look for younger females while females place more emphasis on the socioeconomic
status (e.g., income, education level) of a potential date. The geographic distance
between two users and the photo count of users play an important role in their
dating behavior. We show that it is important to differentiate between users' true
preferences and random selection. Some user behaviors in choosing attributes in a
potential date may largely be a result of random selection. We also find that while
both males and females are more likely to reply to users whose attributes come closest
to the stated preferences of the receivers, there is a significant discrepancy between a
user's stated dating preference and his/her actual online dating behavior. We further
characterize how users' actual dating behavior deviates from their stated preference.
These results can provide valuable guidelines to the design of a recommendation
engine for potential dates.
Keywords Online dating · User attributes · User behavior analysis · Recommendation · Temporal analysis
1 Introduction
Computer-based matchmaking was pioneered by Operation Match at Harvard
University and Contact at MIT in the mid-1960s [17]. Based on the responses to a
personality questionnaire, a computer program tried to match a user with compatible
dates. Three decades later, starting in the mid-1990s, with the increasing ubiquity of
Internet connectivity and the widespread use of the World Wide Web, online dating
sites emerged as popular platforms for people to look for potential romantic
partners.
The rise of online dating has fundamentally altered the dating landscape and
profoundly impacted people's dating lives. It offers an unprecedented level of access to
potential romantic partners that is otherwise not available through traditional means.
According to a recent survey (see footnote 1), 40 million single people (out of 54 million)
in the US have signed up with various online dating sites such as Match.com and eHarmony,
and around 20 % of currently committed romantic relationships began online, which
is more than through any means other than meeting through friends. A study [13]
conducted by Match.com and Chadwick Martin Bailey shows similar results; it also
finds that, in 2010, more than twice as many marriages occurred between people
who met on an online dating site as between people who met in bars, at clubs, and
at other social events combined.
An online dating site allows a user to create a profile that typically includes the
user's photos, basic demographic information, behavior and interests (e.g., smoking,
1 http://statisticbrain.com/online-dating-statistics.
drinking, hobbies), self-description, and desired characteristics of an ideal partner.
Some sites require a user to complete a personality questionnaire for evaluating the
person's personality type and using it in the matching process. After creating a profile,
a user can search for other people's profiles based on a variety of user attributes,
browse other user profiles, and exchange messages with them. Many sites provide
suggestions on compatible partners based on proprietary matching algorithms.
There is often considerable discrepancy, or dissonance (a concept in social
psychology), between a user's stated preference and his or her actual dating behavior
[4]. Therefore, it is important to understand users' true dating preferences in order to
make better dating recommendations. The message send and reply actions of a user
are strong indicators of what he/she is looking for in a potential partner and reflect
the user's actual dating preferences.
In this paper we study how user online dating behavior correlates with various user
attributes using a real-world dataset obtained through a collaboration with baihe.com,
one of the largest online dating sites in China with a total number of 60 million
registered users. In particular, we address the following research questions:
• Temporal behaviors: How often does a user send and receive messages and how
does this change over time? How long does it take a recipient to reply to a message
he/she received?
• Send behaviors: What is the relationship between the attributes of initiators and
recipients of the initial messages? How does user messaging behavior differ from
random selection? How do users' actual online dating behaviors deviate from their
stated preferences?
• Reply behaviors: How does the reply probability of a message correlate with
various attributes of the sender and receiver? How does the reply probability
depend on the extent to which the sender's attributes match the receiver's stated
preferences?
Main findings: Our study provides a firsthand account of the user online dating
behaviors based on a large dataset obtained from a large online dating site (baihe.com)
in China, a country with a very large population and unique culture. On average, a
male sends out more messages but receives fewer messages than a female. A female
is more likely to be contacted but less likely to reply to a message than a male. The
number of messages that a user sends out and receives per week quickly decreases
with time, especially for female users. On average, a female sends out 37 messages
and receives 18 messages in the first week, and in the eighth week these numbers
drop to 7 and 4 messages, respectively. A male sends out 17 messages and receives 4
messages in the first week, and in the eighth week the numbers drop to 15 and 2 messages,
respectively. Most messages are replied to within a short time frame with a median
delay of around 9 h.
Many of our results on user messaging behavior align with notions in social
and evolutionary psychology [1, 3, 12]. Males tend to look for younger females
while females place more emphasis on socioeconomic status such as the income and
education level of a potential date. As a male gets older, he searches for relatively
younger and younger women. A female in her 20s is more likely to look for older
males, but as a female gets older, she becomes more open towards younger males.
In addition to the above findings, we observe that geographic distance between
two users plays an important role in online dating considerations: 46.5 % of the initial
messages occurred between users in the same city, and for messages that cross the
city boundaries, the volume quickly decreases as users live farther apart. Females
are more likely than males to send and reply to messages between distant big cities.
Profile photos affect males' and females' messaging behaviors differently. Females
with a larger number of photos are more likely to invite messages and secure replies
from males, but the photo count of males does not have as significant an effect in
attracting contacts and replies.
Our results also show that it is important to differentiate between users' true
preferences and random selection. Some user behaviors in choosing attributes in a
potential date may be a result of random selection. For example, while it appears that a
male tends to look for females shorter than he is and a female tends to look for males
taller than she is, the message send and reply behaviors of both genders closely
approximate those resulting from random selection, showing that these behaviors
may result from random selection rather than users' true preferences.
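To make the comparison with random selection concrete, the following minimal sketch builds one possible baseline. It is an illustration under stated assumptions and not necessarily the procedure used in this study: the function names `fraction_taller` and `random_baseline`, the uniform sampling of recipients, and the toy heights are all hypothetical.

```python
import random

def fraction_taller(pairs):
    """Share of (sender_height, recipient_height) pairs in which the sender is taller."""
    return sum(sender > recipient for sender, recipient in pairs) / len(pairs)

def random_baseline(sender_heights, candidate_heights, trials=200, seed=0):
    """Average share of 'sender is taller' pairs when every sender contacts a
    recipient drawn uniformly at random from the candidate pool (assumed null model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pairs = [(h, rng.choice(candidate_heights)) for h in sender_heights]
        total += fraction_taller(pairs)
    return total / trials

# Toy data in cm (hypothetical): every male sender is taller than every female candidate.
male_senders = [170, 172, 175, 180]
female_candidates = [158, 160, 162, 165]
observed_pairs = [(170, 160), (172, 162), (175, 158), (180, 165)]

observed = fraction_taller(observed_pairs)
baseline = random_baseline(male_senders, female_candidates)
print(observed, baseline)
```

In this toy population the observed share and the random baseline coincide, so the apparent tendency of males to contact shorter females would be explained by the height distributions alone rather than by a preference.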
Our results also indicate a significant discrepancy between a user's stated dating
preference and his/her actual online dating behavior. A fairly large fraction of messages
are sent to, or are replies to, users whose attributes do not match the sender's or
receiver's stated preferences. Females tend to be more flexible than males in deviating
from their stated preferences when sending and replying to messages. For both
males and females, out of the population of users that send messages, replies are
more likely to go to users whose attributes come closest to the stated preferences of
the receivers. We further characterize how users' actual dating behavior deviates from
their stated preference. For both male and female users, when they send messages
to people who do not satisfy their stated age requirement, younger users are more
likely to send messages to people older than their stated age preference, while users
in older age groups (especially males) become more likely to send messages to people
younger than their stated preference. Similarly, shorter users are more likely
to send messages to people taller than their stated preference, while taller users are
more likely to send messages to people shorter than their stated preference.
In summary, our results reveal how user message send and reply behaviors
correlate with various user attributes, how these behaviors differ from random selection,
and how users' actual online dating behavior deviates from their stated preferences.
These results on users' dating preferences can provide valuable guidelines for
the design of a recommendation engine for potential dates.
The rest of the paper is structured as follows. Section 2 presents an overview of
previous studies on the data analysis of online dating sites. Section 3 describes the
dataset that we obtained from a major online dating site in China. Section 4 describes
the temporal characteristics of users' online dating behavior. Users' message send
and reply behaviors are studied in Sect. 5. We discuss our main results in Sect. 6.
Finally, we conclude the paper in Sect. 7.
2 Related Work
Fiore et al. [6] analyze people's online dating messaging behavior and find it
consistent with predictions from evolutionary psychology: women state more restrictive
preferences than men and contact and reply to others more selectively. Lin and
Lundquist [11] studied how race, gender, and education jointly shape interaction
among heterosexual Internet daters. They find that racial homophily dominates mate
searching behavior for both men and women. However, this is not the case for Chinese
online daters, where the overwhelming majority of users are of the same race. Finkel
et al. [5] state that online dating has fundamentally altered the dating landscape by
offering an unprecedented level of access to potential partners and allowing users to
communicate before deciding whether to meet them face-to-face. On the other hand,
the authors also argue that there is no strong evidence that matching algorithms pro-
mote better romantic outcomes than conventional offline dating. Part of the problem
is that the main principles underlying these algorithms (typically similarity but also
complementarity) are much less important to relationship well-being than online
sites are willing to assume. He et al. [7] propose two rules (potentials-attract and
likes-attract) to predict user mate choice; their results imply that the likes-attract rule
(based on users' actual behavior) works better than the potentials-attract rule (based on
users' stated preference), which is consistent with our observation to some extent.
Interesting on-the-fly statistics of OkCupid users can be found at the OkTrends blog [14].
Hitsch et al. [9] show that in online dating there is no evidence of users strategically
shading their true preferences. Both male and female users have a strong
preference for similarity along many (but not all) attributes. US users display strong
same-race correlations. There are gender differences in mate preferences; in particu-
lar, women have a stronger preference than men for income over physical attributes.
In their follow-up work [8] they show that stable matches obtained through the
Gale-Shapley algorithm are similar to the actual matches achieved by the dating site,
which are also approximately efficient.
The collaborative filtering algorithm has proved an effective approach in building
recommendation systems based on users' activity history. Zhao et al. [21] and Cai
et al. [2] take the matching of both the tastes and attractiveness between two users into
account, and show that the method can effectively improve the performance of user
recommendation in online dating. Learning users' actual dating preferences based on
their attributes has become a popular methodology in recent studies of reciprocal
recommendation systems. Pizzata et al. [16] propose a content-based algorithm to
calculate compatibility scores between two users based on their attributes and activity
history for recommendation in online dating sites. Li and Li [10] consider both local
utility (users' mutual preference) and global utility (overall bipartite network), and
propose a generalized framework for reciprocal recommendation in online dating
sites. Tu et al. [18] propose a two-sided matching framework for online dating
recommendations and design a Latent Dirichlet Allocation (LDA) model to learn the user
preferences from the observed user messaging behavior and user profile features.
In [19], Xia et al. extract user-based features from user profiles and graph-based
features from user interaction history, and use a machine learning framework to
predict user replying behavior in an online dating network.
In a recent study [20], we investigated how users' online dating behavior correlates
with various user attributes. In this paper, we further extend our previous work by
studying how users' online dating behavior deviates from random selection as well
as from their stated preferences.
3 Dataset Description
We report on a dataset taken from baihe.com, a major online dating site in China.
It includes the profile information of 200,000 users uniformly sampled from users
registered in November 2011. For each user, we have his/her message sending and
receiving traces (who contacted whom at what time) in the online dating site and the
profile information of the users that he or she has communicated with from the date
that the account was created until the end of January 2012.
A user's profile provides a variety of information including the user's gender, age,
current location (city and province), home town location, height, weight, body type,
blood type, occupation, income range, education level, religion, astrological sign,
marriage and children status, number of photos uploaded, home ownership, car own-
ership, interests, smoking and drinking behavior, self introduction essay, among
others. Each user also provides his/her preferences for potential romantic partners in
terms of age, location, height, education level, income range, marriage and children
status, etc.
Of the 200,000 sampled users, 139,482 are males and 60,518 are females,
constituting 69.7 and 30.3 % of the total number of sampled users respectively.
The dataset includes people from 34 countries and all of the provinces, municipalities
(cities directly under the jurisdiction of the central government including
Beijing, Shanghai, Tianjin, and Chongqing), and special administrative regions (Hong
Kong, Macau) in China. Figure 1 illustrates the user geographical locations (at city
level) within China and the inter-city communications between users. Intra- and
inter-city messages constitute 46.5 and 53.5 % of the total message volume in our
data, respectively.
To give a sense of the main user demographic attributes, we plot distributions of
user reported age, height, education level, monthly income range and marriage status
in Fig. 2a–e, respectively.
The youngest user is 19 years old and the largest fraction of users are in their early
20s. While there is a larger fraction of male users than female users below age 25,
the fraction of female users starts to match that of male users for the age range 25–35,
and exceeds that of male users after age 35. The median ages of male and female
users are 25 and 26, respectively.
The height distributions of males and females exhibit a bell shape. The median
heights of males and females are 172 and 162 cm, with standard deviations of 5.4
and 4.7 cm, respectively.
Fig. 1 Inter-city communications of the online dating site within China
The fraction of female users is larger than that of male users for low income
ranges (less than 3,000 Chinese Yuan per month). For higher income ranges, the
trend becomes opposite. In general, males have larger incomes than females in our
dataset. The median income ranges of male and female users are 3,000–4,000 and
2,000–3,000 Chinese Yuan per month, respectively.
With respect to users' education level, females' stated education levels tend to be
higher than males'. About 66.5 % of females state that they have at least a community
college degree in contrast with only 53.2 % of the males. The fraction of users with
stated doctoral and post-doctoral degrees is 0.61 %.
As shown in Fig. 2e, the majority of users in their early 20s are singles. As the
user age increases, the ratio of single users decreases while the ratio of widowed
users increases. The ratio of divorced users first increases with the user ages until
mid-40s and then starts to decrease. In general, the ratios of widowed and divorced
female users are larger than those of male users.
Unlike online dating behaviors in the US, where race plays an important role when it
comes to finding potential romantic partners [11, 14], most of the users (98.9 %) in our
dataset are Han (the ethnic majority in China), and all other ethnic groups comprise 1.1 %
of the users. Moreover, the majority of the users (97.0 %) claim to be non-religious.
Those claiming a religion (Buddhism, Taoism, Catholicism, Islam, etc.) constitute
only 3.0 % of users. Note that the race and religion compositions in our dataset are
Fig. 2 a Age distribution of the male and female users. b Height distribution of the male and female
users. c Cumulative distribution function of users' reported monthly income. d Education level
distribution of the male and female users. e Marriage status distribution of the male and female
users
significantly different from those of online dating sites in the US where there is more
diversity [9, 11].
For each user in our sample, we have the time stamps of the messages as well as
the profile information of users that this user has communicated with. In this paper
we focus on the initial messages exchanged between users. Subsequent messages
between the same pair of users do not represent a new sender-receiver pair and
cannot be used as the only indicator of a continuing relationship, as users may choose
Fig. 3 a Fraction of users who sent out at least one message during a week. b Average number of
messages a user sent out each week given that a user sends at least one message
to go off-line from the site and communicate via other channels (e.g., email, phone,
or meet in person).
4 Temporal Behaviors
We are interested in how a user's online dating activity level changes over time after
he or she registers an account on the online dating site. Since we have only eight
full weeks' worth of online dating data for users who joined on November 30, 2011,
we only consider the activities of each user during the first eight weeks of his/her
membership. The following analysis is based on the activities of the 200,000 users
in the dataset described in Sect. 3.
4.1 Messages Sent from Sample Users
During the eight-week period, 2,089,029 initial messages were sent by 76,654 males
(55.0 % of the males in the dataset) to 508,118 unique females, which in turn gener-
ated 156,774 replies (a reply rate of 7.5 %). During the same time period, 1,217,672
initial messages were sent by 29,535 females (48.8 % of the females in the dataset)
to 440,714 unique male users, which in turn generated 112,696 replies (a reply rate
of 9.3 %).
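The percentages quoted above follow directly from the raw counts. As a quick arithmetic check, the short sketch below recomputes them from the figures given in this section and in Sect. 3; the helper name `percentage` is an assumption introduced only for this illustration.

```python
# Counts quoted in Sect. 3 and Sect. 4.1; no other data is assumed.
MALES_IN_SAMPLE = 139_482
FEMALES_IN_SAMPLE = 60_518

def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Share of sampled users who sent at least one initial message.
male_sender_share = percentage(76_654, MALES_IN_SAMPLE)
female_sender_share = percentage(29_535, FEMALES_IN_SAMPLE)

# Reply rate = replies generated / initial messages sent.
male_reply_rate = percentage(156_774, 2_089_029)
female_reply_rate = percentage(112_696, 1_217_672)

print(male_sender_share, male_reply_rate)      # 55.0 7.5
print(female_sender_share, female_reply_rate)  # 48.8 9.3
```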
The fraction of users from the dataset that sent out at least one message and the
average number of messages sent by each user are shown in Fig. 3a, b, respectively.
We observe that while a considerable fraction of users (51.2 % of males and 43.0 %
of females) sent out at least one message during the first week of their memberships,
the fraction decreases sharply in the second week (down to 11.3 % for males and
12.8 % for females) and further decreases in subsequent weeks. Except for the first
Fig. 4 a CCDF of the number of messages a user sent out during the first eight weeks of his/her
membership. b CCDF of the number of messages a user received during the first eight weeks of
his/her membership
week, females are slightly more likely to send out a message than males on average.
The average number of messages a male sends out each week given that he sends
at least one message lies between 15 and 20 messages per week. While the average
number of messages a female sends given that she sends at least one message is more
than twice that of a male in the first week, it decreases sharply in the second week
and remains relatively stable at a much lower level than that of a male over the next
seven weeks.
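The two weekly quantities discussed above (the fraction of users who are active in a week, and the average number of messages per active user) can be derived from a plain message log. The sketch below is a minimal illustration, assuming a hypothetical log of `(sender_id, week)` records with weeks numbered from 1; the function name `weekly_activity` and the toy log are not taken from the dataset.

```python
from collections import Counter, defaultdict

def weekly_activity(messages, n_users, n_weeks=8):
    """Summarise a log of (sender_id, week) records.

    Returns {week: (fraction_of_users_active, mean_messages_per_active_user)}.
    The mean is conditional on sending at least one message in that week.
    """
    per_week = defaultdict(Counter)
    for sender, week in messages:
        if 1 <= week <= n_weeks:
            per_week[week][sender] += 1

    summary = {}
    for week in range(1, n_weeks + 1):
        counts = per_week[week]
        active = len(counts)
        mean = sum(counts.values()) / active if active else 0.0
        summary[week] = (active / n_users, mean)
    return summary

# Toy log (hypothetical): four registered users, two of whom ever send messages.
toy_log = [("a", 1), ("a", 1), ("b", 1), ("a", 2)]
toy_summary = weekly_activity(toy_log, n_users=4)
print(toy_summary[1], toy_summary[2])
```

Conditioning the average on active users keeps the two effects separate: fewer users stay active over time, while those who remain may still send a similar number of messages each week.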
For both males and females, we obtain the distribution of the number of messages
sent by each user per week given that a user sends at least one message during the
week, and plot its complementary cumulative distribution function (CCDF) in Fig. 4a.
We observe that the distributions exhibit heavy tails. Most users only sent out a small
number of messages: 94.6 % of males and 96.5 % of females sent out fewer than 100
messages during the first eight weeks of their membership. On the other hand, there
are small fractions of users that sent out a large number of messages. According to
the online dating site, most of these highly active users are likely to be fake identities
created by spammers and their accounts have been quickly removed from the site.
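For readers who wish to reproduce this kind of plot, the sketch below computes an empirical CCDF from per-user message counts. It is a minimal illustration, assuming the convention P(X >= x), which keeps every point positive on a log-log plot; the function names and the toy counts are hypothetical.

```python
from collections import Counter

def empirical_ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for the observed values."""
    n = len(values)
    counts = Counter(values)
    remaining = n  # number of observations that are >= the current x
    points = []
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points

def fraction_below(values, threshold):
    """Share of observations strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)

# Toy per-user message counts (hypothetical, not from the dataset).
toy_counts = [1, 1, 2, 5]
toy_ccdf = empirical_ccdf(toy_counts)
print(toy_ccdf)
print(fraction_below(toy_counts, 5))
```

A heavy tail shows up as a CCDF that decays slowly at large values, and `fraction_below(counts, 100)` corresponds to the kind of statement made above about users who sent fewer than 100 messages.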
4.2 Messages Sent to Sample Users
During the same time period, 328,645 initial messages were sent by 94,179 females
to 44,509 males, which in turn generated 58,946 replies (a reply rate of 17.9 %).
1,586,059 initial messages were sent by 288,602 males to 45,623 females, which
in turn generated 150,917 replies (a reply rate of 9.5 %). Note that males are more
likely to initiate contact than females while messages from females are more likely
to generate replies than those from males.
The fraction of users from the dataset that received at least one message and
the average number of messages received by each user during the first eight weeks
of his/her membership are shown in Fig. 5a, b, respectively. We observe that the
Fig. 5 a Fraction of users who received at least one message during a week. b Average number of
messages a user received each week given that a user received at least one message
fractions of both males and females that receive at least one message each week
gradually decrease over time, and that females are much more likely to receive
messages than males. Also, for each week, the average number of messages a user
received generally decreases over time for both genders, and the number of messages
received by a female each week is much larger than that for a male.
For those users that received at least one message during the first eight weeks
of their membership, we show the complementary cumulative distribution function
(CCDF) of the number of messages received by each user for both males and females
in Fig. 4b. We observe that the distributions for both male and female users exhibit a
log-normal-like behavior, and that females tend to receive more messages than male
users.
To investigate how long it takes a user to reply after receiving a message, we
define the reply delay of a message to be the time elapsed from when the message
is sent until the corresponding reply is generated, when there is a reply. The reply
delay may have psychological implications for some people and hence affect
the progress of the communication. Thus it is an important metric to study.
We obtained the reply delay distribution for 209,863 messages replied to by
users within the dataset and plot it in Fig. 6. The reply delay distribution exhibits a
log-normal behavior with a cut-off point around 79,424 min (approximately 56 days
or 8 weeks). Note that the cut-off point is due to the fact that we only have the com-
munication record for each user during the first eight weeks of his/her membership,
so the obtained distribution is limited by this factor.
There is little difference in the reply delay distribution for male and female users.
The median reply delays of males and females are 8.9 and 9.0 h, respectively. Most
messages were replied to within a short time frame. Around 23.0 % of the messages
were replied to within 1 h, and 72.6 % of the messages were replied to within 24 h.
On the other hand, there is a small fraction of the messages with a long reply delay
of tens of days. For example, about 6.3 % of the messages required a week or more
to generate a reply.
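The reply delay and the summary statistics quoted above can be computed along the following lines. This is a minimal sketch, not the authors' code: the (sent_at, replied_at) record layout, the function names, and the toy timestamps are assumptions made for illustration.

```python
from datetime import datetime

def reply_delays_minutes(messages):
    """Return reply delays in minutes for messages that received a reply.

    `messages` is a hypothetical list of (sent_at, replied_at) pairs, where
    replied_at is None when no reply was generated.
    """
    return [
        (replied_at - sent_at).total_seconds() / 60.0
        for sent_at, replied_at in messages
        if replied_at is not None
    ]

def ccdf(values, x):
    """Empirical CCDF: fraction of values strictly greater than x."""
    return sum(v > x for v in values) / len(values)

def fraction_within(values, limit):
    """Fraction of values less than or equal to `limit`."""
    return sum(v <= limit for v in values) / len(values)

# Toy data only; these timestamps are not taken from the dataset.
t0 = datetime(2012, 1, 1, 12, 0)
messages = [
    (t0, datetime(2012, 1, 1, 12, 30)),  # replied after 30 min
    (t0, datetime(2012, 1, 1, 20, 0)),   # replied after 8 h
    (t0, datetime(2012, 1, 3, 12, 0)),   # replied after 2 days
    (t0, None),                          # never replied, excluded
]
delays = reply_delays_minutes(messages)
within_hour = fraction_within(delays, 60)
within_day = fraction_within(delays, 24 * 60)
```

Evaluating ccdf over a logarithmically spaced grid of delays would give a curve of the kind plotted in Fig. 6.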
Fig. 6 CCDF of the reply delay of messages sent by sample users (x-axis: reply delay in minutes,
log scale; series: male reply to female, female reply to male)
5 Message Sending and Replying Behaviors
After a user creates an account on the online dating site, he/she can search for potential
dates based on information within the profiles provided by other users including user
location, age, etc. Once a potential date has been discovered, the user then sends
a message to him/her, which may or may not be replied to by the recipient. The
message sending and replying behaviors of a user are strong indicators of what he/she
is looking for in a potential partner and reflect the user's actual dating preferences.
In this section, we first present the correlation of user send and reply
behaviors with various user attributes including age, height, income, education level,
distance, and photo count. We further examine how actual user behavior deviates
from random selection, where user attributes (e.g., age, height, income, etc.) of the
recipient of a message are randomly drawn from their respective distributions. When
appropriate, error bars show a 95 % confidence interval.
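The random selection baseline can be illustrated with a short simulation. This is a sketch under stated assumptions: it uses age as the attribute, draws each recipient's age uniformly from an empirical pool of candidate ages, and all names and toy values are hypothetical rather than taken from the study.

```python
import random
from collections import Counter

def age_difference_distribution(pairs):
    """Fraction of messages at each age difference (sender age minus receiver age)."""
    counts = Counter(sender - receiver for sender, receiver in pairs)
    total = sum(counts.values())
    return {diff: count / total for diff, count in counts.items()}

def random_selection_baseline(sender_ages, receiver_pool_ages, rng):
    """Pair every sender with a receiver age drawn at random from the pool."""
    return [(s, rng.choice(receiver_pool_ages)) for s in sender_ages]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy (sender_age, receiver_age) pairs and a toy pool of candidate receiver ages.
observed = [(30, 27), (28, 26), (35, 30), (25, 25)]
receiver_pool = [22, 25, 26, 27, 30, 33, 38]

baseline = random_selection_baseline([s for s, _ in observed], receiver_pool, rng)
observed_dist = age_difference_distribution(observed)
baseline_dist = age_difference_distribution(baseline)
```

Comparing observed_dist with baseline_dist bin by bin indicates where actual behavior departs from what random pairing alone would produce.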
At the online dating site, a user can provide his/her preferences for potential
dates in terms of age, location, height, education level, income range, marriage and
children status, etc. In the design of a recommendation algorithm for potential dates, it
is important to know whether and to what extent users follow their stated preferences
in actual dating. The discrepancy between a user's stated preference and his or her
actual dating behavior is often referred to as dissonance in social psychology, and has
been previously observed [4]. In this section, we examine the degree of dissonance
of online dating in our dataset. In particular, we study to what extent users adhere to
their stated preferences and how reply probability varies as a function of the number
of user attributes that match the receiver's stated preference.
Fig. 7 a Distribution of age difference between senders and receivers. b Reply probability for users
with different age difference (x-axis: age difference; curves: observed behavior and random
selection for each gender)
5.1 Age
Figure 7a shows the distribution of the age difference between the sender and receiver
of all messages sent by the sample users in the dataset. The age difference is computed
as the sender's age minus the receiver's age. While the age difference between senders
and receivers covers a wide range, the preferences of males and females run in opposite
directions. Males tend to look for younger females, and the distribution is skewed
toward much younger females. On the other hand, females tend to look for older
males, and the distribution is skewed toward older males. The median age difference
is two years for messages sent from males to females and four years vice versa. Male and female
preferences are not random; both look for potential dates with a smaller age difference
than predicted by random selection.
Figure 7b plots the reply probability as a function of the age difference between the
sender and receiver of a message. For both males and females, the reply probability
deviates significantly from the result of random selection, exhibiting a bell-shaped curve
with a mode at an age difference of ten years older and eight years younger, respectively.
Males tend to reply to younger females while females tend to reply to older males
within a certain range of age difference.
Figure 8 depicts the heat map of the fraction of messages and reply probabilities
between users of different ages. As a male gets older, he searches for and replies to
relatively younger females. A female in her 20s is more likely to communicate with
older males, but as a female gets older, she becomes more open towards younger
males. This explains the increase in reply probability in the age difference range
from 3 to 10, as shown in Fig. 7b. These results are consistent with observations
made in [6].
Fig. 8 Heat map of fraction of messages and reply probabilities between users of different age:
a fraction of messages sent from males to females, b fraction of messages sent from females to
males, c reply probability of males to females, d reply probability from females to males
5.2 Height
Figure 9a shows the distribution of height difference between the sender and receiver
of all messages sent by sample users. The height difference is computed as the
sender's height minus the receiver's height. We observe that users' message sending
behaviors with respect to height closely match those resulting from random selection.
While it appears that a male tends to look for females shorter than him and a female
tends to look for males taller than her, this is likely to be a result of random selection
rather than user preference.
Figure 9b plots the message reply probability as a function of height difference
between the senders and receivers. Similarly, user message reply behavior with
respect to height closely matches that of random selection, and is thus likely to be the
result of random selection rather than user preference.
Fig. 9 a Distribution of height difference between senders and receivers. b Reply probability for
users with different height difference (x-axis: height difference in cm; curves: observed behavior
and random selection for each gender)
Fig. 10 a Distribution of income difference between senders and receivers (x-axis: income
difference in Chinese Yuan). b Reply probability for senders with different incomes (x-axis:
sender income in Chinese Yuan)
5.3 Income
Figure 10a shows the distribution of income difference between senders and receivers.
A user reports monthly income within a range such as below 2,000, 2,000–3,000 (all
in Chinese Yuan), etc. We take the median value of the reported income range as a
user's income, and the income difference between the sender and receiver of a message
is computed as the difference between sender income and receiver income. We observe that
user message sending behavior with respect to income closely matches that resulting
from random selection. While it appears that males tend to send messages to females
with lower income and females tend to send messages to males with higher income,
this is likely to be a result of random selection and the fact that male incomes are
larger than female incomes, rather than user preference.
Figure 10b shows how reply probability varies with sender income. The reply
probability of female recipients increases with male sender income, deviating
Fig. 11 a Fraction of messages sent to users of different education levels. b Reply probability for
messages from users of different education levels (education levels: junior high school, vocational
school, high school, junior college, bachelor, master, doctor, post-doctor)
significantly from the flat line of random selection. There is a strong correlation
(coefficient of 0.90) between the reply probability and male sender income. On the
other hand, the income of a female does not have as significant an effect on the likelihood
of her messages being replied to. The reply probability fluctuates around the
line of random selection. The correlation between the reply probability and female
sender income is much weaker, with a correlation coefficient of 0.50.
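A correlation coefficient of this kind can be obtained by pairing a representative income for each bracket with the reply probability observed for that bracket. The sketch below assumes a Pearson coefficient and uses the midpoint of each bracket as its representative value; the text does not state which coefficient was used, and the reply probabilities shown are invented toy values, not results from the study.

```python
import math

def bracket_midpoint(low, high):
    """Representative income for a reported range, taken as its midpoint."""
    return (low + high) / 2

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Income brackets in Chinese Yuan, as named on the axis of Fig. 10b.
brackets = [(2000, 3000), (3000, 4000), (4000, 5000), (5000, 7000)]
incomes = [bracket_midpoint(low, high) for low, high in brackets]

# Hypothetical reply probabilities per bracket, for illustration only.
toy_reply_probability = [0.10, 0.12, 0.13, 0.16]
r = pearson(incomes, toy_reply_probability)
```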
5.4 Education Level
Figure 11a shows the fractions of messages sent to users of different education levels.
We observe that male behavior closely matches that of random selection, while
female behavior deviates considerably from that of random selection towards higher
education levels.
Figure 11b shows how reply probabilities vary with sender education levels for
males and females. The higher the education level of a male sender, the more likely
his messages will be replied to. The reply probability of a female user deviates
significantly from that of random selection. On the other hand, the education level of a
female does not have as significant an effect on the likelihood of her messages being
replied to. The reply probability of male users stays relatively flat across different
education levels, similar to that resulting from random selection.
5.5 Geographic Distance

The geographic distance between two users plays an important role in their online
dating behavior. As mentioned in Sect. 3, a considerable portion (46.5 %) of the
communications occurred between users within the same city. For communications
Fig. 12 a Distribution of messages of different sender-receiver distances. b Reply probability for
users with different distances (x-axis: distance in km)
between users in different cities, we further study how message sending behavior and
reply probability vary with the distance between users (computed as the straight
line distance between the two cities).
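One common way to obtain such a city-to-city distance is the great-circle (haversine) formula applied to city coordinates. The sketch below is an assumption about how the distance could be computed, not a description of the authors' procedure; the coordinates are approximate, and the 200 km bin width mirrors the axis labels of Fig. 12a.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_bin(distance_km, width=200):
    """Label of the fixed-width distance bin, e.g. '1000-1200'."""
    low = int(distance_km // width) * width
    return f"{low}-{low + width}"

# Approximate coordinates of two of the cities named in the text.
beijing = (39.90, 116.41)
shanghai = (31.23, 121.47)
d = haversine_km(*beijing, *shanghai)
label = distance_bin(d)
```

With these approximate coordinates the Beijing to Shanghai distance falls inside the 800 to 1,400 km range discussed in this section.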
As shown in Fig. 12a, in general the fraction of messages decreases as the distance
between users increases. Messages between users at least 1,000 km apart
constitute only a small fraction (11.7 %) of the total number of messages. Note
that there is a small increase in the fraction of messages between distance 800 and
1,400 km for female senders.
Figure 12b depicts how reply probability varies with distance between a sender
and receiver. When a male receives a message from a female, the reply probability
generally decreases with distance between them. For females, the reply probability
first decreases with distance but increases in the range from 800 to 1,400 km.
The increase in the initial message ratio and reply probability of females for the
distance range from 800 to 1,400 km can be explained as follows. There is an increasing
number of big cities (Shanghai, Beijing, Hong Kong, Chongqing, Guangzhou, Xi'an,
etc.), and the distance between many pairs of them falls into this range; unlike males,
females are more likely to send and reply to messages between these cities.
5.6 Photo Count
On the dating site, a user can post photos on his/her profile page. Figure 13a plots
the distribution of the number of photos posted by a user. A large fraction of users
posted no photos or only a small number of photos. In our dataset, about 69 % of
male users and 59 % of female users did not post any photos. Female users tend to
post more photos than male users.
As shown in Fig. 13b, a user tends to receive more messages if he/she has posted
more photos online, with the trend being more pronounced for females than for
Fig. 13 a Distribution of users' photo count. b Average number of received messages during first
eight weeks of their memberships for users with different photo count. c Reply probability for users
with different photo counts (x-axis: number of photos)
males. The number of received messages by a male user starts to level off after some
point.
Figure 13c shows how message reply probability varies with the number of photos
posted by the sender. We observe that male reply probability tends to increase with
the number of photos posted by the female sender. Interestingly, when a female
receives a message, the reply probability remains relatively stable as the number of
photos of the male sender increases.
5.7 Stated Preference Versus Actual Behavior
On the online dating site in our study, a user can specify a set of attributes that he/she is
looking for in a date, including age range, geographic location, height range, marriage
status (never married, divorced), education level, income range, house ownership,
and children status (no children, children living with user, children not living with
user).
Fig. 14 Fraction of reply messages that violate the sender's stated preference as a function of time
(x-axis: week; y-axis: unmatch ratio)
There is often considerable discrepancy between a user's stated preference and
his or her actual dating behavior [4]. Therefore, it is important to understand users'
true dating preferences in order to make better dating recommendations. In this
section we study to what extent users adhere to their stated preferences and how
reply probability varies as a function of the number of user attributes that match
the receiver's stated preference.
Figure 14 shows the fraction of replied and unreplied messages whose senders do
not satisfy the recipient's stated preference in at least one user attribute as a function
of time. We refer to this fraction as the unmatch ratio. Among all replied messages,
the unmatch ratio is around 55 % for males and more than 70 % for females. The
discrepancy between a user's stated dating preference and his/her actual behavior
is prevalent, with female users showing more flexibility than male users. In fact,
during the first eight weeks of their memberships, only 17.0 % of male users and
6.6 % of female users strictly followed their stated preferences when replying to
a sender. We also observe that the unmatch ratio is larger for messages not replied
to than for those replied to. This indicates that, out of the population of users that send
messages, replies are more likely to go to those whose attributes come closest to the
preferences of the receivers.
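The unmatch ratio defined above can be sketched as follows. The representation is hypothetical: each message is reduced to the sender's attributes and the receiver's stated preference, where a preference is either a (min, max) range or a set of accepted values. The attribute names and toy records are illustrative and do not reflect the site's actual schema.

```python
def unmatched_attributes(sender, preference):
    """Return the sender attributes that fall outside the receiver's stated preference.

    `preference` maps an attribute name to a (min, max) tuple or to a set of
    accepted values; attributes absent from it are treated as unconstrained.
    """
    violated = []
    for attribute, wanted in preference.items():
        value = sender[attribute]
        if isinstance(wanted, tuple):
            low, high = wanted
            satisfied = low <= value <= high
        else:
            satisfied = value in wanted
        if not satisfied:
            violated.append(attribute)
    return violated

def unmatch_ratio(messages):
    """Fraction of messages whose sender violates at least one stated preference."""
    flagged = sum(1 for sender, pref in messages if unmatched_attributes(sender, pref))
    return flagged / len(messages)

# Toy records, for illustration only.
preference = {"age": (25, 32), "height": (165, 180), "marriage": {"never married"}}
sender_a = {"age": 30, "height": 170, "marriage": "never married"}
sender_b = {"age": 36, "height": 170, "marriage": "divorced"}
messages = [(sender_a, preference), (sender_b, preference)]
ratio = unmatch_ratio(messages)
```

Counting how often each attribute appears in the output of unmatched_attributes would give per-attribute ratios of the kind reported in Fig. 15.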
Figure 15a, b show the unmatch ratio for each user attribute in a decreasing order
for both male and female users, respectively. We observe that males and females
share the same top-three most violated user attributes: age, location and height. For
male users, the unmatch ratios of other attributes are all very low (below 5 %), while
female users are most strict with marriage and children status, as well as education
level of the male senders. For each attribute, the unmatch ratio is larger for messages
not replied to than those replied to, indicating that replies are more likely to go to
users whose attributes come closest to the preferences of the receivers.
Figure 16a, b plot the fraction of the messages whose receivers do not meet the
sender's stated preference in each week. For males, the unmatch ratios for most of
Fig. 15 Unmatch ratio of different user attributes in reply messages sent by a male users and
b female users (x-axis: attributes; y-axis: unmatch ratio)
Fig. 16 Unmatch ratio of different user attributes in each week sent by a male users and b female
users
the user attributes remain relatively stable, including the income, house ownership,
marriage status and children status. The unmatch ratio for age decreases from 23.0
to 16.9 % during the eight-week period, while the unmatch ratios for height and
education increase by a small amount during the same time period. For females,
the unmatch ratio for age remains relatively stable, while for other attributes, they
are more likely to follow their stated preference in the first week but become more
flexible afterwards.
At the online dating site in our study (baihe.com), a user can specify his or her age
and height preference by setting the maximum and minimum values. In Fig. 17, we
plot the difference between the age of receivers who do not meet the age requirement
of the sender and the age range specified by the sender for different sender age
groups. For both male and female users, when they send messages to people who do
not satisfy their stated age requirement, younger users (20–29 years old) are more
likely to send messages to people older than their stated age preference, while users
in older age groups (especially males) become more likely to send messages to people
younger than their stated preference.
Fig. 17 The age difference between users' actual behavior and their stated preference for different
age group senders: a 20–29, b 30–39, c 40–49
Similarly, as shown in Fig. 18, shorter users (150–159 cm) are more likely
to send messages to people taller than their stated preference, while taller users are
more likely to send messages to people shorter than their stated preference.
Figure 19 shows how male and female reply probabilities vary as a function of the
number of sender attributes that match the receiver's stated preference. The margin
of error is provided with a 95 % confidence level. We observe that, except for the
case where there is no matching attribute, the reply probability increases with the number
of matched user attributes, indicating that both males and females tend to reply to
senders whose attributes best match their stated preferences. Note that although the
reply probability for zero matching user attributes is larger than that for one matching
user attribute, the sample size of users with zero matching attributes is rather small, and
thus the corresponding margin of error is too large for the comparison to be statistically
sound.
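The effect of sample size on the margin of error can be made concrete with a small calculation. This sketch assumes the normal approximation to a binomial proportion with z = 1.96 for a 95 % confidence level; the text does not say which interval the authors used, and the records below are toy values.

```python
import math
from collections import defaultdict

Z_95 = 1.96  # normal quantile for a two-sided 95 % confidence level

def reply_probability_by_matches(records):
    """Map number of matched attributes to (reply probability, margin of error, sample size).

    `records` is a hypothetical iterable of (num_matched_attributes, replied) pairs.
    """
    sent = defaultdict(int)
    replied = defaultdict(int)
    for matches, was_replied in records:
        sent[matches] += 1
        replied[matches] += int(was_replied)
    stats = {}
    for matches in sorted(sent):
        n = sent[matches]
        p = replied[matches] / n
        margin = Z_95 * math.sqrt(p * (1 - p) / n)
        stats[matches] = (p, margin, n)
    return stats

# Toy data: very few messages with zero matches, many with one match.
records = [(0, True), (0, False)] + [(1, True)] * 10 + [(1, False)] * 90
stats = reply_probability_by_matches(records)
```

In this toy case the zero-match group has the higher point estimate, but its margin of error is so wide that the comparison with the one-match group is not meaningful, which is the situation described above.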
Figure 20a, b compare the reply probabilities of two different scenarios where a
sender's attribute matches or does not match the receiver's stated preference, respectively.
As expected, we observe that for both males and females, the reply probability
is larger when the sender's attribute matches the receiver's stated preference.
Fig. 18 The height difference between users' actual behavior and their stated preference for different
height group senders: a 150–159 cm, b 160–169 cm, c 170–179 cm
6 Discussion
Part of our results on user messaging behavior align with notions in social and
evolutionary psychology [1, 3, 12]. Males tend to look for younger females but do
not seem to care much about the socioeconomic status such as income and education
level of a potential date. On the other hand, females tend to look for older males
and place more emphasis on the socioeconomic status of a potential date. Moreover,
we observe that as a male gets older, he searches for relatively younger females. A
female in her 20s is more likely to look for older males, but as a female gets older,
she becomes more open towards younger males.
Online dating sites significantly increase the level of access to potential dates in
terms of geographic location compared with traditional means. In our dataset, a considerable
fraction (53.5 %) of the initial messages crossed city boundaries while the
remaining 46.5 % occurred between users in the same city. Users still prefer dates in
close proximity. For inter-city messages, the sending volume and reply rate quickly
decrease as users live farther apart. Compared to male users, females are more likely
to send and reply to messages between distant big cities (e.g., Beijing, Shanghai,
Hong Kong, Guangzhou, Xi'an, etc.).
Fig. 19 Reply probability as a function of the number of user attributes that match the receiver's
stated preference
Fig. 20 Reply probabilities for scenarios where the sender's attribute matches or does not match the
a male and b female receiver's stated preference
On the online dating site, a user can post his/her own photos and view other users'
photos. However, profile photos affect males' and females' messaging behaviors differently.
Females with a larger number of photos are more likely to attract messages and secure
replies from males, but the photo count of males does not have as significant an effect
in attracting contacts and replies.
In the analysis of users' dating preferences, our results show that it is important
to differentiate between user dating preferences and the results of random selection.
Some user behaviors in choosing attributes in a potential date may be largely explained
by random selection. For example, while it appears that a male tends to look for
females shorter than him and a female tends to look for males taller than her, the
message sending and replying behaviors of both genders closely approximate those
resulting from random selection, showing that these may be partly due to random
selection rather than users' true preferences. Similar observations have been made
for the behaviors of male users in terms of choosing the income and education level
of a potential date, while the corresponding female behaviors deviate significantly
from random selection and thus reflect their true preferences.
Our results also show that there is a significant level of discrepancy between a user's
stated dating preference and his/her actual online dating behavior. A fairly large
fraction of messages are sent to or replied to users whose attributes do not match
the sender's or receiver's stated preferences. Females tend to be more flexible than
males in following their stated preferences when sending and replying to messages.
Both male and female users share the same top-three most violated user attributes:
age, location and height. For male users, the unmatch ratios of other attributes are all
very low (below 5 %), while female users are most strict with marriage and children
status, as well as the education level of the male senders. For both males and females,
out of the population of users that send messages, replies are more likely to go to
users whose attributes come closest to the preferences of the receivers.
7 Conclusion
We study how people's online dating behaviors correlate with various user attributes
using a large real-world dataset from a major online dating site in China. Many of
our results align with notions in social and evolutionary psychology. In particular,
males tend to look for younger females while females put more emphasis on the
socioeconomic status (e.g., income, education level) of a potential date. Moreover,
geographic distance between two users and the photo count of users play important
and different roles in the dating behaviors of males and females. Our results show that it is
important to differentiate between users' true preferences and the results of random
selection. Some user behaviors in choosing attributes in a potential date may be a
result of random selection. Our results also show that there is significant discrepancy
between a user's stated dating preference and his/her actual online dating behavior.
Our study provides a firsthand account of user online dating behaviors in China,
a country with a large population and unique culture. These results on users' dating
preferences can provide valuable guidelines for the design of recommendation engines
for potential dates.
Acknowledgments This work was supported by the NSF grant CNS-1065133 and ARL Coop-
erative Agreement W911NF-09-2-0053. The views and conclusions contained in this document
are those of the authors and should not be interpreted as representing the official policies, either
expressed or implied of the NSF, ARL, or the US Government.
References
1. Buss DM (1989) Sex differences in human mate preferences: evolutionary hypotheses tested in
37 cultures. Behav Brain Sci 12:1–49
2. Cai X, Bain M, Krzywicki A, Wobcke W, Kim YS, Compton P, Mahidadia A (2011) Collaborative
filtering for people to people recommendation in social networks. Adv Artif Intell
476–485
3. Eagly AH, Wood W (1999) The origins of sex differences in human behavior: evolved dispositions
versus social roles. Am Psychol 54:408–423
4. Eastwick PW, Finkel EJ (2008) Sex differences in mate preferences revisited: do people know
what they initially desire in a romantic partner? J Pers Soc Psychol 94:245–264
5. Finkel EJ, Eastwick PW, Karney BR, Reis HT, Sprecher S (2012) Online dating: a critical
analysis from the perspective of psychological science. Psychol Sci Public Interest 13:3–66
6. Fiore AT, Taylor LS, Zhong X, Mendelsohn GA, Cheshire C (2010) Who's right and who writes:
people, profiles, contacts, and replies in online dating. In: Proceedings of Hawaii international
conference on system sciences
7. He Q, Zhang Z, Zhang J, Wang Z, Tu Y, Ji T, Yi T (2013) Potentials-attract or likes-attract in
human mate choice in China. PLoS ONE 8(4):e59457
8. Hitsch GJ, Hortaçsu A, Ariely D (2010) Matching and sorting in online dating. Am Econ Rev
100:130–163
9. Hitsch GJ, Hortaçsu A, Ariely D (2010) What makes you click? Mate preferences in online
dating. Quant Mark Econ 8:393–427
10. Li L, Li T (2010) MEET: a generalized framework for reciprocal recommender systems. In:
Proceedings of ACM international conference on information and knowledge management
11. Lin K-H, Lundquist J Mate selection in cyberspace: the intersection of race, gender, and education.
Am J Sociol (forthcoming)
12. Luo S, Klohnen EC (2005) Assortative mating and marital quality in newlyweds: a couple-
centered approach. J Pers Soc Psychol 88:304–326
13. Match.com, Bailey CM (2010) Recent trends: online dating
14. OkTrends. http://www.okcupid.com
15. Online dating statistics. http://www.statisticbrain.com/online-dating-statistics/
16. Pizzato L, Rej T, Chung T, Koprinska I, Kay J (2010) RECON: a reciprocal recommender for
online dating. In: Proceedings of ACM conference on recommender systems
17. Slater D (2013) Love in the time of algorithms. Penguin Group, New York
18. Tu K, Ribeiro B, Jensen D, Towsley D, Liu B, Jiang H, Wang X (2014) Online dating recommendations:
matching markets and learning preferences. In: Proceedings of 5th international
workshop on social recommender systems, in conjunction with 23rd international world wide
web conference
19. Xia P, Jiang H, Wang X, Chen C, Liu B (2014) Predicting user replying behavior on a large
online dating site. In: Proceedings of 8th international AAAI conference on weblogs and social
media
20. Xia P, Ribeiro B, Chen C, Liu B, Towsley D (2013) A study of user behaviors on an online
dating site. In: Proceedings of the IEEE/ACM international conference on advances in social
networks analysis and mining
21. Zhao K, Wang X, Yu M, Gao B (2014) User recommendation in reciprocal and bipartite social
networks: a case study of online dating. In: Proceedings of intelligent systems. IEEE
Latent Tunnel Based Information Propagation
in Microblog Networks
Chenyi Zhang, Jianling Sun and Ke Wang
Abstract Information propagation in a microblog network aims to identify a set
of seed users for propagating a target message to as many interested users as possible.
This problem differs from the traditional influence maximization in two major ways:
it has a content-rich target message for propagation and it treats each link in the
network as communication on certain topics and emphasizes the topic relevance
of such communication in propagating the target message. In realistic situations,
however, the topics associated with a link are not explicitly expressed but are hidden
in the microblogs previously exchanged through the link. In this paper, we present
a topic-aware solution to information propagation in a microblog network. We first
model the latent topic structure of the network using observed microblog messages
published in the network. We then present two methods for estimating the propagation
probability based on the topic relevance between a link and the target message. Once
the propagation probability is estimated, we adopt the standard greedy algorithm for
influence maximization to find seed users. This approach is topic-aware in that the
target message finds its way of propagation according to its topic relevance to the
latent topic structure in the network. Experiments conducted on real Twitter datasets
suggest that the proposed methods are able to select the right seed users.
Keywords Information propagation · Microblog networks · Topic modelling ·
Influence maximization · Content analysis
C. Zhang (B) · J. Sun
College of Computer Science, Zhejiang University, Hangzhou, China
e-mail: zhangchenyi.zju@gmail.com
J. Sun
e-mail: sunjl@zju.edu.cn
C. Zhang · K. Wang
School of Computing Science, Simon Fraser University, Burnaby, Canada
e-mail: wangk@cs.sfu.ca

DOI 10.1007/978-3-319-12188-8_10
220 C. Zhang et al.
1 Introduction
With the rapid growth of social network services and applications such as Facebook,
Twitter and Weibo, research on social networks and social media is becoming a hot
area. Microblogging services offer a real-time platform to update personal status and
share information with friends. Consequently, information propagates over a social
network through homophily [19] and word-of-mouth (WOM) [12]. One example is
social advertising [14], which uses users' relationships, interests and published
data to target advertisements at potential users. Social advertising, viewed as a kind
of recommendation system for sharing information between friends, has
begun to attract attention in recent years [1]. Microblogs, also called microposts,
allow users to exchange small elements of content such as short sentences, individual
images, or video links. Microbloggers post about topics ranging from the simple,
such as "what I'm doing right now", to the thematic, such as "sports cars".
Commercial microblogs also exist to promote web sites, services and products, and to
promote collaboration within an organization. The study in [21] reports that
88 % of all marketers indicated that their social media efforts have generated
more exposure for their businesses.
1.1 Information Propagation
In this paper, we consider the problem of leveraging the abundant microblogs
maintained by microblogging services to deliver some target information to
microbloggers, or simply users. This problem, termed information propagation,
can be stated as follows: given a microblog network with previously exchanged
microblogs among users, we want to identify k seed users to propagate a target text
message to as many users as possible in the network. The target message can be
any text message such as a tweet, a web page, or an advertisement. Two related but
different problems studied in the literature are influence maximization and social
contagion. Influence maximization [11] aims to select a specified number of
seed users who could influence the largest number of users in a social network. Social
contagion [24] refers to the phenomenon of information diffusion, such as the diffusion
of political opinions and the adoption of new technologies. Both problems leverage
the link structure of a social network, but not the messages exchanged among users, to
influence more users. For example, traditional models of social contagion assume that
the probability that an individual is affected grows monotonically with the size of
his or her neighborhood, whereas the recent study in [24] suggests that this probability
grows with the number of connected components in the individual's neighborhood,
not the size of the neighborhood.
Information propagation differs from these existing problems in two major ways:
it has a content-rich target message for propagation, and it treats each link in a
microblog network as communication on certain topics and emphasizes the topic
Latent Tunnel Based Information Propagation in Microblog Networks 221
Fig. 1 Social network with simple links (left) and latent topics (right)
relevance of such communication when propagating the target message. The basic
assumption in information propagation is that a target message is more likely to be
forwarded or retweeted if it is interesting to both the sender and the recipient, and
an interested user is more likely to react to a message (e.g., buying the advertised
product). This assumption is consistent with previous studies that social influences
are associated with certain topics [22] and marked tags or labels are useful for social
interest discovery [15].
To illustrate the differences from influence maximization, Fig. 1 shows a social
network with link structure (left) and the network with links representing communication
on certain topics (right), where each color represents a topic and the width
of a link represents the intensity of the topic. Suppose that we want to propagate a
target message on the topic corresponding to the yellow color. User 3 is then likely
to be the best seed user to start the propagation because the message could reach
two other users, namely user 1 and user 4. However, if this problem is treated as the
traditional influence maximization, user 1 will be selected as the seed user because of
its maximum out-degree, despite the fact that user 1 will not forward the message due
to the lack of out-going communication on this topic. In this example, information
propagation depends on not only the link structure of the network, but also the nature
of a link in terms of the topics of the information exchanged.
To our knowledge, propagation of content-rich messages in a microblog network
in a topic-aware manner has not been considered previously. The challenge is that
the topics for messages and links are not explicitly expressed in a real life microblog
network where only exchanged messages are observed. Manually labeling the topics
for all messages and links, even for a training set, is unrealistic because expensive
user involvement is required. The key to information propagation is to extract the
hidden topics from the observable published messages in a microblog network and
leverage them for identification of seed users.
1.2 Contributions
Our contributions are as follows:
• We define the information propagation problem: given a target message m and a
  positive number k, we want to identify k seed users in the microblog network for
  propagating m, with the goal of reaching as many users as possible. Our assumption
  is that published microblogs implicitly convey the topics of communication
  represented by a link and that propagation of the target message depends on the
  topic relevance of such communication to the target message.
• We apply the standard topic modeling technique, Latent Dirichlet Allocation
  (LDA) [2], to a microblog network to unveil the latent topics associated with
  social links. The outcome is the topic distribution for each link,
  which serves as an explanation of the nature of a link in information flow. Our key
  insight is to apply LDA not to the messages of each link individually, but to the
  whole collection of the messages for all links. We will explain the reasons for this
  approach.
• We propose two methods to estimate the propagation probability of a link based
  on the topic relevance between the target message and the link. Once the propagation
  probability is estimated, we adopt the generic greedy algorithm for identifying
  seed users for the target message.
• We evaluate the performance of the proposed methods by comparing various
  probability estimation methods. Our study suggests that the proposed topic-aware
  propagation method selects more relevant seed users than traditional influence
  maximization methods.
In the rest of the paper, we review related work in Sect. 2, present the topic
modeling for microblog networks in Sect. 3, present the propagation probability
estimation and seed user selection in Sect. 4, and evaluate the proposed methods in
Sect. 5. Finally, we conclude the paper.
2 Related Work
2.1 Social Networks
One of the most robust findings in social networks is homophily [19] (i.e., "love of the
same"), the tendency of individuals to associate and bond with similar others. Based
on homophily, users tend to share interesting messages from their friends, and such
messages spread from one person to another in the style of a biological epidemic. In addition to the link
structure like in all social networks, a microblog network has its own characteristics,
i.e., exchanges of abundant but short messages and interpersonal activities such as
mentions and retweets. Such messages and activities convey certain important information
about the users involved and play an important role in analyzing information
diffusion in microblog networks. For example, [26] discussed information diffusion
on Twitter via users' ongoing social interactions, and [27] considered the content of
retweet messages to improve topic mining for microblogs. To our knowledge, however,
the exploitation of published microblog messages for improving the propagation
of a target message has not been formally studied.
2.2 Topic Modeling
Probabilistic topic models such as LDA were introduced by [2]. [18] presented the
Author-Recipient-Topic (ART) model to learn the distribution specific to author-
recipient pairs. [23] proposed a supervised learning approach to categorize links and
quantify influence of web pages. Neither work considered information propagation.
The supervised learning approach requires a training data set that is a link-labeled
and link-weighted graph. Our work does not require such training data because it
works directly on the microblog messages published by users.
Topic modeling has been used to predict social influences between users. [22]
developed topical affinity propagation to model the topic-level social influence based
on information of nodes, which was extended to heterogeneous networks in [16].
These methods assumed a given topic distribution for each node and found all topic
level influence networks $G_z(V_z, E_z)$ for every topic z, where $V_z$ is a subset of nodes
that are related to topic z and $E_z$ is the set of pair-wise weighted influence relations
over $V_z$. These works did not consider propagation of information, which is the focus
of our work. They assumed that the topic distribution is given for each user, whereas
we assume that the topic distribution is hidden in the messages exchanged between
users (thus, links). These works considered social influences for one topic at a time,
whereas we treat each link as communication involving a topic distribution, instead
of a single topic. The works in [20, 25] proposed PageRank-based algorithms to find
influential topics in Twitter and citation networks. Again, these works did not consider
information propagation.
2.3 Influence Maximization
Influence maximization, proposed in [11], aims to identify a set of seed users who
could influence the largest number of other users in a social network. Two popular
influence propagation models are the Independent Cascade Model and the Linear Threshold
Model. These models assume an influence probability based on simple heuristics, such
as uniform probability or probability proportional to the degree of a node. Moreover,
this problem neither has a target message nor considers the topics for a link. Most
previous works focused on improving the efficiency of greedy algorithms [3, 7,
13, 17], such as the CELF optimization based on the submodularity of incremental
influences [7, 13].
Table 1 Notation
Symbol        Definition
$G(V, E)$     A microblog network graph
$T$           Number of topics
$W$           Number of words in dictionary
$M$           Number of microblogs
$N$           Number of word occurrences in all microblogs
$k$           Number of seed users to be selected
$m$           Index of a message
$e$           Index of a social link, $e \in \{1, \ldots, |E|\}$
$\theta_e$    T-dimensional topic distribution for a social link e: $\{\theta_e(1), \ldots, \theta_e(T)\}$, where $\sum_j \theta_e(j) = 1$
$\phi_j$      W-dimensional word distribution for a topic j: $\{\phi_j(1), \ldots, \phi_j(W)\}$, where $\sum_w \phi_j(w) = 1$
$w$           N-dimensional vector representing the word occurrences in all microblogs, where $w_i \in [1, \ldots, W]$ and $i \in [1, \ldots, N]$
$z$           N-dimensional vector of the topic indicator for all word occurrences in all microblogs, where $z_i \in [1, \ldots, T]$ indicates the topic of the word occurrence $w_i$ in $w$
$P_e$         Propagation probability for a social link e
Our work is closely related to the work on inferring the influence probability of
a link. [5] used an action log to infer the influence probability of a link, where an
action refers to a pre-determined activity such as joining a group. In the case that
such actions are not explicitly captured in the social network, acquiring the action log
requires the assistance from external information sources. Our work does not require
such action logs. The works in [4, 6] used time decay to infer influence probability.
Our work can be considered as a new way of estimating propagation probability
by taking into account the topics of the microblog messages readily available in a
microblogging service.
In summary, our work differs from previous works in two major aspects: we model
a social network as a network of content oriented communication, and we propose a
topic-aware estimation for propagation probability for such networks (Table 1).
3 Topic Modeling for Microblog Networks
The first step of our method is to extract the latent topics on social links in a microblog
network. We discuss first the extraction of microblog messages for a link and then
topic modeling for such messages. The outcome is the topic distribution for each link
and the word distribution for each topic. These distributions are used to estimate the
propagation probability of a link for the target message in the next section.
3.1 Content Based Social Links

There are two options for modeling the topics in a network: model the topics
(i.e., interests) for each user, and model the topics for each link. Since we are
interested in the topics for relationships, we adopt the second option. An example
illustrates our choice. Consider two users A and B who are colleagues and have
the same interests in three topics: "work", "travel" and "movie". However, the two
users have only exchanged messages related to "work" (because they wish to
limit their communication to work only). In this case, user-level topic modeling
would suggest that these users will influence each other on the topics of "travel" and
"movie" as well, which is a mistake because the two users did not communicate on
these topics. This example shows that link-level topic
modeling is a more natural choice for our purpose.
Microblog messages can be divided into three categories according to [10]:
broadcast messages, conversation messages, and retweet messages. A broadcast
message is published on the wall by some user A without a specific recipient. A
conversation message is also published on the wall by a user A but a notification is
sent to some user B to alert the publication. A message published by A is retweeted
by a user B when B re-publishes the message on the wall, in which case the user
A will get a notification that B has retweeted the message. All three types of messages
can be viewed publicly, but only conversation and retweet messages involve an
explicit information flow from a user A to a user B. Since a broadcast message has
no explicit information flow between two users, it is hard to verify if any other user
has an interest in a broadcast message. For these reasons, we shall use all three types
of messages for topic modeling, but use only conversation and retweet messages to
determine the topics for a link.
Conversation and retweet messages can be identified using special symbols within
a message. For example, if the user A publishes the conversation message "@B, Can
you lend me a book on data mining", the user B will get a notification that A has
published the message. This message flow can be observed on the social link A → B.
Similarly, from the retweet message published by the user A, "Good job RT @B I
have finished this experiment", the user B will get a notification that A has retweeted
the message. This message flow can be observed on the social link B → A. The
structural symbol "@", called the contactor factor, indicates the contactor or recipient
of a message, and the structural symbol "RT", called the relation factor, indicates the
forward or quote relation [27]. Having clarified the above, we assume that each link
e is associated with a set of conversation and retweet messages.
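The identification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions, the function name `classify`, and the choice to test for "RT @name" before a leading "@name" are hypothetical and are not taken from the chapter.

```python
import re

# Assumed surface patterns: "RT @name" marks a retweet (relation factor),
# a leading "@name" marks a conversation message (contactor factor).
RETWEET = re.compile(r"\bRT\s+@(\w+)")
MENTION = re.compile(r"^\s*@(\w+)")

def classify(author, text):
    """Return (kind, link); link is a (source, target) pair or None."""
    found = RETWEET.search(text)
    if found:
        # The author retweets B, so information flowed from B to the author.
        return "retweet", (found.group(1), author)
    found = MENTION.match(text)
    if found:
        # The author addresses B, so information flows from the author to B.
        return "conversation", (author, found.group(1))
    # No explicit recipient: a broadcast message carries no link.
    return "broadcast", None
```

Applied to the two example messages in the text, the sketch yields the link A → B for the conversation message and B → A for the retweet message.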
3.2 Topic Modeling for Links

Given a microblog network G = (V, E), where V is the set of users and E is the set
of social links, we want to determine the topic distribution for the communication
represented by each link. Each link has a set of conversation and retweet messages,
each being represented by a bag of words. Our approach is applying Latent Dirichlet
allocation (LDA) [2] to the collection of microblog messages. Two issues must be
resolved. The first is that each microblog message is very short (up to 140 characters)
and thus sparse for topic mining. The second is that each social link may carry
zero or more messages; dealing with each message individually leads to multiple
topic distributions for each link, which are not only noisy due to the word sparsity
of each message but also unrelated to each other due to the separate topic spaces.
To address both issues, we model the set of messages associated with a link as one
aggregated message by taking the union of these messages, and apply LDA to these
messages plus all broadcast messages. Notice that there is only one topic modeling
for the entire corpus, not one topic modeling per link.
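The aggregation step can be illustrated as follows. The sketch assumes that messages have already been tokenized and classified, and it interprets "union" as concatenating the bags of words of all messages on a link; the function name `build_corpus` and the tuple layout are hypothetical.

```python
from collections import defaultdict

def build_corpus(messages):
    """messages: iterable of (kind, link, words), where link is a
    (source, target) pair, or None for broadcast messages.

    Returns (docs, link_of_doc): one aggregated bag of words per link,
    followed by one document per broadcast message (link entry None)."""
    per_link = defaultdict(list)
    broadcasts = []
    for kind, link, words in messages:
        if kind == "broadcast":
            broadcasts.append(list(words))
        else:
            # Assumption: the "union" of messages keeps repeated words.
            per_link[link].extend(words)
    links = sorted(per_link)
    docs = [per_link[link] for link in links] + broadcasts
    link_of_doc = links + [None] * len(broadcasts)
    return docs, link_of_doc
```

A single topic model is then fitted on `docs`, so that all links share one topic space.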
LDA is a generative model that allows sets of observations to be explained by
unobserved variables that explain why some parts of the data are similar. In our case,
observations are words collected into messages and unobserved variables are the
per-message topic distribution and the per-topic word distribution. Each message is
a mixture of a small number of latent topics and each word's creation is attributable
to one of the message's topics. Given T topics, we can write the
probability of the ith word in a given message as
$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j) \tag{1}$$
$P(w)$ can be observed from the collection of messages while $P(w \mid z)$ and $P(z)$ are
hidden and are the target of topic modeling. More formally, let W be the size of
the dictionary (i.e., the number of words) for messages, T be the number of latent
topics, $\theta_e$ be the T-dimensional topic distribution for a social link e, and $\phi_j$ be the
W-dimensional word distribution for a topic j. The generative process of LDA is as
follows:
1. choose $\phi_j \sim \mathrm{Dir}(\beta)$, where $j \in [1, \ldots, T]$
2. choose $\theta_e \sim \mathrm{Dir}(\alpha)$ for each (aggregated) message on the link e
3. for each word $w_i$ that belongs to the link e
   a. choose a topic $z_i \sim \mathrm{Mul}(\theta_e)$
   b. choose a word $w_i \sim \mathrm{Mul}(\phi_{z_i})$
where $\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution and
$\alpha$ is the parameter of the Dirichlet prior on the per-message topic distributions.
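The generative process can be simulated with the Python standard library alone. The following is a minimal sketch, assuming symmetric Dirichlet priors and a fixed number of words per link; the helper names `dirichlet` and `generate` are illustrative only.

```python
import random

def dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet via normalized Gamma variates."""
    g = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    s = sum(g)
    if s == 0.0:
        # A very small concentration can underflow; fall back to a one-hot vector.
        g = [0.0] * dim
        g[rng.randrange(dim)] = 1.0
        s = 1.0
    return [x / s for x in g]

def generate(num_links, words_per_link, T, W, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Step 1: one word distribution phi_j per topic.
    phi = [dirichlet(beta, W, rng) for _ in range(T)]
    corpus = []
    for _ in range(num_links):
        # Step 2: one topic distribution theta_e per link.
        theta = dirichlet(alpha, T, rng)
        doc = []
        for _ in range(words_per_link):
            z = rng.choices(range(T), weights=theta)[0]   # step 3a
            w = rng.choices(range(W), weights=phi[z])[0]  # step 3b
            doc.append((w, z))
        corpus.append(doc)
    return phi, corpus
```

In topic modeling only the words `w` are observed; the topic indicators `z` returned here are exactly what inference has to recover.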
Figure 2 expresses the generative model of topic mining on social links, assuming
that the topic structure, i.e., the topic z, the link-topic distribution $\theta$, and the
topic-word distribution $\phi$, is already known. $\alpha$ and $\beta$ are hyperparameters,
specifying the nature of the priors on $\theta$ and $\phi$. However, the problem we face is the
reverse of this generative process: all words in messages w through social links e,
contactor factors C and relation factors R are observed, as indicated by shaded nodes
in Fig. 2, while the topic structure (i.e., z, $\theta$, $\phi$) is hidden and must be estimated using
the observed variables.
Fig. 2 Bayesian network of topic modeling on social links
3.3 Inference
The Gibbs sampling [8] is widely used to infer the latent variables $\theta$ and $\phi$. This
method sequentially samples all variables for z from its conditional distribution
$P(z_i \mid w, z_{-i}, \alpha, \beta)$ given the current values of all other variables and the data, where
$z_{-i}$ refers to the topic assignments of all other words before sampling word $w_i$. This
conditional distribution is derived as follows:
$$P(z_i \mid w, z_{-i}, \alpha, \beta) = \frac{P(z, w \mid \alpha, \beta)}{P(z_{-i}, w \mid \alpha, \beta)} \tag{2}$$
where
$$P(w, z \mid \alpha, \beta) = P(z \mid \alpha)\, P(w \mid z, \beta) \tag{3}$$
From Euler integral [8, 9],
$$P(w \mid z, \beta) = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma(n_{j,w} + \beta)}{\Gamma(n_{j,\cdot} + W\beta)} \tag{4}$$
$$P(z \mid \alpha) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{|E|} \prod_{e=1}^{|E|} \frac{\prod_{j} \Gamma(n_{e,j} + \alpha)}{\Gamma(n_{e,\cdot} + T\alpha)} \tag{5}$$
where $n_{j,w}$ is the number of times word w has been assigned to topic j in sampling,
$n_{e,j}$ is the number of times topic j is assigned to e in sampling, $n_{e,\cdot}$ is the number
of times all topics are assigned to e in sampling, and $n_{j,\cdot}$ is the number of times
a word is assigned to topic j in sampling. Substituting these into the equation for
$P(z_i \mid w, z_{-i}, \alpha, \beta)$ and continuously conducting Gibbs sampling, we finally get the
topic distribution $\theta_e$ for a social link e and the word distribution $\phi_j$ for each topic j
computed as follows:
$$\theta_e(j) = \frac{n_{e,j} + \alpha}{n_{e,\cdot} + T\alpha} \tag{6}$$
$$\phi_j(w) = \frac{n_{j,w} + \beta}{n_{j,\cdot} + W\beta} \tag{7}$$
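As an illustration of the inference procedure, the sketch below implements a collapsed Gibbs sampler in plain Python. It assumes the commonly used form of the conditional, in which the probability of assigning topic j to word $w_i$ is proportional to $(n_{j,w_i} + \beta)/(n_{j,\cdot} + W\beta) \cdot (n_{e,j} + \alpha)$ with the current word excluded from the counts; the function name `gibbs_lda`, the iteration count and the random initialization are assumptions, not details given in the chapter.

```python
import random

def gibbs_lda(docs, T, W, alpha, beta, iters=200, seed=0):
    """docs[e]: list of word ids in [0, W) for the aggregated message of link e.
    Returns (theta, phi) estimated as in Eqs. (6) and (7)."""
    rng = random.Random(seed)
    n_ej = [[0] * T for _ in docs]       # topic counts per link
    n_jw = [[0] * W for _ in range(T)]   # word counts per topic
    n_j = [0] * T                        # total word count per topic

    def bump(e, j, w, delta):
        n_ej[e][j] += delta
        n_jw[j][w] += delta
        n_j[j] += delta

    # Random initial topic assignment for every word occurrence.
    z = [[rng.randrange(T) for _ in doc] for doc in docs]
    for e, doc in enumerate(docs):
        for i, w in enumerate(doc):
            bump(e, z[e][i], w, +1)

    for _ in range(iters):
        for e, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(e, z[e][i], w, -1)  # exclude word i (the z_{-i} counts)
                weights = [
                    (n_jw[j][w] + beta) / (n_j[j] + W * beta) * (n_ej[e][j] + alpha)
                    for j in range(T)
                ]
                z[e][i] = rng.choices(range(T), weights=weights)[0]
                bump(e, z[e][i], w, +1)

    theta = [[(n_ej[e][j] + alpha) / (len(doc) + T * alpha) for j in range(T)]
             for e, doc in enumerate(docs)]                   # Eq. (6)
    phi = [[(n_jw[j][w] + beta) / (n_j[j] + W * beta) for w in range(W)]
           for j in range(T)]                                 # Eq. (7)
    return theta, phi
```

The factor $1/(n_{e,\cdot} + T\alpha)$ is constant across topics for a given word, so it is dropped from the sampling weights.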
The topic distribution for links and the target message used in the next section are
summarized as follows:
Topic distribution for links: the topic distribution $\theta_e$ of e is already learnt by Eq. (6).
A high value of $\theta_e(j)$ for a topic j indicates the existence of a tunnel for the
communication on the topic j through the link e.
Topic distribution for the target message: For the target message m, the topic
distribution of m is computed as follows. For each word $w_i$ occurring in m, we
determine the most likely topic for $w_i$, i.e., the topic j such that $\phi_j(w_i)$ is maximal,
and consider this as one vote for the topic j. Let $v_{m,j}$ denote the total number of votes
for the topic j and let $\theta_{m,j} = v_{m,j} / \sum_{i=1}^{T} v_{m,i}$, $1 \le j \le T$. The topic distribution
of m is defined by $\theta_m = \{\theta_{m,1}, \ldots, \theta_{m,T}\}$.
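The voting scheme for the target message is simple enough to sketch directly. The function name `message_topics` and the nested-list layout of `phi` are hypothetical, and the sketch assumes a non-empty message whose words all appear in the dictionary.

```python
def message_topics(words, phi):
    """words: word ids of the target message; phi[j][w] is the probability
    of word w under topic j. Returns the vote-based distribution theta_m."""
    T = len(phi)
    votes = [0] * T
    for w in words:
        # Each word votes for the topic under which it is most probable.
        best = max(range(T), key=lambda j: phi[j][w])
        votes[best] += 1
    total = sum(votes)  # equals len(words); assumed to be non-zero
    return [v / total for v in votes]
```

For instance, with two topics and `phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]`, the message `[0, 2, 2, 1]` receives one vote for the first topic and three for the second, giving the distribution (0.25, 0.75).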
4 Topic Aware Information Propagation
The objective of information propagation is to identify k seed users to propagate
a given target message with the goal of maximizing the number of users reached.
To achieve this goal, seed users should likely publish the target message and the
recipients of the message should likely forward the message, and so on. The likelihood
of publishing or forwarding a message depends on whether there is a communication
tunnel on the topics of the target message between the sender and the recipient. We can
divide this problem into two sub-problems. The first sub-problem extracts the topic
structure for links, which was addressed in the previous section. The second sub-
problem will identify k seed users for propagating the target message, given the topic
structure for links. We present three algorithms for the second sub-problem.
4.1 Greedy Algorithm: First Cut Solution
A first cut solution is to ignore all published messages and the target message and
to select seed users solely based on the topological structure of the microblog network.
This is exactly the traditional influence maximization, for which a general greedy
algorithm exists. This algorithm takes the graph structure of the microblog network
and the number k as input and returns k seed users as output. Algorithm 1 below,
GeneralGreedy, is an implementation of this algorithm from [3, 11]. It uses two internal
parameters, the propagation probability P for all links e and the Monte Carlo
random process of propagation starting from a set of users S, MC(S, P). MC(S, P)
returns the estimated number of users reached by those in S. At each iteration, the
algorithm greedily selects the next seed user v such that $MC(S \cup \{v\}, P)$ is maximized.
Algorithm 1: GeneralGreedy(G, k)
  uniformly set propagation probability $P_e$ for all social links;
  $P = \bigcup_{e=1}^{|E|} \{P_e\}$;
  initialize $S = \emptyset$;
  for i = 1 to k do
      select $v = \arg\max_{u \in V \setminus S} MC(S \cup \{u\}, P)$;
      $S = S \cup \{v\}$;
  end
  return S
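A runnable sketch of the greedy selection is given below. The chapter does not state which diffusion model MC(S, P) simulates; the sketch assumes the Independent Cascade Model mentioned in Sect. 2.3 and counts the seed users themselves as reached. The names `mc_spread` and `general_greedy` and the default of 200 simulation runs are illustrative choices.

```python
import random

def mc_spread(graph, prob, seeds, runs, rng):
    """Monte Carlo estimate of the number of users reached from `seeds`.
    graph[u] lists the out-neighbours of u; prob[(u, v)] is P_e for u -> v.
    Assumption: Independent Cascade, each link gets one activation attempt."""
    total = 0
    for _ in range(runs):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in graph.get(u, ()):
                    if v not in active and rng.random() < prob[(u, v)]:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / runs

def general_greedy(graph, prob, users, k, runs=200, seed=0):
    """Pick k seeds, each time adding the user with the largest estimated spread."""
    rng = random.Random(seed)
    S = []
    for _ in range(k):
        candidates = [u for u in users if u not in S]
        best = max(candidates,
                   key=lambda u: mc_spread(graph, prob, S + [u], runs, rng))
        S.append(best)
    return S
```

The same routine serves any per-link probabilities: passing a uniform `prob` gives the first cut solution, while passing topic-aware probabilities changes only the input, not the selection loop.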
Not surprisingly, the GeneralGreedy algorithm does not perform well for
information propagation because it uses the uniform propagation probability Pe for
all links e, which ignores the topic relevance of a link to the target message. Next, we
present two topic-aware algorithms that take into account this topic relevance to infer
the propagation probability Pe . These algorithms differ in the way of quantifying the
topic relevance of a link.
4.2 Filtered Tunnel Algorithm
Let m denote the target message, e denote a link, and $P_e$ denote the propagation
probability of m through the link e. Recall that $\theta_e$ denotes the topic distribution of
e and $\theta_m$ denotes the topic distribution of m. To determine $P_e$ for a link e, we need
to determine what topics of e are relevant to m. One way is to cut off insignificant
topics in the topic distribution $\theta_e$ by a threshold, but this is not robust because it is
difficult to know the proper threshold, which could vary from link to link. Our first
topic-aware algorithm deals with this issue by classifying the topics for e as tunneled
topics and blocked topics. The former refers to the topics that have large probabilities
in $\theta_e$ to allow information flow on such topics, whereas the latter refers to the topics
with insufficient probabilities to allow information flow. The intrinsic motivation for
this classification is the observation that the distribution $\theta_e$ usually consists of a small
number of major topics that have much higher probabilities than other topics. Topics
with high probability form the tunnels for information transfer while topics with low
probability remain blocked.
To identify these two groups of topics, we apply the 2-means clustering method to
the T data points represented by $\theta_e$, where the jth point represents the probability on
the topic j. The result is one cluster for tunneled topics and one cluster for blocked
topics, represented by the indicator vector $I_e$:

$$I_e(j) = \begin{cases} 1 & \text{if } j \text{ is a tunneled topic} \\ 0 & \text{if } j \text{ is a blocked topic} \end{cases} \tag{8}$$
Notice that this classification of topics is on a per-link basis and is independent of
the target message.
For a given target message m, we define the propagation probability for a link e
as follows:
$$P_e(m) = \sum_{j=1}^{T} \theta_{m,j}\, \theta_e(j)\, I_e(j) \tag{9}$$
In words, $P_e(m)$ is the inner product of the topic distribution of m and the topic
distribution of e, except that only the topics j with $I_e(j) = 1$ have effect.
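The classification and the resulting probability can be sketched as follows. The chapter does not specify how the 2-means clustering is initialized; this sketch assumes Lloyd's algorithm on one-dimensional data, started from the smallest and largest probability, and treats every topic as tunneled when all probabilities are equal. The function names are hypothetical.

```python
def two_means(values, iters=100):
    """1-D 2-means clustering. Returns a 0/1 indicator per value,
    with 1 marking membership in the cluster with the higher centroid."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [1] * len(values)  # assumption: no separation, nothing is blocked
    labels = []
    for _ in range(iters):
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        high = [v for v, l in zip(values, labels) if l]
        low = [v for v, l in zip(values, labels) if not l]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels

def propagation_probability(theta_m, theta_e, filtered=True):
    """Inner product of theta_m and theta_e, restricted to tunneled topics
    when filtered is True and taken over all topics otherwise."""
    indicator = two_means(theta_e) if filtered else [1] * len(theta_e)
    return sum(m * e * i for m, e, i in zip(theta_m, theta_e, indicator))
```

For a link with distribution (0.6, 0.3, 0.05, 0.05), only the first topic ends up in the high cluster, so a target message with distribution (0.5, 0.5, 0, 0) obtains 0.5 × 0.6 = 0.3 with filtering and 0.3 + 0.15 = 0.45 when every topic is allowed to contribute.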
The seed user selection based on the above propagation probability, called filtered
tunnel algorithm and denoted FilteredTunnel, is given in Algorithm 2. It takes a
microblog network G, a positive number k, and the target message m as the input,
and returns a set of k seed users as the output. The algorithm is an adaptation of
the GeneralGreedy algorithm but uses the propagation probability Pe (m) defined
in Eq. (9).
Algorithm 2: FilteredTunnel(G, k, m)
  foreach social link e do
      compute $P_e(m)$ as in Eq. (9);
  end
  $P = \bigcup_{e=1}^{|E|} \{P_e(m)\}$;
  initialize $S = \emptyset$;
  for i = 1 to k do
      select $v = \arg\max_{u \in V \setminus S} MC(S \cup \{u\}, P)$;
      $S = S \cup \{v\}$;
  end
  return S
4.3 Unfiltered Tunnel Algorithm
The filtered tunnel algorithm adopts the "all or nothing" strategy for each topic on a
link in order to focus on major topics. Sometimes two users communicate on a broad
range of topics where there is no clear cut between tunneled topics and blocked
topics. In such cases, a target message covering many topics could still be exchanged
by users. This situation calls for the second approach, the unfiltered tunnel algorithm,
where topics with small probability are also considered for topic relevance. We can
model this approach conveniently by making all topics tunneled topics, that is,
$I_e(j) = 1$ for every topic j in Eq. (9), so the propagation probability $P_e(m)$ in
Eq. (9) degenerates into the usual inner product of the topic distribution $\theta_m$ of the
target message m and the topic distribution $\theta_e$ of the link e:
$$P_e(m) = \sum_{j=1}^{T} \theta_{m,j}\, \theta_e(j) \tag{10}$$
There are two cases for having a large propagation probability $P_e(m)$: either $\theta_m$
and $\theta_e$ have high probability in a few common topics, or $\theta_m$ and $\theta_e$ have small
probability in many common topics. The former corresponds to communication on
focused topics and the latter corresponds to communication on diversified topics.
With $P_e(m)$ being defined by Eq. (10), the unfiltered tunnel algorithm remains the
same as Algorithm 2.
5 Experimental Evaluation
The ideal way of evaluating the propagation of a target message is to give the message
to the selected seed users in a live microblogging service and trace the propagation
of the message. Unfortunately, this kind of evaluation requires full control over the
microblogging service, which is possible only for the owner of the service.
Without such control, we resort to a publicly
available Twitter microblog dataset (footnote 1) to approximate this evaluation. This dataset
has over 9 million microblogs covering domains such as news, music, entertainment,
technology, and web. We performed the following preprocessing: we removed all
users who have no social links, together with their broadcast messages, because such users do
not contribute to information flow; for the remaining users, we took a random sample
of their messages because topic modeling does not need all the data and running
topic modeling on the whole collection is too slow; and we removed stop words and
URLs from all messages. The final dataset contains 323,481 messages (10 % broadcast,
66 % conversation, and 24 % retweet), 10,892 Twitter users, and 63,454 links
corresponding to followee/follower relations.
5.1 Experimental Design
We performed two experiments. In the first experiment, we randomly picked 50
target messages from relatively long retweet paths and withheld them from topic
modeling and seed user selection. We study the hit_ratio, defined as the fraction of
seed users who have forwarded the exact target message according to the data set.
While hit_ratio does
measure the users who propagated the given target message, it does not consider the
possibility of propagating any other messages, even if such messages are similar to
the target message. This exact syntax based measure could be too stringent because
often a message is propagated because of its content, not because of its exact syntax.
For example, if a user forwards the message "@B, Does anyone know Canucks
standing", the user will likely also forward the message "@B, Is Canucks in first or
second place" if this message is presented instead. But the syntax based measure
does not consider this flexibility.
1 http://user.informatik.uni-goettingen.de/~txu/cuckoo/dataset.html.
In the second experiment, we relax the exact syntax requirement and consider two
messages to be equivalent (with respect to propagation) if they are similar in topics.
For two messages $m_1$ and $m_2$ with the topic distributions $\theta_{m_i} = \{\theta_{m_i,1}, \ldots, \theta_{m_i,T}\}$,
i = 1, 2, the topic equivalence of $m_1$ and $m_2$ is defined as
$$sim(m_1, m_2) = \sum_{j=1}^{T} \theta_{m_1,j}\, \theta_{m_2,j} \tag{11}$$
$m_1$ and $m_2$ are topic equivalent if $sim(m_1, m_2)$ exceeds some specified threshold.
We consider the following three metrics based on topic equivalence. The
publish_ratio is the fraction of the messages published by the seed users that are
topic equivalent to the target message:
$$publish\_ratio = \frac{a}{b} \tag{12}$$
where a represents the number of topic equivalent messages that were published by
seed users, and b represents the number of all messages that were published by seed
users. A higher publish_ratio value means that seed users are more likely to publish
the given target message.
The spread_ratio is the fraction of forwardings (i.e., retweet messages) originating
from the seed users that are topic equivalent to the target message:
$$spread\_ratio = \frac{c}{d} \tag{13}$$
where c represents the number of forwardings of the topic equivalent messages published
by seed users, and d represents the number of forwardings of all messages published
by seed users. A higher spread_ratio means that the target message is more likely to
be propagated if it is published by a seed user.
The reach_num is the number of users reached through such forwarding. A larger
value in these metrics means that a topic equivalent message is more likely to be
published and propagated by the selected seed users. We randomly picked 100
target messages for this experiment.
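The first two metrics can be computed as in the sketch below. The data layout (one topic distribution and one forwarding count per message published by a seed user), the function names, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def sim(theta1, theta2):
    """Topic equivalence score of Eq. (11): inner product of two distributions."""
    return sum(a * b for a, b in zip(theta1, theta2))

def ratios(target, published, threshold):
    """published: list of (theta, n_forwards) for messages of the seed users.
    Returns (publish_ratio, spread_ratio) as in Eqs. (12) and (13)."""
    relevant = [(t, f) for t, f in published if sim(target, t) > threshold]
    a, b = len(relevant), len(published)
    c = sum(f for _, f in relevant)
    d = sum(f for _, f in published)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    return (a / b if b else 0.0, c / d if d else 0.0)
```

With a target distribution (1, 0), a threshold of 0.5 and three published messages with distributions (0.9, 0.1), (0.2, 0.8) and (0.6, 0.4) forwarded 3, 1 and 0 times, two of the three messages are topic equivalent, so publish_ratio is 2/3 and spread_ratio is 3/4.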
We evaluated three algorithms for information propagation.
• GeneralGreedy, denoted GG: This is the traditional greedy algorithm in
  Algorithm 1, which was shown to outperform distance based, degree based, and
  random selection methods [3]. We set $P_e$ to 0.01, 0.02, 0.05, 0.1 as in [3]. GG0.01,
  GG0.02, GG0.05, and GG0.1 denote GG with these parameters.
• FilteredTunnel, denoted FT: This is Algorithm 2. This algorithm uses the
  hyperparameters $\alpha$ and $\beta$ for topic mining. We set $\alpha = 1$ and $\beta = 0.01$ as in [9], and set
  T = 50 (the number of topics).
• UnfilteredTunnel, denoted UT: This is the unfiltered tunnel algorithm described
  in Sect. 4.3. Like in FT, we set $\alpha = 1$, $\beta = 0.01$, and T = 50.
The number of seed users k is set to 10 and 50 for all three algorithms. We adopted
a CELF optimization package (see footnote 2) for the Monte Carlo random process
MC(S, P) in all three algorithms. This optimization reduces the running time but
does not alter the result. All code was written in Matlab and Java. The experiments
were run on a PC with a 3.10 GHz quad-core CPU and 8 GB of memory, running
Ubuntu Linux 9.10.
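To make the GG baseline concrete, the sketch below shows a plain greedy seed selection with Monte Carlo estimation of the spread under an independent cascade with a uniform propagation probability. It is an illustration under stated assumptions, not the code used in the experiments: it omits the CELF optimization, and the function names, the adjacency-dictionary graph format, and the toy graph are hypothetical.

```python
import random


def simulate_ic(graph, seeds, p, rng):
    """One independent-cascade run; returns the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, ()):
                # A newly active node gets one chance to activate each out-neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def expected_spread(graph, seeds, p, runs, rng):
    """Monte Carlo estimate of the expected number of activated nodes."""
    return sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs)) / runs


def general_greedy(graph, k, p, runs=200, seed=0):
    """Add, k times, the node with the largest estimated spread when joined
    to the seeds chosen so far (no CELF lazy evaluation in this sketch)."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    chosen = []
    for _ in range(min(k, len(nodes))):
        best = max(
            sorted(nodes - set(chosen)),
            key=lambda v: expected_spread(graph, chosen + [v], p, runs, rng),
        )
        chosen.append(best)
    return chosen


# Toy directed graph: a hub with five followers and one separate edge.
toy = {0: [1, 2, 3, 4, 5], 6: [7]}
print(general_greedy(toy, k=2, p=1.0, runs=1))  # [0, 6]: with p = 1 every run is deterministic
```

Because the selection depends only on the graph and on p, the same seed set is returned for every target message, which is the behavior of GG discussed in Sect. 5.3.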
In the following sections, we first present the topics on social links to demonstrate
the effectiveness of topic modeling, and then show the evaluation results based on the
above two experiments respectively. All these focus on the macro level performance
of our topic-aware methods. Finally, we give some micro level case studies to further
exhibit the superiority.
5.2 Topic Modeling on Social Links
The first step in FT and UT is to model the topic distribution using conversation
and retweet messages on social links, as described in Sect. 3. Table 2 shows 6 out
of the 50 topics extracted. Each topic has a distribution of keywords (here the top five
keywords are shown), which explains the topic's latent semantics. For example, topic
19 is about music and online media; topic 32 is about the microblog network; topic 39 is
about movies and TV series; topic 43 is about events of time; topic 46 is about games;
topic 50 is about Apple's products and other web services. We find that the learnt topics
are meaningful and use their distributions to estimate the propagation probabilities.
5.3 Evaluation Based on Exact Messages
Table 3 shows the hit_ratio of GG, FT and UT (averaged over all target messages).
Understandably, hit_ratio is rather low for all algorithms because only the users who
published the exact target message are considered in this metric. Despite this, there
is a notable difference among the three algorithms. GG is very sensitive to the setting
of the propagation probability Pe . In fact, GG always selects the same set of seed
2 http://www.cs.ubc.ca/~goyal/code-release.php.
Table 2 Representative topics on social links (top five keywords per topic)

Topic 19    Topic 32    Topic 39    Topic 43    Topic 46    Topic 50
video       twitter     watch       morning     game        google
music       followers   tv          today       team        iphone
check       tweet       movie       going       play        app
song        share       show        snow        super       apple
listening   list        film        cold        fan         web
Table 3 Hit_ratio of GG, FT, and UT (%)

          GG0.01   GG0.02   GG0.05   GG0.1   FT    UT
k = 10    0.6      0        0        0       0.8   0.2
k = 50    0.36     0.2      0.12     0       0.6   0.28
users for any target message because it considers only the network structure, not
the content of messages. For the small propagation probability Pe = 0.01, GG0.01
tends to select central users in a dense community as seed users; such users usually
have a higher degree and are thus more likely to publish the target message. This
explains the higher hit_ratio. As the propagation probability increases, GG tends to
select seed users who bridge different communities because of the increased
reachability, but such users are actually less influential because the number of
forwardings is very low. In contrast, UT and FT select seed users based on the topics
of the target message. Such users are likely to publish the target message. FT performs
better (i.e., has a higher hit_ratio) than UT because of its focus on major topics.
See the further discussion of this point below.
5.4 Evaluation Based on Topic Equivalent Messages
Figure 3 shows publish_ratio, spread_ratio and reach_num (from left to right) of GG,
FT and UT. The upper row is for k = 10 seed users and the bottom row is for k = 50
seed users. The three colors represent the three settings of the threshold for topic
equivalence in Eq. (11).
FT has significantly higher publish_ratio, spread_ratio, and reach_num than GG.
This improvement comes from a better selection of seed users by considering the
relevance of links to the target message. In particular, for a given target message,
FT considers not only the link connection, but also whether similar messages were
previously propagated through such links. As such, FT tends to select those users
who are likely to publish the target message (i.e., a high publish_ratio) and have a
network of users who are likely to forward such messages (i.e., a high spread_ratio).
Consequently, the target message can reach more users (i.e., a high reach_num). In
Fig. 3 Publish_ratio, spread_ratio, and reach_num of GG, FT and UT. a k = 10. b k = 50
Fig. 4 Comparison of FT and GG (upper)/Comparison of FT and UT (bottom). k = 10
[Scatter plots, one point per target message. Vertical axes: publish_ratio, spread_ratio, and reach_num of FT. Horizontal axes: the same metrics for GG0.01 (upper row) and UT (bottom row). Each panel shows the diagonal and three series with legend values 0.1, 0.05, and 0.02.]
other words, the seed users selected by FT are influential in that their friends tend to
forward their messages, and so do their friends' friends.
For a closer examination, Fig. 4 shows the comparison of FT and GG0.01 at
the individual target message level for the case of k = 10 seed users. For each
target message, there is a point (x, y) where y represents the metric for FT and x
represents the metric for GG0.01. A point above the diagonal line y = x means that
FT outperforms GG0.01 by having a higher publish_ratio, a higher spread_ratio,
and a higher reach_num. For nearly all target messages considered, FT outperforms
GG0.01 through a higher value in all three metrics. This suggests that FT selects
more influential seed users than GG0.01. Another study, which is not shown here,
showed that FT outperforms UT in these metrics. One reason is that UT keeps many
minor topics that are insufficient to trigger publishing or forwarding of the target
message. This study suggests that the focus on major topics in FT is an effective
strategy.
5.5 Differences in Seed Users
We also studied the actual seed users selected. For discussion purposes, we consider
the single topic target message m1 containing the words for topic 50, and the mixed
topic target message m2 containing all of the words from topics 46 and 50. In general,
the seed users selected by GG are central in dense parts of the network but may not be
influential in the topics of the target message, in terms of the likelihood of published
messages being forwarded by others, whereas the seed users selected by FT are more
influential. The seed users selected by UT tend to be a mixture of those selected by
GG and those selected by FT because UT not only considers topic relevance but also
adds a low propagation probability to each link.
Figure 5 shows the topic distribution of the messages published by seed users.
For m1 (on the left), which is on topic 50, the messages published by the seed users
selected by FT have the highest probability for topic 50, followed by the messages of
the seed users selected by UT, followed by the messages published by the seed users
selected by GG0.01. For m2 (on the right), which is on topics 46 and 50,
the messages published by the seed users selected by FT have higher probabilities
in both of these topics than those selected by UT and GG0.01.
Table 4 further shows the intersection sizes of the seed user sets when propagating
target message m1. The greedy algorithm under different settings shares more seed
users with itself than with our topic-aware methods. This demonstrates that FT and UT
tend to choose different seed users according to the content of the target message,
whereas the greedy algorithm depends solely on the social structure.
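A pairwise overlap table of the kind reported in Table 4 can be computed with a few lines of Python. This is only a sketch: the function name is hypothetical and the user ids below are made up, not the seed users from the experiments.

```python
from itertools import combinations_with_replacement


def intersection_sizes(seed_sets):
    """Map each unordered pair of method names to the size of the overlap of
    their seed sets (a method paired with itself gives its own set size)."""
    sizes = {}
    for a, b in combinations_with_replacement(list(seed_sets), 2):
        sizes[(a, b)] = len(set(seed_sets[a]) & set(seed_sets[b]))
    return sizes


# Hypothetical user ids for three methods.
seeds = {
    "GG0.01": [1, 2, 3, 4],
    "GG0.02": [3, 4, 5, 6],
    "FT": [7, 8, 9, 1],
}
for pair, size in intersection_sizes(seeds).items():
    print(pair, size)
```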
Fig. 5 Topic distribution of published messages of seed users for m1 (left) and m2 (right). k = 10
[Two plots of Probability (vertical axis) against Topic_# from 0 to 50 (horizontal axis), each with series for GG0.01, FT, and UT.]
Table 4 Intersection size of seed users selected by different methods. k = 10

          GG0.01   GG0.02   GG0.05   GG0.1   FT   UT
GG0.01    10
GG0.02    4        10
GG0.05    2        2        10
GG0.1     0        0        4        10
FT        0        1        0        0       10
UT        2        3        1        0       3    10
Table 5 Keywords occurrence in the published messages of top users selected by FT and GG

Propagating m1 as target message
User   Keywords occurrence                                             Total
UGG    {google = 20, iphone = 1, app = 2, web = 4, apps = 4}           31
UFT    {google = 84, iphone = 47, app = 51, web = 89, apps = 16}       287

Propagating m2 as target message
User   Keywords occurrence                                             Total
UGG    {google = 20, iphone = 1, app = 2, web = 4, apps = 4}           78
       {game = 21, team = 7, play = 15, football = 0, fan = 4}
UFT    {google = 84, iphone = 47, app = 51, web = 89, apps = 16}       325
       {game = 8, team = 22, play = 5, football = 1, fan = 2}
5.6 Case Study
The following is a case study on the detailed statistics of the top user selected by
GG0.01, GG0.02, and GG0.05 (user id 14703185, denoted as UGG) and the top user
selected by FT (user id 9453872, denoted as UFT) for propagating the target messages
m1 and m2 mentioned in the previous section. Both users have published 3,200
messages; UGG has 173 followers while UFT has 138 followers. We use keywords
occurrence (the number of times that the keywords occurred in the published
messages) to measure whether a user is interested in the topic and influential enough
to place a target message. As shown in Table 5, when propagating m1, the keywords
occurrence of UFT is significantly higher than that of UGG; when propagating m2,
although the keywords occurrences for topic 46 of the two users are similar, the
occurrence for topic 50 of UFT is significantly higher than that of UGG. Both results
demonstrate that UFT is the more suitable choice.
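The keywords occurrence measure can be sketched as a simple token count. The tokenization rule (lowercased alphanumeric tokens, exact match against the keyword list) and the function name are assumptions of this example; the text does not specify how the messages were tokenized. The keyword list is the topic 50 list from Table 5, and the two messages are invented.

```python
import re
from collections import Counter


def keyword_occurrence(messages, keywords):
    """Count how often each keyword appears as a whole token in the messages."""
    wanted = set(keywords)
    counts = Counter({k: 0 for k in keywords})
    for text in messages:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return counts


topic_50_keywords = ["google", "iphone", "app", "web", "apps"]
# Invented messages, used only to exercise the function.
sample = ["Google Maps app for iPhone", "New web apps from Google"]

counts = keyword_occurrence(sample, topic_50_keywords)
print(dict(counts))          # per-keyword counts, as in the braces of Table 5
print(sum(counts.values()))  # the "Total" column
```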
Next, we randomly pick five followers of each top user to check whether they are
influenced and spread target messages. We verify the situation in propagating m1.
Their representative messages related to topic 50 are listed in Table 6. For UGG,
only one user (id 33256817) has ever forwarded UGG's messages related to topic 50,
while the other four have not forwarded such messages before (although out of these
four users, the users with ids 10877652 and 10355192 have forwarded 41 and 21
messages from UGG respectively). For UFT, all five users have forwarded UFT's
messages related to topic 50. This experiment shows that the target messages of the seed users
Table 6 Representative messages forwarded by followers

User id    Message
33256817   RT @9453872 (UFT) iAd: Apple to Launch New Mobile Ad Platform? [RUMOR]. http://bit.ly/byKA8T
           RT @9453872 (UFT) Foursquare, Gowalla and More on a Google Map [Apps]. http://bit.ly/9eDlz2
           RT @9453872 (UFT) iPhone, Firefox, Safari, IE8 Hacked at Pwn2Own Contest. http://bit.ly/d6hGCs
           RT @14703185 (UGG): How to Use Google Analytics on Your Facebook Fan Page. http://short.to/18sxl
19054532   RT @9453872 (UFT): RT @45689230:Are we getting enough out of using Twitter and Facebook? Or is it a waste of time?. http://amplify.com/u/1dmx
1991571    RT @9453872 (UFT): iPhone versus Nexus One smack-down compilation. http://om.ly/duHp
14082108   RT @9453872 (UFT): Onion: Google Responds To Privacy Concerns With Unsettlingly Specific Apology. http://bit.ly/bhfZ6Y
           RT @17525291: RT @9453872 (UFT): If you are not too long, I will wait here for you all my life. Oscar Wilde //an old fav tweet
1855771    RT @9453872 (UFT): Please vote for the #Rochester Institute of Technology in the Google Street View contest. http://digs.by/nOq
Table 7 Average keywords occurrence of seed users

Method    m1 as target message   m2 as target message
GG0.01    25.6                   70.7
GG0.02    86.3                   133.6
GG0.05    48.4                   109.7
GG0.1     19.6                   143.7
FT        499                    455.3
UT        179.5                  203.6
selected by FT are more likely to be spread than those selected by the general greedy
algorithm.
We then conduct a case study on the group performance of all k = 10 seed
users selected by the different methods. We evaluate it by the keywords occurrence
averaged over the 10 seed users and report the results in Table 7. The keywords counted
are the same as in Table 5. The results show that the two topic-aware methods achieve
significantly better performance than the greedy algorithm, further indicating that the
seed users selected by FT and UT are more related to the target messages.
To summarize, our study suggests that the topic-aware FT and UT perform better
than the traditional topic-blind GG for information propagation: they tend to select
the right seed users, as demonstrated by a higher probability of the target message being
published (i.e., higher publish_ratio), a higher probability of being forwarded (i.e.,
Fig. 6 Average running time (min). k = 50
[Plot of Runtime (min) on a logarithmic axis from 10^0 to 10^4 for GG0.01, GG0.02, GG0.05, GG0.1, FT, and UT.]
higher spread_ratio), and more users being reached (i.e., higher reach_num). The
superiority of FT over UT suggests that taking all topics of messages into account
does not necessarily yield better results; in fact, minor topics tend to mislead the
selection of seed users. FT addresses this issue by focusing on major topics.
5.7 Runtime
Although our focus is on selecting more relevant seed users, the topic-aware selection
also helps reduce the running time of the selection process. For FT and UT, topic
modeling took about 2 min in our experiments. This step does not depend on the
choice of the target message and was performed only once for all target messages.
Figure 6 shows the running time (in logarithmic scale) for the selection of seed users.
GG is highly sensitive to the choice of the propagation probability Pe because a
larger probability means that GG will explore a larger part of the microblog network,
e.g., 700 min at Pe = 0.05 and more than 1,200 min at Pe = 0.1. This scale is
consistent with the study in [3, 7] in which GG0.1 took 2,439 min on a network
of 15 K nodes and 32 K unique edges. For the topic-aware UT and FT, the running
time is significantly reduced because propagation probability depends on the match
between the topics of a link and the topics of the target message; consequently, only
the links that are highly relevant to the target message are explored.
6 Conclusion
This paper presented a study on propagating a target message to reach a maximal
number of users in a microblog network. Existing solutions to influence maximization
are not suitable for this problem because they do not factor in the topic relevance of a
link. Our contribution is a novel topic-aware estimation of the propagation probability
of a link with respect to the target message. The novelty is that we do not assume that
the topics of messages or links are given; rather, we assume that such topics are implicit
in the microblogs published by microbloggers. We presented a method to extract
such topics and use the extracted topics to infer the propagation probability for a
target message. To our knowledge, this is the first work on estimating propagation
probability in a topic-aware manner.
Acknowledgments This is the extended version of [28] published in ASONAM'13. Jianling
Sun's work is partially supported by the Ministry of Industry and Information Technology of China
(No. 2010ZX01042-002-003-001). Ke Wang's work is partially funded by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada, and was partially done when he
visited the SA Center for Big Data Research hosted at Renmin University of China. This Center is
partially funded by a Chinese National 111 Project "Attracting International Talents in Data
Engineering and Knowledge Engineering Research".
References
1. Bakshy E, Eckles D, Yan R, Rosenn I (2012) Social influence in social advertising: evidence
from field experiments. In: Proceedings of the 13th ACM conference on electronic commerce
(EC), pp 146–161
2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
3. Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In: KDD,
pp 199–208
4. Gomez-Rodriguez M, Leskovec J, Krause A (2010) Inferring networks of diffusion and
influence. In: KDD, pp 1019–1028
5. Goyal A, Bonchi F, Lakshmanan LVS (2011) A data-based approach to social influence
maximization. Proc VLDB Endow 5(1):73–84
6. Goyal A, Bonchi F, Lakshmanan LVS (2010) Learning influence probabilities in social
networks. In: WSDM, pp 241–250
7. Goyal A, Lu W, Lakshmanan LVS (2011) CELF++: optimizing the greedy algorithm for influence
maximization in social networks. In: WWW (Companion Volume), pp 47–48
8. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci USA
101:5228–5235
9. Griffiths T, Steyvers M (2006) Probabilistic topic models. Latent semantic analysis: a road to
meaning. Lawrence Erlbaum, Hillsdale
10. Kang JH, Lerman K, Plangprasopchok A (2010) Analyzing microblogs with affinity
propagation. In: 1st workshop on social media analytics (SOMA), pp 67–70
11. Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social
network. In: KDD, pp 137–146
12. Leskovec J, Adamic LA, Huberman BA (2006) The dynamics of viral marketing. In: EC'06:
proceedings of the 7th ACM conference on electronic commerce, pp 228–237
13. Leskovec J, Krause A, Guestrin C, Faloutsos C, Van Briesen JM, Glance NS (2007)
Cost-effective outbreak detection in networks. In: KDD, pp 420–429
14. Li Y, Shiu Y (2012) A diffusion mechanism for social advertising over microblogs. Decis
Support Syst 54(1):9–22
15. Li X, Guo L, Zhao YE (2008) Tag-based social interest discovery. In: WWW, pp 675–684
16. Liu L, Tang J, Han J, Jiang M, Yang S (2010) Mining topic-level influence in heterogeneous
networks. In: CIKM, pp 199–208
17. Mathioudakis M, Bonchi F, Castillo C, Gionis A, Ukkonen A (2011) Sparsification of influence
networks. In: KDD, pp 529–537
18. McCallum A, Corrada-Emmanuel A, Wang X (2007) The author-recipient-topic model for
topic and role discovery in social networks: experiments with Enron and academic email.
J Artif Intell Res 30(1):249–272
19. McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social
networks. Annu Rev Sociol 27(1):415–444
20. Nallapati R, McFarland D, Manning C (2011) Topicflow model: unsupervised learning of
topic-specific influences of hyperlinked documents. In: AISTATS, pp 543–551
21. Stelzner MA (2011) 2011 social media marketing industry report
22. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In:
KDD, pp 807–816
23. Tang J, Zhang J, Yu JX, Yang Z, Cai K, Ma R, Zhang L, Su Z (2009) Topic distributions over
links on web. In: ICDM, pp 1010–1015
24. Ugander J, Backstrom L, Marlow C, Kleinberg J (2012) Structural diversity in social contagion.
Proc Natl Acad Sci 109(16):5962–5966
25. Weng J, Lim E, Jiang J, He Q (2010) Twitterrank: finding topic-sensitive influential twitterers.
In: WSDM, pp 261–270
26. Yang J, Counts S (2010) Predicting the speed, scale, and range of information diffusion in
Twitter. In: ICWSM
27. Zhang C, Sun J (2012) Large scale microblog mining using distributed MB-LDA. In: WWW
(Companion Volume), pp 1035–1042
28. Zhang C, Sun J, Wang K (2013) Information propagation in microblog networks. In:
Proceedings of the 2013 IEEE/ACM international conference on advances in social networks
analysis and mining, pp 190–196
Scaling Influence Maximization with Network
Abstractions
Mahsa Maghami and Gita Sukthankar
Abstract Maximizing product adoption within a customer social network under a
constrained advertising budget is an important special case of the general influence
maximization problem. Specialized optimization techniques that account for product
correlations and community effects can outperform network-based techniques that
do not model interactions that arise from marketing multiple products to the same
consumer base. However, it can be infeasible to use exact optimization methods that
utilize expensive matrix operations on larger networks without parallel computa-
tion techniques. In this chapter, we present a hierarchical influence maximization
approach for product marketing that constructs an abstraction hierarchy for scal-
ing optimization techniques to larger networks. An exact solution is computed on
smaller partitions of the network, and a candidate set of influential nodes is propagated
upward to an abstract representation of the original network that maintains distance
information. This process of abstraction, solution, and propagation is repeated until
the resulting abstract network is small enough to be solved exactly.
Keywords Influence maximization · Marketing · Multi-agent social simulation · Optimization
1 Introduction
Advertising in today's market is no longer viewed as a matter of simply convincing a
potential customer to buy the product but of convincing their social network to adopt
a lifestyle choice. It is well known that social ties between users play an important
role in dictating their behavior. One of the ways this can occur is through social
influence, where a behavior or idea can propagate between friends. By considering factors
M. Maghami G. Sukthankar (B)

Department of EECS, University of Central Florida, 4000 Central Florida Blvd,
Orlando, FL 32816, USA
e-mail: gitars@eecs.ucf.edu
M. Maghami
e-mail: mmaghami@cs.ucf.edu

DOI 10.1007/978-3-319-12188-8_11
such as homophily and possible unobserved confounding variables, it is possible to
examine these behavior correlations in a social network statistically [1]. The aim of
viral marketing strategies is to leverage these behavior correlations to create
information cascades in which a large number of customers imitate a much smaller set of
informed people, who are initially convinced by targeted marketing schemes.
Marketing with a limited budget can be viewed as a specialized version of the
influence maximization problem in which the aim is to advertise to the optimal set
of seed nodes to modify opinion in the network, based on a known influence
propagation model. Commonly used propagation models such as the Linear Threshold Model
(LTM) and the Independent Cascade Model (ICM) assume that a node's adoption
probability is conditioned on the opinions of the local network neighborhood [15]. Much
of the previous influence maximization work [8, 10, 25] uses these two interaction
models. Since the original LT and IC models, other generalized models have
been proposed for different domains and specialized applications. For instance, the
decreasing cascade model generalizes models used in the sociology and economics
communities where a behavior spreads in a cascading fashion according to a
probabilistic rule, beginning with a set of nodes that adopt the behavior [15]. In contrast
with the original IC model, in the decreasing cascade model the probability of
influence propagation from an active node is not constant. Similarly, generalized versions
of the linear threshold model have been introduced (e.g., [5, 23]). The simplicity of
these propagation models facilitates theoretical analysis but does not realistically
model specific marketing considerations such as the interactions between
advertisements of multiple products and the effects of community membership on product
adoption.
To address these problems, in previous work [21], we developed a model of
product adoption in social networks that accounts for these factors, along with a
convex optimization formulation for calculating the best marketing strategy assuming a
limited budget. These social factors can emerge from different independent variables
such as ties between friends and neighbors, social status, and the economic
circumstances of the agents. Similar properties have been shown to influence people in other
domains; for instance, Aral and Walker demonstrated the effect of social status on the
influence factor of people on Facebook [3]. We believe that in marketing, all these
factors affect the customer's susceptibility to influence and their ability to influence
others.
Having a more realistic model is particularly useful for overcoming negative
advertisement effects in which the customers refrain from purchasing any products
after being bombarded with mildly derogatory advertisements from multiple
advertisers trying to push their own products. It is critical to model the propagation of
negative influence as well, since it can be stronger and more contagious than
positive influence in affecting people's decisions [7].
The main limitation of this and similar types of optimization approaches is that
they involve matrix inversion, whose cost is slightly less than O(N^3) and is the limiting
factor preventing these algorithms from scaling to larger networks. In this chapter,
we propose a hierarchical influence maximization approach that advocates divide
and conquer: the network is partitioned into multiple smaller networks that can
be solved exactly with optimization techniques, assuming a generalized IC model,
to identify a candidate set of seed nodes. The candidate nodes are used to create a
distance-preserving abstract version of the network that maintains an aggregate
influence model between partitions. Here we demonstrate how this abstraction technique
can be used to scale influence maximization algorithms to larger product adoption
scenarios. Moreover, we present a theorem which shows that the realistic social
system model has a fixed point, validating the strategy of optimizing product adoption
at the steady state.
The chapter is organized as follows. Section 2 provides an overview of the related
work in influence maximization. Section 3 introduces our proposed method,
Hierarchical Influence Maximization (HIM) [22], and summarizes the operation of
the realistic product adoption model introduced by [21]. We evaluate our method
against other influence maximization approaches on both real and synthetic networks in
Sect. 4. This chapter extends our earlier work [22] by introducing new preprocessing
techniques for large networks and presenting a more comprehensive evaluation
of our framework on three larger real-world datasets. We end the chapter with a
discussion of future work.
2 Related Work
Influence maximization can be described as the problem of identifying a small set of
nodes capable of triggering large behavior cascades that spread through the network.
This set of nodes can be discovered using probabilistic approaches (e.g., [2, 17]) or
optimization-based techniques. The approaches in [12, 21] treat influence maximization
as a convex optimization problem; this is feasible for influencing small communities but
does not scale to larger problems. Due to the matrix computation requirements,
these approaches fail when the number of agents in the system increases. Our HIM
algorithm overcomes this deficiency by using a hierarchical approach to factor the
system into smaller matrices.
The HIM model is designed to work on a complex social system where multiple
factors affect the propagation of influence. The simpler case, where the network
topology alone dictates activation spread, has been examined by multiple research
groups seeking to improve on Kempe's early work on greedy approaches for
influence maximization [14]. Examples of possible speedups include innovations such as
the use of a shortest-path based influence cascade model [16] or a lazy-forward
optimization algorithm [19] to reduce the number of evaluations of the influence spread
of nodes. Clever heuristics have been used very successfully to speed computation in
both the LT model (e.g., the PMIA algorithm [8]) and the IC model [25]. In this
chapter, instead of using the original cascade models by Kempe et al., we introduce
a cascade model that accounts for product interactions and community differences
in influence propagation.
Proposed models for investigating how ideas and influence propagate through
the network have been applied to many domains, including technology diffusion,
strategy adoption in game-theoretic settings, and the admission of new products
in the market [14]. For viral marketing, influential nodes can be identified either by
following interaction data or by probabilistic strategies. For example, Hartline et al. [11]
solve a revenue maximization problem to investigate effective marketing strategies.
A targeted marketing method based on the interaction of subgroups in a
social network was presented in [26]. Similar to this work, Bagherjeiran and Parekh
leverage purchasing homophily in social networks [4]. But instead of finding influential
nodes, they base their advertising strategy on the profile information of users. Achieving
deep market penetration can be an important aspect of marketing; Shakarian and Damon
present a viral marketing strategy for selecting the seed nodes that guarantees the spread
of the word to the entire network [24]. Our work differs from related work in that our model
not only considers social factors but also incorporates the negative effect of competing
product advertisements and the correlation between demand for different products.
Our optimization approach is largely unaffected by the additional complexity since
these factors only impact the long-term expected value and not the actual solution
method.
Some researchers (e.g., [6, 20]) focus on the adversarial aspect of competing
against other advertisers. In this case, the assumption is that the advertiser is unable
to unilaterally select nodes. In [5] a natural and mathematically tractable model is pre-
sented for the diffusion of multiple innovations in a network. Our work assumes that
influential nodes are selected in a central fashion and partitioned between advertisers
in an adversarial offline process.
3 Method
Our proposed hierarchical approach operates as follows:

1. Create a local network for each node consisting of its neighbors and neighbors of
neighbors;
2. Model the effect of the outside network by assigning a virtual node for each
boundary node to abstract activity outside the local partition;
3. Update the interaction parameters to the virtual node based on the model and the
network connections;
4. Create a candidate set of influential nodes for each local network using convex
optimization to maximize steady state product adoption;
5. Propagate the candidate set upward to a higher-level of abstraction and link the
abstract nodes based on their shortest paths in the previous network;
6. Repeat the abstraction process until the resulting network is small enough to be
optimized as a single partition; the resulting set of candidate nodes is then targeted
for advertisement. Figure 1 shows a flowchart of the algorithm.
Figure 2 demonstrates the process of the algorithm with three hierarchies. The
selected nodes at each local neighborhood, colored in red, are moved to the upper
hierarchy and reconnected based on shortest path distances from the lower-level.
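The abstraction loop can be illustrated with a simplified Python sketch of steps 1, 4, and 5 for a single level. Several parts are assumptions of this example rather than the chapter's method: the convex optimization of step 4 is replaced by a highest-degree placeholder, the virtual nodes of steps 2 and 3 are omitted, the graph is treated as unweighted and undirected, and all function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source, max_depth=None):
    """Hop distances from source, optionally truncated at max_depth."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if max_depth is not None and dist[u] == max_depth:
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def local_network(adj, node):
    """Step 1: the node, its neighbors, and the neighbors of its neighbors."""
    return set(bfs_distances(adj, node, max_depth=2))


def pick_candidate(adj, members):
    """Placeholder for step 4 (convex optimization in the chapter): take the
    highest-degree member, with ties broken by the smallest id."""
    return max(sorted(members), key=lambda n: len(adj.get(n, ())))


def abstract_level(adj):
    """One level: select candidates, then link them by shortest-path distance."""
    candidates = sorted({pick_candidate(adj, local_network(adj, n)) for n in adj})
    links = {}
    for i, u in enumerate(candidates):
        dist = bfs_distances(adj, u)
        for v in candidates[i + 1:]:
            if v in dist:
                links[(u, v)] = dist[v]  # step 5: distance in the lower level
    return candidates, links


# Toy undirected graph: two stars (hubs 0 and 10) joined by the path 0-4-5-6-10.
toy = {
    0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0],
    4: [0, 5], 5: [4, 6], 6: [5, 10],
    10: [6, 11, 12, 13], 11: [10], 12: [10], 13: [10],
}
print(abstract_level(toy))  # ([0, 10], {(0, 10): 4})
```

Repeating `abstract_level` on the abstract graph, with link distances taken into account, would correspond to step 6; that loop and the budget-based stopping rule are left out of this sketch.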
Fig. 1 The flowchart for our algorithm, Hierarchical Influence Maximization (HIM)
[Flowchart stages: Initial Net, Network Division, Influential Node Identification, Node Pruning, Network Abstraction.]
Fig. 2 At each hierarchical level (H_i, shown for H1, H2, and H3) local neighborhoods are created
and virtual nodes (black) are generated. By using an optimization technique the influential nodes
(red) are selected. Nodes that have been selected at least once as an influential node are transferred
to the next level of the hierarchy. At the higher levels, the connection between selected nodes is
defined using the shortest path distance in the original network. The process is repeated until the
final set of influential nodes is smaller than the total advertising budget
The same process is repeated at the next hierarchy to select more influential nodes.
The procedure terminates at the last hierarchy when the number of influential nodes
finally is smaller than the advertising budget.
3.1 Market Model
To explore the efficiency of the proposed hierarchical influence maximization (HIM)
method in business marketing, we have used the multi-agent system model presented
in [21] to simulate a social system of potential customers. We have slightly changed
the definition of some parameters in this model to make a more sensible model with
generalized capabilities.
In this model, the population of N agents, represented by the set $A = \{a_1, \ldots, a_N\}$,
consists of two types of agents ($A = A_R \cup A_P$), named Regular and Product agents
respectively. The Regular agents are the potential customers in the market who will
occasionally change their attitudes on purchasing products based on the influence
they receive either from other neighbors or from the Product agents who represent
salespeople offering one specific product.
Regular agents belong to a connected social network where the directed weighted
links in this network possess a history of past interactions among the agents. This
social network is modeled by an adjacency matrix, E, where $e_{ij} = 1$ is the weight
of a directed edge from agent $a_i$ to agent $a_j$, and the in-degree and out-degree of
agent $a_i$ ($d_i^{in}$ and $d_i^{out}$) are the sums of all incoming and outgoing edge
weights, respectively.

In this model a vector $X_i$ is assigned to each agent, both Regular and Product
agents, representing the attitude or desire of the agent toward all of the products in the
market. Each element of this vector, $x_{ip}$, is a random variable in the $[-1, 1]$ interval
that indicates the desire of agent $a_i$ to buy an item or consume a specific product, p.
In the social simulation, each agent interacts with another agent in a pair-wise
fashion that is modeled as a Poisson process with rate 1, independent of all other
agents. By assuming a Poisson process of interaction, we are claiming that there is at
most one interaction at any given time. Here, the probability of interaction between
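agents $a_i$ and $a_j$ is denoted by $p_{ij}$ and is defined as the fraction of the connection weight
between these agents over the total connections that agent i makes with the other
agents. Therefore,
\[
p_{ij} =
\begin{cases}
\dfrac{e_{ij}}{d_i^{out}} & i, j \in A_R \\[2ex]
\dfrac{u_{ji}}{\mathit{Threshold}} & i \in A_R,\ j \in A_P \\[2ex]
0 & \text{otherwise}
\end{cases}
\tag{1}
\]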
where the Threshold parameter is the total number of links that a Product agent can
make with Regular agents. The bounds on Threshold are a natural consequence of
the limited budget of companies in advertising their products. The $u_{ji}$ parameter is
an indicator marking whether Product agent j is connected to Regular agent i.
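As an illustration of Eq. (1), the following minimal Python sketch computes the interaction probabilities for a toy network. The function name, the list-of-lists layout, and the example weights are assumptions made for this sketch; they are not taken from the chapter.

```python
def interaction_probabilities(E, U, threshold):
    """Return p[i][j] following Eq. (1).

    E: R x R list of lists; E[i][j] is the weight of the edge from Regular agent i to j.
    U: P x R list of lists; U[j][i] is 1 if Product agent j is linked to Regular agent i.
    Columns 0..R-1 of the result refer to Regular agents, columns R..R+P-1 to Product agents.
    (Hypothetical data layout chosen for this sketch.)
    """
    R, P = len(E), len(U)
    p = [[0.0] * (R + P) for _ in range(R)]
    for i in range(R):
        d_out = sum(E[i])  # weighted out-degree of Regular agent i
        for j in range(R):
            if d_out > 0:
                p[i][j] = E[i][j] / d_out  # first case of Eq. (1)
        for j in range(P):
            p[i][R + j] = U[j][i] / threshold  # second case of Eq. (1)
    return p


# Toy example: three Regular agents, one Product agent linked to agents 0 and 2.
E = [[0, 2, 2], [1, 0, 0], [0, 3, 0]]
U = [[1, 0, 1]]
p = interaction_probabilities(E, U, threshold=2)
```

Pairs without an edge or a product link fall into the "otherwise" case and keep probability 0.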
At each interaction there is a chance for agents to influence each other and change
their desire vector for purchasing or consuming a product. During these interactions
the Product agents never change their attitude and maintain a fixed desire vector of +1
toward themselves and -1 toward the other advertising companies. The probability
that agent i is susceptible to agent j is denoted as $\alpha_{ij}$ and calculated as:
\[
\alpha_{ij} =
\begin{cases}
\dfrac{e_{ji}}{d_i^{in}} & i, j \in A_R \\[2ex]
\text{cte} & i \in A_R,\ j \in A_P
\end{cases}
\tag{2}
\]
The other important parameter in the agent influence process is $\epsilon_{ij}$, which
determines how much agent j will influence agent i. This parameter indicates the
role of social factors in the decision making of agents. In contrast to previous work, we
did not restrict this parameter to a specific distribution, which gives the model more
flexibility. Moreover, in real life there is a correlation between the user demand
for different products in the market. A customer's desire for a specific product
is related to his/her desire toward other similar products. Matrix M models this
correlation, and we consider its effect in our formulation. The ultimate goal of our
marketing problem is to recognize the influential agents in the graph and define a set
of connections between the $A_P$ agents and $A_R$ agents in such a way as to maximize the
long-term desire of the agents for the products. Note that the links between Product
agents and Regular agents are directed links from products to agents and not in the
opposite direction.
3.2 Generalized ICM
We use a generalized version of ICM similar to [13, 21]. The dynamics of the model
at each iteration k proceed as follows:
1. Agent i initiates the interaction according to a uniform probability distribution
over all agents. Then agent i selects another agent among its neighbors with
probability $p_{ij}$. Note that the desire dynamic can occur with probability
$\frac{1}{N}(p_{ij} + p_{ji})$, as agent i's attitude can change whether it initiates the
interaction or is selected by agent j.
2. Conditioned on the interaction of i and j:
With probability $\alpha_{ij}$, agent i will change its desire:
\[
\begin{aligned}
X_i(k+1) &= \epsilon_{ij} M X_i(k) + (1 - \epsilon_{ij}) M X_j(k) \\
X_j(k+1) &= X_j(k)
\end{aligned}
\tag{3}
\]
Recall that M is the pre-defined matrix indicating the correlation between the
demands of different products.
With probability $(1 - \alpha_{ij})$, agent i is not influenced by the other agent:
\[
\begin{aligned}
X_i(k+1) &= X_i(k) \\
X_j(k+1) &= X_j(k)
\end{aligned}
\tag{4}
\]
It is worthwhile to note that the above interaction model reduces to the
IC model if we set $\epsilon_{ij} = 0$, $M = I$, and restrict the $p_{ij}$'s to be equal to 1 right after
activation of any node and equal to 0 the rest of the time. Also, since the values of
the desire vector range over $[-1, 1]$, the $x_{ip}$'s in $[0, 1]$ and the $x_{ip}$'s in $[-1, 0]$ can be
quantized to 1 and 0, respectively, to match the IC model representation of activation
and deactivation.
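The update rules of Eqs. (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: desire vectors are stored as plain lists, the helper names `mat_vec` and `interact` are hypothetical, and the random source is passed in so that the step can be made deterministic.

```python
import random


def mat_vec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def interact(X, i, j, alpha_ij, eps_ij, M, rng=random):
    """Apply Eqs. (3)-(4) to agent i after it meets agent j; X is modified in place."""
    if rng.random() < alpha_ij:
        # Eq. (3): agent i mixes its own desire with that of agent j.
        mxi, mxj = mat_vec(M, X[i]), mat_vec(M, X[j])
        X[i] = [eps_ij * a + (1 - eps_ij) * b for a, b in zip(mxi, mxj)]
    # Otherwise Eq. (4) applies and X[i] is unchanged.
    # In both cases X[j] keeps its value.
    return X


# Toy example with two products and M = I (hypothetical values).
M = [[1.0, 0.0], [0.0, 1.0]]
X = [[0.0, 0.0], [1.0, -1.0]]
interact(X, 0, 1, alpha_ij=1.0, eps_ij=0.4, M=M)
```

With $\alpha_{ij} = 1$ the influence branch is always taken, so agent 0 moves 60 % of the way toward agent 1 in this example.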
3.3 HIM Algorithm
Using these assumptions about customer product adoption dynamics, we devised a
new scalable optimization technique, Hierarchical Influence Maximization (HIM).
The pseudocode of our proposed HIM algorithm is presented in Table 1. Here, matrix
E represents the connection matrix among Regular agents, and matrices P and A
contain all the $p_{ij}$'s and $\alpha_{ij}$'s of the market model, respectively. In other words, all the
interaction and influence probabilities between pairs of Regular agents ($A_R$)
are embedded in the elements of these matrices. Agent contains all the information
about Regular and Product agent characteristics, including desire vectors (the $X_i$'s) and
influence tag vectors (the $I_i$'s, of size P), where $I_{ip}$ indicates the number of times that
agent i has been selected as an influential node for product p. The algorithm receives
as input all the available data on the agents and the model, and the output of the
algorithm is the U matrix that contains the assignments of the $u_{ji}$'s and shows the final
connection matrix between all the products and influential seed nodes.
The level of the hierarchy is indicated by parameter H, which increments until the
stopping criteria are satisfied. At each hierarchy (H), we iterate over all the nodes
(i's) in the network of that hierarchy ($E^H$) and list the neighboring agents around
Table 1 HIM Algorithm
HIM (Agent, E, P, A, A_R, H_max, r)
    H = 0
    E^H = E
    N^H = |A_R|
    While stopCriteria do
        H = H + 1
        infList = NULL
        for i = 1 to N^H do
            neighborList = FindNeighborList (i, r, E^H)
            E_i^H = Subgraph (neighborList, E^H)
            E_i^H = AddOutsideWorld (E^H, E_i^H)
            (P_i, A_i) = UpdateMat (E^H, P, A, neighborList)
            L = Optimize (Agent, E_i^H, P_i, A_i)
            infList = infList ∪ L
            Agent = UpdateAgent (infList)
        end for
        N^H = |infList|
        U = MakeU (Agent)
        stopCriteria = UpdateCriteria (infList, H)
        E^H = UpdateHierarchy (infList)
    end while
    return U
each node. The radius of the neighborhood, denoted by parameter r, indicates the
granularity of analysis. Based on radius r, we partition the network into subsections
($E_i^H$) and update the probability matrices, $P_i$ and $A_i$, for that subsection. HIM selects
the influential agents in that local network, $E_i^H$, using an optimization technique and
tags them for future use. The process of node selection is described in detail in
Sect. 3.3.2. Then we add these influential nodes to the set of influential nodes that
have been identified in other neighborhoods in the same hierarchy.
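The neighborhood step (FindNeighborList in Table 1) can be sketched as a breadth-first search limited to r hops. The chapter does not spell out how the radius is measured, so the sketch assumes hop count over an adjacency dictionary; the function name and data layout are illustrative only.

```python
from collections import deque


def find_neighbor_list(i, r, adj):
    """Return the sorted list of nodes within r hops of node i.

    adj maps each node to an iterable of its neighbors (assumed layout).
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond the radius
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sorted(dist)


# Toy example: an undirected path 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
near_zero = find_neighbor_list(0, 2, adj)
```

A larger r produces fewer but larger partitions, which is the granularity trade-off described above.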
3.3.1 Outside World Effect
When a local neighborhood is detached from the complete network, there exist
boundary nodes that are connected to nodes outside the neighborhood. These connections
that fall outside of the neighborhood can potentially affect the desire vector
of agents within the neighborhood. One possible approach is to ignore these effects
and only consider the nodes inside the partition. In this chapter we account for these
effects by allocating a virtual node to each boundary node. This virtual node is the
representative of all nodes outside the neighborhood that are connected to the boundary
node. Figure 3 illustrates the abstraction of the outside world effect and shows how
the model's parameters are calculated between each boundary node and its virtual node.
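A minimal sketch of this abstraction follows, assuming (as the caption of Fig. 3 states) that the parameter on the link to a virtual node is the average over the boundary node's links of the same direction to the outside world. The edge-dictionary layout and the names `outside_world`, `to_virtual`, and `from_virtual` are hypothetical.

```python
def outside_world(E, inside):
    """Summarize links crossing the neighborhood boundary.

    E: dict mapping (u, v) to the weight of the directed edge u -> v.
    inside: set of nodes that belong to the neighborhood.
    Returns, per boundary node, the average weight toward and from the outside,
    which would be placed on its links with a virtual node.
    """
    out_w, in_w = {}, {}
    for (u, v), w in E.items():
        if u in inside and v not in inside:
            out_w.setdefault(u, []).append(w)  # boundary node -> outside
        elif v in inside and u not in inside:
            in_w.setdefault(v, []).append(w)   # outside -> boundary node
    virtual = {}
    for b in set(out_w) | set(in_w):
        virtual[b] = {
            "to_virtual": sum(out_w[b]) / len(out_w[b]) if b in out_w else None,
            "from_virtual": sum(in_w[b]) / len(in_w[b]) if b in in_w else None,
        }
    return virtual


# Toy example: neighborhood {a, b}; b has two outgoing and one incoming outside link.
E = {("a", "b"): 1.0, ("b", "x"): 0.2, ("b", "y"): 0.4, ("z", "b"): 0.5}
virtual = outside_world(E, inside={"a", "b"})
```

The same averaging could be applied to any other interaction parameter attached to the edges.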
3.3.2 Node Selection
The process of selecting influential nodes is repeated at each hierarchy and at each
local neighborhood surrounding node i. Following previous works [12, 13, 21], we
model the desire dynamic of all agents as a Markov chain where the state of the local
neighborhood is a matrix of all existing agents' desire vectors at a particular iteration k
and the state transitions are calculated probabilistically from the pair-wise interaction
Fig. 3 The network on the left is an example of a neighborhood around node e; the network on
the right is the equivalent network with virtual nodes representing the outside world effect. Here
w can be any interaction parameter such as link weight, $\alpha$, or $\epsilon$. The direction of the interaction
with the virtual node is based on the type of links the boundary node has with the nodes outside the
neighborhood. The value of the parameter is the average over all similar types of interactions with
the outside world
between agents connected in a network. The state of the local network around agent
i at the kth iteration is a vector of random variables, denoted as
$\mathbf{X}_i(k) \in \mathbb{R}^{N_i^H P \times 1}$
(created through a concatenation of $N_i^H$ vectors of size P) and expressed as:
\[
\mathbf{X}_i(k) =
\begin{bmatrix}
X_1(k) \\
\vdots \\
X_{N_i^H}(k)
\end{bmatrix}
\]
We calculate the expected long-term desire of the agents in each local network
around agent i, which results in the following formulation:
\[
E[\mathbf{X}_i(k+1)] = E[\mathbf{X}_i(k)] + Q_i \, E[\mathbf{X}_i(k)]. \tag{5}
\]
In order to solve this system of equations efficiently, we decompose the matrices:
\[
Q = \begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}
\quad \text{and} \quad
\mathbf{X}(\infty) = \begin{bmatrix} \mu_R \\ \mu_P \end{bmatrix}
\tag{6}
\]
Here $A \in \mathbb{R}^{RP \times RP}$ is the sub-matrix representing the expected interactions among
Regular agents while $B \in \mathbb{R}^{RP \times P^2}$ represents the expected interactions between
Regular agents and Product agents. Figure 4 shows the breakdown of matrix Q.
Fig. 4 Q matrix is a block matrix with size $N \times N$, where N is the total number of agents (R + P)
and each block has the size of $P \times P$. Matrices A and B are the non-zero part of this matrix,
which represent the interactions among Regular agents and interactions between Regular agents
and Products, respectively
Moreover, $\mu_R$ and $\mu_P$ are vectors representing the expected long-term desire
of Regular agents and Product agents, respectively, as $k \to \infty$. Note that
vector $\mu_P$ is known since the Product agents, the advertisers, are the immutable
agents, who never change their desire. Solving for $\mu_R$ yields the vector of expected
long-term desire for all regular agents, for a given set of influence probabilities on a
deterministic social network.
\[
A \mu_R + B \mu_P = 0
\quad \Rightarrow \quad
\mu_R = A^{-1}(-B \mu_P) \tag{7}
\]
Thus, we can identify the influential nodes in the network and connect the products
to those agents in a way that maximizes the long-term desire of the agents in the social
system. We define the objective function as the maximization of the weighted average
of the expected long-term desire of all the Regular agents in the network toward all
the products as:

\[
\max_{u} \; \sum_{1 \le k \le P} \; \sum_{i \in A_R} (\rho_i \, \mu_{R,i}) \tag{8}
\]
Here $\mu_{R,i}$ is the part of $\mu_R$ that belongs to agent i, and the $\rho_i$ parameter is simply a weight
we can assign to agents based on their importance in the network. In the case of
equal weights, $\rho_i = 1$ for all the agents, the above function reduces to the arithmetic
mean of the expected long-term desire vectors for all agents.
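To make the steady-state computation concrete, the sketch below solves $A \mu_R + B \mu_P = 0$ for a toy neighborhood and checks the result against repeated application of Eq. (5). It assumes a single product, so every block is a scalar, and the numerical values of A and B are hypothetical; the small Gaussian-elimination helper only keeps the example free of external libraries.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [list(map(float, A[r])) + [float(b[r])] for r in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def steady_state(A, B, mu_P):
    """Expected long-term desire of Regular agents: A mu_R = -B mu_P."""
    rhs = [-sum(b * m for b, m in zip(row, mu_P)) for row in B]
    return solve(A, rhs)


def objective(mu_R, rho):
    """Weighted sum of expected long-term desire over Regular agents."""
    return sum(r * m for r, m in zip(rho, mu_R))


# Two Regular agents, one Product agent, one product (hypothetical values).
A = [[-0.5, 0.2], [0.1, -0.4]]
B = [[0.3], [0.3]]
mu_P = [1.0]
mu_R = steady_state(A, B, mu_P)

# Cross-check: iterate Eq. (5) for the Regular agents from a zero initial desire.
x = [0.0, 0.0]
for _ in range(200):
    dx = [sum(a * v for a, v in zip(A[i], x)) + sum(b * m for b, m in zip(B[i], mu_P))
          for i in range(2)]
    x = [v + d for v, d in zip(x, dx)]
```

In this toy case each row of Q sums to zero, so both Regular agents converge to the fixed desire of the Product agent.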
3.3.3 Convergence
Using the Brouwer fixed-point theorem [18], we prove that each local neighborhood
has a fixed point, hence solving Eq. (5) at steady state is a valid choice. The theorem
states that:
Theorem 1 Every continuous function from a closed ball of a Euclidean space to
itself has a fixed point.
According to the calculation of Eq. (5), $E[\mathbf{X}_i(k+1)]$ is a continuous function as
it is the sum of two continuous ones. Also, since $X_i(k+1)$ in Eq. (3) is a bounded
function in $[-1, 1]$, its expectation ($E[\mathbf{X}_i(k+1)]$) will be bounded as well. As a
result we have a bounded, continuous function, which is guaranteed a fixed point
by the Brouwer fixed-point theorem. This allows us to solve our problem with the
proposed optimization algorithm to find the assignment of the $u_{ji}$'s in a way that maximizes
the long-term expected desire vector of agents toward all the products in the market.
3.3.4 Update Hierarchy
When we proceed from one hierarchy to the next one, the selected nodes which
are propagated to the upper hierarchy are not necessarily adjacent. Therefore, we
need to define the interaction model between them based on their position in the
real network. The UpdateHierarchy function is responsible for building the proper
network connection and interaction model for the next hierarchy based on the selected
influential nodes in the current hierarchy. These nodes were propagated to the higher
hierarchy by being selected as influential nodes in at least one local neighborhood. It
is possible for a node to be present in multiple partitions and be selected more than
once.
Note that the selected nodes are unlikely to be adjacent nodes in the actual network
E. Therefore we need to find a way to form their connections to construct $E^H$. To
do so, we look at the shortest path between these nodes in network E and use that to
calculate the weight of the edges in $E^H$. In the $E^H$ network the weight of the link
between two selected nodes is the product of the weights along the shortest path between
these two nodes in the previous hierarchy. Also, the probabilities of interaction and
influence between two influential nodes are set to be the product of the probabilities
along the shortest path between them.
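The construction of $E^H$ can be sketched as follows. The chapter does not say how path length is measured, so this sketch assumes the shortest path is the one with the fewest hops (found by breadth-first search) and multiplies the edge weights along it; when several fewest-hop paths exist, the first one found is used. The names `path_product` and `update_hierarchy` and the nested-dictionary layout are assumptions.

```python
from collections import deque


def path_product(E, src, dst):
    """Product of edge weights along a fewest-hop path src -> dst, or None if unreachable.

    E maps each node to a dict {neighbor: weight} of outgoing edges.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in E.get(u, {}):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if dst not in parent:
        return None
    prod, v = 1.0, dst
    while parent[v] is not None:  # walk back from dst to src
        prod *= E[parent[v]][v]
        v = parent[v]
    return prod


def update_hierarchy(E, selected):
    """Build the edge weights of the next-level network among the selected nodes."""
    next_level = {}
    for s in selected:
        for t in selected:
            if s != t:
                w = path_product(E, s, t)
                if w is not None:
                    next_level[(s, t)] = w
    return next_level


# Toy example: a -> b -> c, with a and c selected as influential.
E = {"a": {"b": 0.5}, "b": {"c": 0.4}, "c": {}}
E_next = update_hierarchy(E, ["a", "c"])
```

The same routine could be reused for the interaction and influence probabilities by passing those values as the edge weights.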
3.3.5 Termination Criteria
To terminate the loop, we establish two different criteria in the UpdateCriteria
function. This function checks the stopping criteria based on the level of the hierarchy
and the list of influential nodes. One criterion is based on the maximum number
of levels in the hierarchy and the other is based on the ratio of the selected influential
nodes to the advertising budget. According to the stopCriteria output, the algorithm
decides whether to proceed to a higher hierarchy or to stop the search, returning the
current U matrix to be used as the advertising assignment.
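A minimal sketch of such a check is given below. It assumes that the ratio criterion means "the number of selected nodes no longer exceeds the advertising budget", which is one reading of the description above; the parameter names `h_max` and `budget` are illustrative, and the function returns True when the loop should end.

```python
def update_criteria(inf_list, H, h_max, budget):
    """Return True when the hierarchy loop should stop.

    Stops when the maximum hierarchy level is reached, or when the ratio of
    selected influential nodes to the advertising budget is at most 1
    (an assumed interpretation of the ratio criterion).
    """
    reached_max_level = H >= h_max
    within_budget = len(inf_list) / budget <= 1.0
    return reached_max_level or within_budget


# Example: 60 candidates against a budget of 50 at level 2 of at most 5.
keep_going = not update_criteria(list(range(60)), H=2, h_max=5, budget=50)
```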
3.3.6 Optimization Procedure
The best assignment of Product agents to Regular agents is obtained through solving
the following optimization problem:
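\[
\begin{aligned}
\underset{u}{\text{maximize}} \quad & \left\| A^{-1} \, \mathrm{Vec}(M \mu_P u) \right\|_1 \\
\text{subject to} \quad & x_{ip} \in [-1, 1], \quad \forall i \in A_R, \\
& \sum_{j \in A_R} u_{ij} = \text{cte}.
\end{aligned}
\tag{9}
\]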
Here, we are looking for a set of $u_{ji}$'s that minimizes our cost or, in other
words, maximizes the desire value of agents. Since the $u_{ji}$'s indicate the existence or lack
of a connection between Product and Regular agents, they are binary variables and can
be identified using mixed integer programming. To solve our optimization problem,
we used the GNU Linear Programming Kit (GLPK) package, which is designed for
solving large-scale linear programming (LP) and mixed integer programming (MIP)
problems. GLPK is a set of routines written in ANSI C and organized in the form of
a callable library which is free to download from http://www.gnu.org/software/glpk.
4 Evaluation
4.1 Experimental Setup
We conducted a set of simulation experiments to evaluate the effectiveness of our
proposed node selection method on marketing items in a simulated social system with
a static network. The parameters of the interaction model for all runs are summarized
in Table 2a. All results are averaged over 100 runs, which represent
ten different simulations on each of ten network structures.
In the Regular and Product agent interactions, parameters $\epsilon$ and $\alpha$ are fixed for
a given interaction and are presented in Table 2a. We assume that these parameters
can be calculated by advertising companies based on user modeling. The $p_{ij}$ values
for this type of interaction are calculated using Eq. (1) and are parametric. Table 2b
provides the parameters for our HIM algorithm (neighborhood radius and the maximum
hierarchy level). The remaining part of the social system setup is given by
matrix M, which models the correlation between the demand for different products.
This matrix is generated from uniform random numbers in $[0, 1]$ and, as it
has a probabilistic interpretation, the sum of the values in each row, showing the total
demand for an item, is equal to one.
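One simple way to obtain such a matrix is to draw uniform random entries and divide each row by its sum. The sketch below assumes this row normalization; the chapter does not state the exact generation procedure, and the function name is hypothetical.

```python
import random


def random_correlation_matrix(P, rng=random):
    """Return a P x P matrix of uniform random entries with each row scaled to sum to one."""
    rows = [[rng.random() for _ in range(P)] for _ in range(P)]
    return [[v / sum(row) for v in row] for row in rows]


# Example with ten products and a seeded generator for repeatability.
M = random_correlation_matrix(10, random.Random(0))
```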
4.2 Benchmarks
We compared our hierarchical algorithm with the non-hierarchical version, Optimized
Influence Maximization (OIM), described in [21] and a set of centrality-based
Table 2 Parameter settings
Parameter      Value      Description
(a) Market model parameters
Threshold      2          Number of links between P and R agents
$\epsilon$     0.4        Influence factor between P and R agents
$\alpha$       0.8        Probability of influence between P and R agents
R              Variable   Number of Regular agents
P              10         Number of Product agents
NIterations    60,000     Number of iterations
NRun           10         Number of runs
NNet           10         Number of different networks
(b) HIM parameters
r              3          Neighborhood radius
Hmax           5          Max level of hierarchy
measures commonly used in social network analysis for identifying influential nodes
based on network structure [14].
OIM: The Optimized Influence Maximization method finds the influential nodes
globally using our optimization method on the original network.
Degree: Assuming that high-degree nodes are influential nodes in the network,
we calculated the probability of advertising to a Regular agent based on the out-
degree of the agents and linked the Product agents according to a preferential
attachment model. Therefore, nodes with higher degree had an increased chance
of being selected as an advertising target.
Betweenness: This centrality metric measures the number of times a node appears
on the geodesics connecting all the other nodes in the network. Nodes with the
highest value of betweenness had the greatest chance of being selected as an
influential node.
PageRank: On the assumption that the nodes with the greatest PageRank score
have a higher chance of influencing the other nodes, we based the probability of
node selection on its PageRank value.
Random: In this baseline, we simply select the nodes uniformly at random.
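The score-proportional baselines (Degree, Betweenness, PageRank) share one sampling step, which can be sketched as below. The sketch assumes that targets are drawn without replacement with probability proportional to a precomputed score such as out-degree; the function name and the toy scores are hypothetical, and computing betweenness or PageRank themselves is outside its scope.

```python
import random


def proportional_targets(score, k, rng=random):
    """Draw up to k distinct nodes, each draw proportional to the remaining nodes' scores."""
    pool = dict(score)
    chosen = []
    for _ in range(min(k, len(pool))):
        nodes = list(pool)
        weights = [pool[n] for n in nodes]
        if sum(weights) == 0:
            break  # only zero-score nodes remain
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Toy example: out-degrees of three Regular agents, two advertising links to place.
out_degree = {"a": 5, "b": 0, "c": 1}
targets = proportional_targets(out_degree, k=2, rng=random.Random(1))
```

Passing equal scores for all nodes gives uniform sampling, which corresponds to the Random baseline.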
To evaluate these methods, we started the simulation with an initial desire vector
set to 0 for all agents, and simulated 60,000 iterations of agent interactions. The
entire process of interaction and influence is governed by Eqs. (3) and (4) (Sect. 3.2).
At each iteration, we calculated the average of the expected desire value of the agents
toward all products. This average is calculated over 100 runs (10 simulations on 10
different network structures) for the synthetic dataset and 100 runs on the real-world
datasets. Note that the desire vector of each Product agent remains fixed for all products;
in our simulation it was set to 1 for the product itself and 0.1 for all other products
(e.g., [1 0.1 0.1 . . . 0.1] for the first product).
4.3 Synthetic Dataset
For the synthetic dataset, we used the same network generation technique described
in [21] for generating customer networks. To compare the performance of these
methods, Fig. 5 shows the average expected desire value of the agents over time in a
network with 150 agents. Here we selected 150 agents as an optimal number
of agents to compare all the algorithms together. With fewer agents, having ten
simultaneously marketed products saturates the network while with a larger number
of agents OIM suffers from scalability issues.
4.3.1 Marketing Effectiveness
In Fig. 5, by using the marketing-specific optimization methods for allocating the
advertising budget, the desire value of the agents toward all products increases the
Fig. 5 The average of agents' expected desire versus number of iterations, calculated across all
products and over 100 runs (10 different runs on 10 different networks). The optimization methods
have the highest average in comparison to the centrality measurement heuristics. As HIM is a
sub-optimal method, it is unsurprising that its performance is worse than the global optimization
method, OIM. (Plot: x-axis Iterations, 0 to 6 x 10^4; y-axis Average of Agents Expected Desire;
curves for Random, Degree, Betweenness, HIM, OIM, and PageRank.)
most, resulting in the largest number of sales. Although HIM sacrifices some performance
in favor of scalability, it clearly outperforms the centrality measurement
methods. The locally optimal selection approach of HIM results in slightly lower
performance compared to the globally optimal OIM.
Figure 6 shows the final average value of the expected desire of agents in the
last iteration for different numbers of Regular agents. Although OIM with global
Fig. 6 The average of the final expected desire vectors for different numbers of Regular agents and
10 Product agents. The optimization-based methods (OIM and HIM) outperform the other methods
in selecting the seed nodes. While OIM is more successful than HIM in selecting the influential
nodes, it is unable to scale up to networks with 300 agents and higher. (Plot: x-axis Number of
agents, 50 to 300; y-axis Average of Expected Desire; series for Random, Degree, Betweenness,
HIM, OIM, and PageRank.)
Table 3 Runtime comparison between OIM and HIM
Number of agents   OIM (s)   HIM (s)
50                 10.67     74.09
100                94.76     160.80
150                290.67    208.97
200                897.51    354.35
optimization method outperforms HIM and other centrality measurement methods,
it is incapable of scaling up to 300 or more agents in the network due to a near-singular
interaction matrix. HIM, with its ability to scale up linearly, provides a sub-optimal
and yet practical solution for selecting the influential nodes in large networks.
4.3.2 Run-Time
Table 3 shows a runtime comparison between the two optimization methods, HIM
(proposed) and OIM (original). In small networks the runtime of the global optimization
method is less than that of the hierarchical method, but as the size of the network grows, its
runtime increases exponentially while the runtime of HIM increases at a slower
rate. The long runtime of OIM for networks larger than 200 nodes makes the
algorithm impractical for finding influential nodes in very large networks.
4.3.3 Jaccard Similarity
To analyze the differences between the algorithms' selections of influential nodes, we
use the Jaccard similarity measurement. This measurement is calculated by dividing
the size of the intersection of two selected sets by the size of the union of these sets.
Figure 7 shows this measurement for all pairs of algorithms. The OIM and HIM algorithms have the
highest similarity compared to the other methods, with a similarity value of 0.47.
The other pairs of methods have very low similarities, resulting in dark squares in
the figure. Not surprisingly, Random has the least similar node selection to other
methods. This shows that HIM finds many of the same nodes as the original OIM
algorithm, with a much lower runtime cost.
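The Jaccard similarity of two seed sets can be computed in a few lines of Python, as sketched below. The node identifiers are made up for the example, and returning 1.0 for two empty sets is a convention chosen here, not something stated in the chapter.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections of node ids."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty selections are treated as identical (assumed convention).
    return len(a & b) / len(union) if union else 1.0


# Example: two hypothetical seed sets sharing two of four distinct nodes.
similarity = jaccard([1, 2, 3], [2, 3, 4])
```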
4.4 Real-World Datasets
We also evaluated the performance and scalability of our proposed algorithm on
real-world directed networks from the Stanford Network Analysis Project
(http://snap.stanford.edu/).
Fig. 7 The average Jaccard similarity measurements between different methods, calculated over
100 runs (10 runs on 10 different networks). Lighter squares denote greater similarity between a
pair of algorithms. Note that HIM's selection of nodes is fairly close to OIM's optimal selection.
Values shown in the heatmap:
              Random   Degree   Betweenness   HIM    OIM    PageRank
Random        1.00     0.01     0.01          0.02   0.01   0.01
Degree        0.01     1.00     0.03          0.10   0.05   0.03
Betweenness   0.01     0.03     1.00          0.08   0.05   0.03
HIM           0.02     0.10     0.08          1.00   0.47   0.16
OIM           0.01     0.05     0.05          0.47   1.00   0.06
PageRank      0.01     0.03     0.03          0.16   0.06   1.00
WikiVote: The network contains all the Wikipedia voting data from the inception
of Wikipedia until January 2008. Nodes in the network represent Wikipedia users,
and a directed edge from node i to node j indicates that user i voted on user j.
SlashDot: Slashdot is a technology-related news website known for its user community.
The website features user-submitted technology-oriented news. In 2002 Slashdot
introduced the Slashdot Zoo feature, which allows users to tag each other as friends
or foes. This network contains friend/foe links between Slashdot users, obtained
in February 2009.
Epinions: This is a network extracted from the consumer review site Epinions.com.
Nodes are members of the site who have reviewed products. A directed edge from
i to j indicates that j trusts i's reviews (and thus i has influence over j).
In all the experiments on real-world social media, we have preprocessed the networks
to eliminate isolated nodes and boundary nodes (nodes with a degree of one).
Table 4a, b summarize the statistics of these real-world networks before and after
the preprocessing stages, respectively. We used the same experimental parameters
(presented in Sect. 4.1). The only differences are the number of products and the
advertising budget which are equal to 10 and 50, respectively.
We benchmarked our optimization methods against two state-of-the-art influence
maximization methods, Prefix-excluding Maximum Influence Arborescence (PMIA)
[25] and DegreeDiscount [9], in addition to the centrality measures.
PMIA: This heuristic algorithm [25] examines the local neighborhood of each
node to find the influence pattern in each local arborescence in order to estimate the
influence propagation across the network. To our knowledge, the PMIA algorithm
is the best scalable solution to the influence maximization problem under the
Independent Cascade Model.
DegreeDiscount: This heuristic algorithm, presented by Chen et al. [9], refines
the degree method by discounting the degree of nodes whenever a neighbor has
already been selected as an influential node.
Table 4 Statistics of the real-world networks
Dataset          WikiVote   SlashDot   Epinion
(a) Before pre-processing
#Nodes           7 K        82 K       76 K
#Edges           100 K      950 K      509 K
Average Degree   14.6       13.4       6.7
Maximal Degree   1,167      3,079      3,079
Diameter         7          11         14
(b) After pre-processing
#Nodes           2 K        72 K       20 K
#Edges           38 K       840 K      3700
Average Degree   31.1       10.5       28.9
Maximal Degree   714        5,059      256
Diameter         7          13         12
Although using a hierarchical approach reduces the problem of dealing with huge
interaction matrices, it is still possible for network partitions to be quite large if they
are centered on a high degree node that is connected to a large portion of the network.
In addition to creating huge interaction matrices, these nodes will create star-shaped
subgraphs which result in an infeasible solution for the optimization process. There
are two options for dealing with these very high degree nodes: (1) ignore
them when we partition the network and assume that their high connectivity guarantees
that they will appear within the network neighborhood of other nodes or (2)
ignore some of the low-degree neighbors of the node. In the following experiments,
we adopted the first approach in dealing with these large partitions. Therefore, in
all networks we only centered partitions around nodes with a degree less than 100.
Examining the average degree of nodes in all datasets presented in Table 4b shows
that this selection not only prevents huge matrices and star-shaped subgraphs but
also still gives us a high percentage of nodes to process. The following results have been
generated for the WikiVote and Epinion datasets.
4.4.1 Marketing Effectiveness
Figure 8 gives the average expected desire value for all the agents over time for
300 K iterations of the simulated market. In this result, the OIM algorithm has the
highest value while the HIM algorithm follows it closely. The performance of
the DegreeDiscount heuristic, PMIA, and PageRank algorithms is very close,
with no significant differences among them.
While our algorithms outperform the other benchmarks on the WikiVote dataset,
on the Epinion dataset the degree-based algorithms perform better. Figure 9 shows the
Fig. 8 The average of agents' expected desire versus number of iterations for the WikiVote dataset,
calculated across all products over 100 runs. The dataset was preprocessed by eliminating isolated
and boundary nodes, yielding 2 K nodes, and the simulation was run for 300 K iterations. The
optimization methods have the highest average in comparison to the rest of the benchmarks. As the
HIM algorithm is a sub-optimal method, its performance is lower than that of the global optimization
method. (Plot title: Average Desire value; x-axis iterations; y-axis Expectation; curves for HIM,
PMIA, PageRank, Degree, Degree Discount, and OIM.)
Fig. 9 The average of agents' expected desire versus number of iterations for the Epinion dataset,
calculated across all products, over 100 runs. The dataset was preprocessed by eliminating isolated
and boundary nodes, yielding 20 K nodes, and the simulation was run for 300 K iterations.
HIM outperforms PMIA and PageRank, but it is beaten by the degree-based algorithms, Degree and
DegreeDiscount. The OIM algorithm could not be run on this dataset, due to the size of the network.
(Plot title: Average Desire value; x-axis iterations; y-axis Expectation; curves for HIM, PMIA,
PageRank, Degree, and Degree Discount.)
Fig. 10 The final expected desire value of the agents at the end of the simulation for the different
methods and datasets. The OIM algorithm could not be run on the Epinion dataset, due to the size
of the network. (Bar chart: Final Average Desire Value for the Epinion and Wiki datasets; bars for
PMIA, OIM, HIM, Degree, DegreeDiscount, and PageRank.)
results for all the benchmarks and the HIM algorithm. Although the HIM performance
is better than PMIA and PageRank, it does not beat the degree-based algorithms,
Degree and DegreeDiscount.
Figure 10 summarizes the final expected desire value of agents for different
algorithms and for different datasets. The low value of the desire vector is a consequence
of having a low number of advertisers within huge networks; during influence propagation,
the agents' desire vectors are repeatedly multiplied by $\epsilon$ and $\alpha$.
4.4.2 Analysis of Dataset Degree Distributions
To understand the poor performance of HIM on the Epinion dataset, we examined the
network structure to see how the networks differ from one another. Table 5 shows
the quantile analysis of the node degree for the pre-processed datasets. Based on this
analysis we see that the WikiVote network is a very small network compared to the other
two datasets, yet the max degree of the lower quartiles is higher than in the other networks.
This indicates that the WikiVote network has a more uniform degree distribution,
where node degree is not likely to be a highly discriminating feature of influence
propagation potential.
This can be verified by looking at the degree distributions of the datasets (Figs. 11,
12, and 13). In the Epinion and SlashDot datasets we have a small number of nodes
Table 5 Quantile analysis of node degree in preprocessed datasets
Dataset    0 %   25 %   50 %   75 %    100 %
WikiVote   3     25     44     79.25   714
Epinion    0     6      11     33      2,684
SlashDot   3     4      7      17      5,061
Fig. 11 The degree histogram of the WikiVote dataset. The x-axis shows the logarithmic scale of
degree, and the curve shows the kernel density estimation. In this dataset the majority of nodes lie
in the middle range and have a degree between 50 and 100
Fig. 12 The degree histogram of the Epinion dataset. The x-axis shows the logarithmic scale of
degree, and the curve shows the kernel density estimation. In this dataset the network has a sparse
structure, with the majority of nodes possessing a degree less than 10
Fig. 13 The degree histogram of the SlashDot dataset. The x-axis shows the logarithmic scale of
degree, and the curve shows the kernel density estimation. As in the Epinion
dataset, the network has a sparse structure, with the majority of nodes possessing a degree less
than 10
with very high degrees while most of the nodes in the network possess a degree less
than 10. In these networks, a few nodes serve as hubs and are highly connected,
whereas the other nodes have few connections that, in the worst case, arent even
connected to the high degree node. Hence our heuristic of not centering the partitions
on high degree nodes sabotages the performance of HIMs optimization procedure.
On the other hand the degree-based algorithms can effectively target these high degree
nodes. In contrast, in the networks such as WikiVote or the synthetic networks where
the node degree is more uniform, HIM works well as the nodes in the middle bins
are more numerous and better connected to the entire network. In this case, the
degree-based algorithms perform poorly since degree is not as discriminative.
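The quantile analysis reported in Table 5 can be approximated in a few lines of Python. The sketch below is illustrative only: the function name `degree_quantiles`, the edge-list input format, and the use of linear interpolation between order statistics (the `inclusive` method of `statistics.quantiles`) are assumptions made here, not details given in the chapter. It also assumes an undirected graph without duplicate edges.

```python
from collections import Counter
from statistics import quantiles


def degree_quantiles(edges, nodes=()):
    """Return the 0/25/50/75/100 % quantiles of node degree.

    `edges` is an iterable of undirected (u, v) pairs; `nodes` optionally
    lists extra nodes so that isolated nodes are counted with degree 0.
    Both parameter names are hypothetical, chosen for this sketch.
    """
    degree = Counter({n: 0 for n in nodes})
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        degree[u] += 1
        degree[v] += 1
    values = sorted(degree.values())
    if len(values) < 2:
        raise ValueError("need at least two nodes")
    # Quartiles by linear interpolation between neighbouring order statistics.
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {"0%": values[0], "25%": q25, "50%": q50,
            "75%": q75, "100%": values[-1]}


# A star graph: one hub (node 0) linked to eight leaves, plus one isolated node.
star = [(0, leaf) for leaf in range(1, 9)]
summary = degree_quantiles(star, nodes=[99])
```

A large gap between the 75 % quantile and the maximum, as in this toy star graph, is the same signature of a few dominant hubs that the Epinion and SlashDot rows of Table 5 exhibit.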
4.4.3 Optimization with Degree-Based Heuristic
Based on these results, we modified our preprocessing procedure to use a
degree-based heuristic to select the nodes considered by our optimization technique.
Here, we selected the top 5 % of high degree nodes in the Epinion dataset and created
a single-level abstracted network based on the shortest path among these nodes. Then
we ran our optimization technique (OIM) on the single network. Figure 14 shows the
result of OIM and other benchmarks on this preprocessed network. The result shows
that applying optimization to the abstracted network conclusively outperforms the
other benchmarks.
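One way to picture this preprocessing step is sketched below in Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the graph is treated as undirected and unweighted, the abstracted network links every pair of selected high degree nodes with an edge weighted by their hop distance, and the names `build_adjacency`, `top_degree_nodes`, `bfs_distances`, and `abstract_network` are hypothetical.

```python
import math
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def top_degree_nodes(adj, fraction):
    """Return the top `fraction` of nodes ranked by degree (ties by node id)."""
    k = max(1, math.ceil(fraction * len(adj)))
    ranked = sorted(adj, key=lambda n: (-len(adj[n]), n))
    return ranked[:k]


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def abstract_network(edges, fraction=0.05):
    """Keep only high degree nodes; weight each pair by shortest-path length."""
    adj = build_adjacency(edges)
    hubs = top_degree_nodes(adj, fraction)
    abstracted = {}
    for i, hub in enumerate(hubs):
        dist = bfs_distances(adj, hub)
        for other in hubs[i + 1:]:
            if other in dist:  # skip pairs in different components
                abstracted[(hub, other)] = dist[other]
    return hubs, abstracted


# Two stars (hubs 0 and 10) joined through the leaf-to-leaf bridge 5-11.
toy_edges = ([(0, i) for i in range(1, 6)]
             + [(10, j) for j in range(11, 16)]
             + [(5, 11)])
hubs, abstracted = abstract_network(toy_edges, fraction=0.15)
```

Running one breadth-first search per selected node keeps the cost proportional to the number of hubs times the size of the graph, which is why restricting the selection to a small top fraction makes the abstracted network cheap to build.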
[Figure 14 plot. y-axis: Average Desire value; x-axis: iterations; curves: PMIA, PageRank, Degree, Degree Discount, OIM, No U, Expectation]
Fig. 14 The average of agents' expected desire versus number of iterations for the Epinion dataset,
calculated across all products and over 10 different runs, for 300 K iterations. The dataset was
preprocessed by selecting the 1 % top degree nodes and building a subgraph based on the shortest
path between these nodes, rendering the graph small enough to be directly processed with OIM.
OIM outperforms the degree-based methods
In this chapter, we address the problem of influence maximization in social networks
for the purpose of advertising. In an advertising domain, our goal is to identify
the influential nodes in a social network as advertiser targets based on the network
structure, the interactions among the agents in the network, and the limited advertising
budget. We adopted agent-based modeling to model such a social system as it is
a powerful tool for the study of phenomena that are difficult to study within the
confines of the laboratory. We also attempted to model the market, the interactions and
propagation of influence, and the product adoption more realistically by incorporating
factors such as product correlation and group membership of agents.
Here we present a general hierarchical approach for applying optimization
techniques to influence maximization. The advantage our method has over network-only
seed selection techniques is that it can account for item correlations and community
effects on the product adoption rate. Our method comes close to the optimal
node selection, at substantially lower runtime costs. However, prior analysis of the
degree distribution of the network is essential for identifying the correct
preprocessing and abstraction procedure. The HIM algorithm can be used to improve
the scalability of influence maximization on networks with a semi-uniform degree
distribution. In networks with a high centralization, we recommend applying our
optimization technique to an abstracted version of the network created from the high
degree nodes. In this chapter, we have proposed one approach to partitioning the
network into overlapping sections and performing influence maximization on the
partitions. Another alternative would be to leverage preexisting network divisions
computed with community detection algorithms for the first level of the hierarchy.
Furthermore, working with dynamic networks where the agents can enter and leave
the network would be useful for practical applications in which the pool of customers
is constantly changing.
An important potential extension of this work would be to generalize the market
simulation to explicitly model the adversarial effects between competing advertisers
as a Stackelberg competition, in which one advertiser places ads and subsequent
competitors have knowledge of existing ad placement. In this chapter we assumed that
the probability of interaction and influence between two agents is small, compared
to the size of the network, which results in the agents sticking to a decision for a
reasonable period of time. However, if the network is smaller or the probability of
interaction increases, there can be large fluctuations in the agents' desire vectors.
Applying a parameter to the model which forces the agents to retain their decisions
for a minimum period, regardless of external interactions, would ameliorate this
issue [20]. A more general framework for modeling and simulating customer product
adoption within social networks would be of great practical importance; our model
represents initial steps towards this ambitious goal.
Acknowledgments This research was supported in part by NSF IIS-08451.
References
1. Anagnostopoulos A, Kumar R, Mahdian M (2008) Influence and correlation in social networks.
In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and
data mining, pp 7–15
2. Apolloni A, Channakeshava K, Durbeck L, Khan M, Kuhlman C, Lewis B, Swarup S (2009)
A study of information diffusion over a realistic social network model. In: Proceedings of the
international conference on computational science and engineering, pp 675–682
3. Aral S, Walker D (2012) Identifying influential and susceptible members of social networks.
Science 337(6092):337–341
4. Bagherjeiran A, Parekh R (2008) Combining behavioral and social network data for online
advertising. In: IEEE international conference on data mining workshops (ICDMW), pp 837–846
5. Bharathi S, Kempe D, Salek M (2007) Competitive influence maximization in social networks.
In: Deng X, Graham FC (eds) Internet and network economics. Springer, Berlin, pp 306–311
6. Borodin A, Filmus Y, Oren J (2010) Threshold models for competitive influence in social
networks. In: Saberi A (ed) Internet and network economics. Springer, Berlin, pp 539–550
7. Chen W, Collins A, Cummings R, Ke T et al (2011) Influence maximization in social networks
when negative opinions may emerge and propagate. In: Proceedings of the SIAM international
conference on data mining
8. Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing
in large-scale social networks. In: Proceedings of the ACM SIGKDD international conference
on knowledge discovery and data mining, pp 1029–1038
9. Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In:
Proceedings of the ACM SIGKDD international conference on knowledge discovery and data
mining, pp 199–208
10. Chen W, Yuan Y, Zhang L (2010) Scalable influence maximization in social networks under the
linear threshold model. In: Proceedings of the IEEE international conference on data mining
(ICDM), pp 88–97
11. Hartline J, Mirrokni V, Sundararajan M (2008) Optimal marketing strategies over social
networks. In: Proceedings of the international conference on world wide web. ACM, pp 189–198
12. Hung B (2010) Optimization-based selection of influential agents in a rural Afghan social
network. Master's thesis, Massachusetts Institute of Technology
13. Hung B, Kolitz S, Ozdaglar A (2011) Optimization-based influencing of village social networks
in a counterinsurgency. In: Proceedings of the international conference on social computing,
behavioral-cultural modeling and prediction, pp 10–17
14. Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social
network. In: Proceedings of the ACM SIGKDD international conference on knowledge
discovery and data mining. ACM, pp 137–146
15. Kempe D, Kleinberg J, Tardos É (2005) Influential nodes in a diffusion model for social
networks. In: Automata, Languages and Programming, pp 1127–1138
16. Kimura M, Saito K (2006) Tractable models for information diffusion in social networks. In:
Knowledge discovery in databases (PKDD), pp 259–271
17. Kimura M, Saito K, Nakano R, Motoda H (2009) Finding influential nodes in a social network
from information diffusion data. Social computing and behavioral modeling. Springer, New
York, pp 1–8
18. Leborgne D (1982) Calcul différentiel et géométrie. Presses universitaires de France
19. Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective
outbreak detection in networks. In: Proceedings of the ACM SIGKDD international conference
on knowledge discovery and data mining, pp 420–429
20. Liow L, Cheng S, Lau H (2012) Niche-seeking in influence maximization with adversary. In:
Proceedings of the annual international conference on electronic commerce. ACM, pp 107–112
21. Maghami M, Sukthankar G (2010) Identifying influential agents for advertising in multi-agent
markets. In: Proceedings of the international conference on autonomous agents and multiagent
systems, pp 687–694
22. Maghami M, Sukthankar G (2013) Hierarchical influence maximization for advertising in
multi-agent markets. In: Proceedings of the IEEE/ACM international conference on advances
in social networks analysis and mining. Niagara Falls, Canada, pp 21–27
23. Pathak N, Banerjee A, Srivastava J (2010) A generalized linear threshold model for multiple
cascades. In: International conference on data mining (ICDM), pp 965–970
24. Shakarian P, Paulo D (2012) Large social networks can be targeted for viral marketing with
small seed sets. In: Proceedings of the IEEE/ACM international conference on advances in
social networks analysis and mining (ASONAM), pp 1–8
25. Wang C, Chen W, Wang Y (2012) Scalable influence maximization for independent cascade
model in large-scale social networks. Data Min Knowl Discov 1–32
26. Yang W, Dia J, Cheng H, Lin H (2006) Mining social networks for targeted advertising.
In: Proceedings of the annual Hawaii international conference on system sciences. IEEE
Computer Society
Glossary
Centrality measures: Measures of the relative importance of a node in a graph based
on its position within the network. Commonly used measures include degree,
betweenness, closeness, and eigenvector centrality.
Community detection: The action of automatically finding groups of highly connected
nodes in graphs, also called communities.
Consensual communities: A consensual community is a set of nodes which are
frequently classified in the same community through multiple computations.
Edge clustering: An alternate form of clustering in networks in which the
edges are grouped rather than the nodes.
Elite grouping: An elite grouping in social networks is commonly structured around the
group concept but distinguished by particular characteristics such as a strategic role,
a durable behavior, or a salient semantic character influencing or dominating the
network.
Group cohesion: This subjective concept reflects how strongly a group of entities
connect to one another as a whole, either from a qualitative or quantitative
standpoint.
Heterogeneous collaboration network: A network that associates a set of nodes
with different families of ties. It is also called a multiplex network, in which each
pair of nodes can be connected through multiple links.
Homophily relationships: A category of relationships that link entities whenever
they exhibit similar features.
Influence maximization: The identification of a small set of nodes capable of
triggering large behavior cascades that spread through the network.
Influence propagation model: A model that seeks to express the process by
which nodes affect their network neighbors. Two commonly used propagation
models are the linear threshold model and the independent cascade model.
Defining the influence propagation model for a network is an important precursor to
solving the influence maximization problem.
Link prediction: The problem of link prediction can be formally defined as follows:
given a disjoint node pair (x, y), predict whether the pair has a relationship or, in the
case of dynamic interactions, will form one in the near future.
Network abstraction: A representation of the network in
which less important nodes are omitted from being explicitly represented. It can
be used to create a downsampled version of the network that is computationally
cheaper to browse.
Perspective community: A set of participating actors and the temporal ties they
share for joint activities performed during a given time period.
Possibility theory: A mathematical theory for dealing with certain types of
uncertainty.
Random graphs: A graph is random if its edges are created according to a probability
distribution or by a random process.
Scale-free network: A network whose degree distribution follows a power law, at
least asymptotically.
Social network analysis: Use of graph network theory together with other methods
and techniques to analyze social networks.
Temporal dynamic model: A temporal dynamic model of a social network is a more
realistic representation of the network development process in time, in which
temporal information is expressed.
