Jiaojiao Jiang
Sheng Wen
Bo Liu
Shui Yu
Yang Xiang
Wanlei Zhou
Malicious Attack
Propagation
and Source
Identification
Advances in Information Security
Volume 73
Series editor
Sushil Jajodia, George Mason University, Fairfax, VA, USA
More information about this series at http://www.springer.com/series/5576
Jiaojiao Jiang, Swinburne University of Technology, Hawthorne, Melbourne, VIC, Australia
Sheng Wen, Swinburne University of Technology, Hawthorne, Melbourne, VIC, Australia
Bo Liu, La Trobe University, Bundoora, VIC, Australia
Shui Yu, University of Technology Sydney, Ultimo, NSW, Australia
ISSN 1568-2633
Advances in Information Security
ISBN 978-3-030-02178-8
ISBN 978-3-030-02179-5 (eBook)
https://doi.org/10.1007/978-3-030-02179-5
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In the modern world, the ubiquity of networks has made us vulnerable to various
malicious attacks. For instance, computer viruses propagate throughout the Internet
and infect millions of computers. Misinformation spreads incredibly fast in online
social networks, such as Facebook and Twitter. Experts say that “fake news” on
social media platforms influenced US election voters. Researchers and manufacturers continually develop new methods and detection systems to detect suspicious attacks. However, how can we identify the propagation source of an attack so as to protect network assets from fast-acting attacks? Moreover, how can we build effective and efficient prevention systems that stop malicious attacks before they do damage and infect a system?
So far, extensive work has been done to develop new approaches that effectively identify the propagation source of malicious attacks and efficiently restrain the
malicious attacks. The goal of this book is to summarize and analyze the state-of-
the-art research and investigations in the field of identifying propagation sources
and preventing malicious propagation, so as to provide an approachable strategy
for researchers and engineers to implement this new framework in real-world
applications. The striking features of the book can be illustrated from three basic
aspects:
• A detailed coverage on analyzing and preventing the propagation of malicious
attacks in complex networks. On the one hand, a practical problem in malicious
attack propagation is the spreading influence of initial spreaders. This book
presents and analyzes different methods for influential spreader detection. On
the other hand, various strategies have been proposed for preventing malicious
attack propagation. This book numerically analyzes these strategies, establishes the equivalences among them, and presents a hybrid strategy that combines different strategies.
• A rich collection of contemporary research results in identifying the propagation
source of malicious attacks. According to the categories of observations on
malicious attacks, current research can be divided into three types. For each
type, we particularly present one representative method and the theory behind
each method. A comprehensive theoretical analysis of current methods is further
presented. Apart from the theoretical analysis, the book numerically analyzes
their pros and cons based on real-world datasets.
• A comprehensive study of critical research issues in identifying the propagation
source of malicious attacks. For each issue, the book presents a brief introduction
to the problem and its challenges, and a detailed state-of-the-art method to solve
the problem.
This book intends to enable readers, especially postgraduate and senior under-
graduate students, to study up-to-date concepts, methods, algorithms, and analytic
skills for building modern detection and prevention systems through analyzing the
propagation of malicious attacks. It enables students not only to master the concepts
and theories in relation to malicious attack propagation and source identification but
also to readily use the material introduced into implementation practices.
The book is divided into three parts: malicious attack propagation, propagation
source identification, and critical research issues in source identification. In the
first part, after an introduction of the preliminaries of malicious attack propagation,
the book presents detailed descriptions on areas of detecting influential spreaders
and restraining the propagation of malicious attacks. In the second part, after a
summary on the techniques involved in propagation source identification under
different categories of observations about malicious attack propagation, the book
then presents a comprehensive study of these techniques and uses real-world
datasets to numerically analyze their pros and cons. In the third part, the book
explores three critical research issues in the research area of propagation source
identification. The most difficult one is the complex spatiotemporal diffusion
process of malicious attacks in time-varying networks, which is the bottleneck
of current approaches. The second issue lies in the expensively computational
complexity of identifying multiple propagation sources. The third important issue
is the huge scale of the underlying networks, which makes it difficult to develop
efficient strategies to quickly and accurately identify propagation sources. These
weaknesses prevent propagation source identification from being applied in a
broader range of real-world applications. This book systematically analyzes the
state of the art in addressing these issues and aims at making propagation source
identification more effective and applicable.
Contents

1 Introduction
   1.1 Malicious Attacks
   1.2 Examples of Malicious Attacks
   1.3 Propagation Mechanism of Malicious Attacks
   1.4 Source Identification of Malicious Attack Propagation
   1.5 Outline and Book Overview
References
Chapter 1
Introduction
With the remarkable advances in computer technologies, our social, financial, and professional existences have become increasingly digitized, and governments, healthcare, and military infrastructures rely ever more on computer technologies. Meanwhile, these systems present larger and more lucrative targets for malicious attacks [81]. A malicious
attack is an attempt to forcefully abuse or take advantage of a computer system or
a network asset. This can be done with the intent of stealing personal information (logins, financial data, even electronic money) or of reducing the functionality of a target computer. According to statistics, worldwide financial losses due to malicious
attacks averaged $12.18 billion per year from 1997 to 2006 [50] and increased to
$110 billion between July 2011 and the end of July 2012 [167].
Typical malicious attacks include viruses, worms, Trojan horses, spyware,
adware, phishing, spamming, rumors, and other types of social engineering. Since
the first computer virus surfaced in the early 1980s, malicious attacks have developed into thousands of variants that differ in infection mechanism, propagation mechanism, destructive payload, and other features [51].
According to the types of cyber threats that contributed to breaches, mali-
cious attacks are divided into two main categories [100]: malware and social
engineering.
• Malware, short for “malicious software”, is a broad term that refers to a variety
of malicious programs designed to compromise computers and devices in several
ways, steal data, bypass access controls, or cause harm to the host computers.
Malware comes in a number of forms, including viruses, worms, Trojans,
spyware, adware, rootkits and more.
• Social Engineering is the art of getting users to compromise information systems, where the attackers use human interaction (i.e., social skills) to obtain or compromise information about an organization or its computer systems. Instead of exploiting technical vulnerabilities, the attackers exploit human ones.
[Figure: infection chain — users receive spam with a malicious attachment (TROJ_UPATRE.VNA); once executed, TROJ_UPATRE.VNA connects to certain websites to download TSPY_ZBOT.VNA; TSPY_ZBOT.VNA exhibits several malicious behaviors, including downloading TROJ_CRILOCK.NS; TROJ_CRILOCK.NS locks certain files and then asks users to purchase a decrypting tool.]
• Code Red leaves very little trace on the hard disk, as it is able to run entirely in memory, with a size of 3,569 bytes. Once it infects a machine, it proceeds to make a hundred copies of itself, but due to a bug in the programming it duplicates even more and ends up consuming many of the system's resources. It then launches a denial-of-service attack on several IP addresses, most famously the website of the White House. It also allows backdoor access to the server, enabling remote access to the machine. It was estimated to have caused $2 billion in lost productivity. A total of 1–2 million servers were affected.
• Conficker is a worm for Windows that made its first appearance in 2008. It infects computers using flaws in the OS to create a botnet. The malware was able to infect more than 9 million computers all around the world, affecting governments, businesses, and individuals. It was one of the largest known worm infections ever to surface, causing an estimated $9 billion in damage. The worm works by exploiting a network-service vulnerability that was present and unpatched in Windows. Once it infects a machine, the worm resets account lockout policies, blocks access to Windows Update and anti-virus sites, turns off certain services, and locks out user accounts, among other actions. It then installs software that turns the computer into a botnet slave, along with scareware to scam money from the user.
• Mydoom, surfacing in 2004, was a worm for Windows that became one of the fastest-spreading email worms since ILOVEYOU. The worm spreads by appearing as an email transmission error that contains an attachment of itself. Once executed, it sends itself to the email addresses in the user's address book and copies itself to any P2P program's folder in order to propagate through that network. The payload is twofold: first, it opens a backdoor to allow remote access; second, it launches a denial-of-service attack on the controversial SCO Group. It was believed that the worm was created to disrupt SCO due to a conflict over ownership of some Linux code. It caused an estimated $38.5 billion in damages, and the worm is still active in some form today.
• Flashback is a piece of Mac malware. The Trojan was first discovered in 2011. To be infected, a user simply needs to have Java enabled. It propagates via compromised websites containing JavaScript code that downloads the payload.
Once installed, the Mac becomes part of a botnet of other infected Macs. More
than 600,000 Macs were infected, including 274 Macs in the Cupertino area, the
headquarters of Apple.
As we can see from the above examples, a malicious attack starts from one or a few hosts and quickly infects many others. For example, the ILOVEYOU worm mailed itself, via Outlook, to all addresses on the host's mailing list. Code Red performed network scanning and propagated to IP addresses connected to the host. In order to fight against malicious attacks, it is important for cyber defenders to understand malicious attack behavior, such as membership recruitment patterns, the size of botnets, and the distribution of bots, and especially the propagation mechanism of malicious attacks.
The diagram shown in Fig. 1.2 illustrates the process of real-world Trojan
malware [54, 103, 105, 169]. Such a process consists of three stages:
• In the first stage, the malware developer creates one or more fake profiles and
infiltrates them into the social network. The purpose of these fake profiles is
to make friends with as many real OSN users as possible. Infiltration has been
shown to be an effective technique for disseminating malicious content in OSNs
such as Facebook [22].
• In the second stage, the malware developer uses social engineering techniques to create eye-catching web links that trick users into clicking on them (see Fig. 1.2, an example of Trojan malware propagation in online social networks [55]). The web links, which are posted on the fake users' walls, lead unsuspecting users to a web
page that contains malicious content. A user simply needs to visit or “drive by”
that web page, and the malicious code can be downloaded in the background and
executed on the user’s computer without his/her knowledge. When security flaws
are absent [198], malware creators resort to social engineering techniques to get
assistance from users to activate the malicious code.
• In the third stage, after a user is infected, the malware also posts the eye-catching web link(s) on the user's wall to "recruit" his/her friends. If a friend clicks on the link(s) and, as a result, unknowingly executes the malware, the friend's computer and profile become infected, and the propagation cycle continues with his/her own friends.
Note that malicious attacks are similar to biological viruses in their self-replicating and propagation behaviors. Thus, the mathematical techniques developed for the study of biological infectious diseases have been adapted to the study of malicious attack propagation. The basic epidemic model, the Susceptible-Infected (SI) model, separates the population into two groups of nodes that change over time:
• A susceptible node is a node that is vulnerable to malicious attack but otherwise
“healthy”. We use S(t) to denote the number of susceptible nodes at time t.
• An infected node is a node that has become infected and may potentially infect other nodes. We use I(t) to denote the number of infected nodes at time t.
In the SI model, the population is assumed to be large and constant, with n nodes. Once a node is infected, it never becomes uninfected. Figure 1.3 presents
the propagation progress of the Code Red worm [141]. The dataset on the Code Red worm was collected by Moore et al. during the whole day of July 19, 2001 [127]. The SI model matches the propagation of the Code Red worm well [206].
Many other models are derivations of this basic SI form. For example, the Susceptible-Infected-Recovered (SIR) model describes propagation in which a node can recover from (or become immune to) the infectious state; once a node recovers, it never becomes susceptible again. The Susceptible-Infected-Susceptible (SIS) model describes propagation in which a recovered node can become susceptible again.
sequence, and concurrency of contacts among nodes [27, 170]. Then, can we
model the way that malicious attack spreads in time-varying networks? Can we
estimate the probability of an arbitrary node being infected by a malicious attack?
How do we detect the propagation source of a malicious attack in time-varying
networks? Can we estimate the infection scale and infection time of the malicious
attack?
• Malicious attacks often emerge from multiple sources. However, current methods mainly focus on identifying a single attack source in networks. A few approaches have been proposed for identifying multiple attack sources, but they all suffer from extremely high computational complexity, which makes them impractical for real-world networks. In this book, we will answer the following
questions corresponding to multi-source identification. How many sources are
there? Where did the diffusion emerge? When did the diffusion start?
• Another critical challenge in this research area is scalability. Current methods generally require scanning the whole underlying network of malicious attack spreading to locate attack sources. However, real-world networks of malicious attack diffusion often have a huge scale and an extremely complex structure, so it is impractical to scan the whole network to locate the attack sources. We develop efficient approaches that identify attack sources by taking the structural features of networks and the diffusion patterns of malicious attacks into account, thereby addressing the scalability issue.
To address the above challenges, this book aims to summarize the new technologies and achieve a breakthrough in source identification of malicious attacks, enabling their effective application in the real world. Based on these challenges,
we divide the book into three main parts.
• Part I: Malicious Attack Propagation
1. Primary knowledge of modeling malicious attack propagation.
2. Spreading influence analysis of network hosts in the propagation of malicious
attacks.
3. Restraining the propagation of malicious attacks.
• Part II: Source Identification of Malicious Attack Propagation
1. Source identification under complete observations: a maximum likelihood
(ML) source estimator.
2. Source identification under snapshots: a sample path based source estimator.
3. Source identification under sensor observations: a Gaussian source estimator.
• Part III: Critical Research Issues in Source Identification
1. Identifying propagation source in time-varying networks.
2. Identifying multiple propagation sources.
3. Identifying propagation source in large-scale networks.
The approaches involved in this book include complex network theory, information
diffusion theory, probability theory, and applied statistics.
Part I
Malicious Attack Propagation
Chapter 2
Preliminary of Modeling Malicious
Attack Propagation
Graphs are usually used to represent networks in different fields such as computer
science, biology, and sociology. A graph G = (V , E) consists of a set of nodes V to
represent objects, and a set of edges E = {eij |i, j ∈ V } to represent relationships.
For example, in a computer network, a node represents a computer or a server, and
an edge stands for the connection between two computers or servers. In a social
network, a node represents a person and an edge represents the friendship between
two people. Mathematically, a graph can also be represented as an adjacency matrix
A, in which each entry aij labels the weight on the edge eij .
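As a quick illustration of this representation, the following sketch builds an unweighted adjacency matrix from an edge list; the graph itself is a hypothetical toy example, not one from the book:

```python
# Toy undirected graph G = (V, E) as an edge list, plus its adjacency matrix A.
V = [0, 1, 2, 3]
E = [(0, 1), (0, 2), (1, 2), (2, 3)]

n = len(V)
A = [[0] * n for _ in range(n)]
for i, j in E:
    A[i][j] = 1  # unweighted: a_ij = 1 if edge e_ij exists
    A[j][i] = 1  # an undirected graph yields a symmetric matrix

assert A[0][1] == A[1][0] == 1
assert A[0][3] == 0  # nodes 0 and 3 are not directly connected
```

For a weighted graph, `a_ij` would instead store the weight on edge `e_ij`, as described above.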
To capture the importance of nodes in a network, many different centrality
measures have been proposed over the years [98]. According to Freeman in 1979
[66]: “There is certainly no unanimity on exactly what centrality is or on its
conceptual foundations, and there is little agreement on the proper procedure for
its measurement.” In this chapter, we introduce some popular centrality measures as
follows.
Degree Given a node i, the degree [66] of node i is the number of edges connected to it. In Fig. 2.1a, the black nodes have higher degree values than the white nodes. A high degree centrality indicates that the node has high influence in the network. For example, high-degree nodes in computer networks often serve as hubs or as major channels of data transmission. Note, however, that degree measures only the local influence of a node, since it counts only the links to nodes directly adjacent to it. The degree D of a node i can be computed as follows:
D(i) = \sum_{j=1}^{n} e_{ij},    (2.1)
where n is the total number of nodes in the network, e_{ij} = 1 if and only if i and j are connected by an edge, and e_{ij} = 0 otherwise.
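Eq. (2.1) can be computed directly from an adjacency matrix; the graph below is an illustrative toy example:

```python
# Degree centrality D(i) = sum_j e_ij, following Eq. (2.1).
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def degree(A, i):
    # Row sum of the adjacency matrix counts the edges incident on node i.
    return sum(A[i])

degrees = [degree(A, i) for i in range(len(A))]
# Node 2 touches three edges, so it has the highest degree.
assert degrees == [2, 2, 3, 1]
```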
Betweenness The betweenness of a node quantifies the number of times the node
acts as a bridge along the shortest path between two other nodes [65]. Nodes with
high betweenness facilitate the flow of information as they form critical bridges
between other nodes or groups of nodes (see Fig. 2.1b). To be precise, suppose that g_i^{(st)} is the number of shortest paths from node s to node t that pass through node i, and suppose that n_{st} is the total number of shortest paths from s to t. Then the betweenness of node i is defined as follows:

B(i) = \frac{\sum_{s<t} g_i^{(st)} / n_{st}}{\frac{1}{2}\, n(n-1)}.    (2.2)
Researchers have found that some nodes which do not have large degrees also play a vital role in information diffusion [72, 110]. As shown in Fig. 2.1b, the degree of node E is smaller than that of nodes A, B, C, and D. However, node E is noticeably more important to information spread, as it is the connector of two large groups. By using betweenness centrality, we can successfully locate such nodes.
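A sketch of computing the normalized shortest-path betweenness of Eq. (2.2), using Brandes-style accumulation of shortest-path counts over BFS; the adjacency-list input is an assumed toy format:

```python
from collections import deque

def betweenness(adj):
    """Normalized shortest-path betweenness, Eq. (2.2), for an
    undirected graph given as {node: [neighbors]}."""
    n = len(adj)
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1    # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order = []
        q = deque([s])
        while q:                                     # BFS from source s
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:           # w lies one step beyond v
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}                # dependency accumulation
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    pairs = n * (n - 1) / 2
    return {v: bc[v] / 2 / pairs for v in adj}       # each pair counted twice

# Path graph 0-1-2: only the pair (0, 2) routes through node 1.
b = betweenness({0: [1], 1: [0, 2], 2: [1]})
assert abs(b[1] - 1 / 3) < 1e-9 and b[0] == b[2] == 0.0
```

The accumulation pass avoids enumerating paths explicitly, which is what makes the measure tractable on larger graphs.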
Closeness The closeness [65, 136] of a node is defined as the average length of
the shortest path between the node and all other reachable nodes. Mathematically,
the closeness centrality C of a node i can be computed as follows [66]:
C(i) = \frac{n-1}{\sum_{j=1}^{n} d(i, j)},    (2.3)
where d(i, j ) denotes the distance of the shortest path from node i to node j .
The closeness of a node can be regarded as a measure of how long a piece of information will take to spread from the node to all the other nodes sequentially [136]. The more central a node is, the lower its total distance to all other nodes and hence the larger its closeness. As shown in Fig. 2.1c, nodes A and B are the closest nodes to all other nodes.
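Eq. (2.3) follows directly from BFS distances; a minimal sketch on an assumed toy path graph:

```python
from collections import deque

def closeness(adj, i):
    """C(i) = (n - 1) / sum_j d(i, j), per Eq. (2.3), with d(i, j) the
    unweighted shortest-path distance obtained by BFS."""
    dist = {i: 0}
    q = deque([i])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return (len(adj) - 1) / sum(dist.values())

# Path graph 0-1-2: the middle node is closest to the others.
adj = {0: [1], 1: [0, 2], 2: [1]}
assert closeness(adj, 1) == 1.0    # distances 1 + 1 = 2; (3 - 1) / 2 = 1.0
assert closeness(adj, 0) == 2 / 3  # distances 1 + 2 = 3; (3 - 1) / 3
```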
Fig. 2.1 Illustration of some centrality measures. (a) Degree. (b) Betweenness. (c) Closeness
A triangle refers to a set of three nodes with three undirected edges among them. The local clustering coefficient of a node i is defined as the fraction of triangles among all the triples of nodes in i's neighborhood, where both the selected triangles and node triples must contain i:

C(i) = \frac{2\,|\{e_{jk} \mid j, k \in N_i,\ e_{jk} \in E\}|}{k_i (k_i - 1)},    (2.6)

where N_i is the set of neighbors of node i and k_i = |N_i| is its degree.
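A sketch of Eq. (2.6), counting the edges among a node's neighbors on an assumed toy graph:

```python
def local_clustering(adj, i):
    """C(i) = 2 * |edges among neighbors of i| / (k_i * (k_i - 1)), Eq. (2.6)."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # no neighbor pairs, so no triangles are possible
    # Count each neighbor pair (a, b) once and check whether they are linked.
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * links / (k * (k - 1))

# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
assert local_clustering(adj, 0) == 1.0    # both neighbors of 0 are connected
assert local_clustering(adj, 2) == 1 / 3  # 1 edge among 3 neighbor pairs
```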
malicious attacks, and analyze how to restrain malicious attacks through blocking
influential nodes. For other centrality measures, readers could refer to [21] for
details.
Fig. 2.3 The Watts-Strogatz model reproduces the small-world phenomenon by rewiring edges in
a regular network according to the randomness parameter p [179]
Fig. 2.4 The connectivities of various large real-world networks have scale-free distributions, (a)
actor collaboration graph, (b) the World Wide Web, and (c) the power grid network [15]
it to a random existing node or its neighbors [97, 101]. Another model, proposed by Newman et al. [135], aimed to build a random graph with an arbitrary degree distribution. The ranking model grows the network according to a ranking of the nodes by any given prestige measure; the probability of linking to a target node can be any power-law function of its rank, resulting in a power-law degree distribution [60].
Fig. 2.5 Illustration of network communities. (a) Non-overlapping communities. (b) Overlapping
communities
module movements inside module i and the probability of exiting i. The first part of the formula describes the entropy of the movement between communities, and the second part sums up the entropy within each community. Eventually, Infomap applies a computational search algorithm to find the best partition as the outcome [153].
Link Clustering Different from Infomap, the Link Clustering algorithm aims at discovering overlapping communities, in which a node is allowed to belong to multiple communities. This algorithm reinterprets communities as groups of links rather than groups of nodes. The set of neighbors of a node i is denoted by N_i. Given a pair of links e_{ij} and e_{jk} that share node j, the similarity between these two links is the Jaccard similarity between the neighbor sets of the two non-shared nodes:
S(e_{ij}, e_{jk}) = \frac{|N_i \cap N_k|}{|N_i \cup N_k|}.    (2.8)
Then a dendrogram is built up according to these similarities using single-linkage
hierarchical clustering and cutting the dendrogram at some level produces the
overlapped community structure. Given a partition P = {P1 , P2 , . . . , PC }, a
partition density D can be computed by the average partition density weighted
by the fraction of present links in each partition:
D = \sum_c \frac{m_c}{M} D_c = \frac{2}{M} \sum_c m_c\, \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)},    (2.9)

where m_c and n_c are the numbers of edges and nodes in the partition P_c, respectively, and M is the total number of links. The cutting threshold in the dendrogram can be determined by
achieving a maximum partition density.
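The link similarity of Eq. (2.8) is a one-liner given neighbor sets; the graph here is an assumed toy example:

```python
def link_similarity(adj, i, j, k):
    """Jaccard similarity, Eq. (2.8), of links e_ij and e_jk sharing node j."""
    Ni, Nk = set(adj[i]), set(adj[k])
    return len(Ni & Nk) / len(Ni | Nk)

# Toy graph: links e_01 and e_13 share node 1.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
# N_0 = {1, 2}, N_3 = {1}: intersection {1}, union {1, 2}.
assert link_similarity(adj, 0, 1, 3) == 0.5
```

Single-linkage clustering over these pairwise similarities then yields the dendrogram described above.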
According to this assumption, we have S(t) = n − I(t), so we can rewrite the SI model as follows:

\frac{dI(t)}{dt} = \beta I(t)\,(n - I(t)).    (2.11)
Fig. 2.6 Illustration of epidemic spreading models. (a) SI model; (b) SIR model; (c) SIS model
The SIR model is governed by:

\frac{dS(t)}{dt} = -\frac{\beta I(t) S(t)}{n},    (2.12)

\frac{dI(t)}{dt} = \frac{\beta I(t) S(t)}{n} - \gamma I(t),    (2.13)

\frac{dR(t)}{dt} = \gamma I(t).    (2.14)

In the SIS model, recovered nodes return to the susceptible state:

\frac{dS(t)}{dt} = -\frac{\beta I(t) S(t)}{n} + \gamma I(t),    (2.15)

\frac{dI(t)}{dt} = \frac{\beta I(t) S(t)}{n} - \gamma I(t),    (2.16)
The numbers of infected and susceptible nodes as functions of t under the SI, SIR, and SIS models are shown in Fig. 2.7a, b, and c, respectively. There are also many other epidemic models, such as SIRS [166], SEIR [196], MSIR [78], and SEIRS [36]. Readers may refer to [190] and [176] for more epidemic models.
Fig. 2.7 The propagation process based on different models. (a) SI model. (b) SIR model. (c) SIS
model
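The dynamics of Eqs. (2.12)–(2.14) can be reproduced numerically with a simple forward-Euler sketch; the parameter values below are illustrative only:

```python
def simulate_sir(n=1000, beta=0.5, gamma=0.1, i0=1, dt=0.01, steps=5000):
    """Forward-Euler integration of the SIR model, Eqs. (2.12)-(2.14)."""
    S, I, R = n - i0, float(i0), 0.0
    for _ in range(steps):
        new_inf = beta * I * S / n * dt  # infection term of Eqs. (2.12)-(2.13)
        new_rec = gamma * I * dt         # recovery term of Eq. (2.14)
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
    return S, I, R

S, I, R = simulate_sir()
assert abs(S + I + R - 1000) < 1e-6  # the population size n is conserved
assert R > 0                         # some nodes have recovered by t = 50
```

The SIS variant of Eqs. (2.15)–(2.16) is obtained by adding `new_rec` back to `S` instead of accumulating it in `R`; dropping the recovery term entirely recovers the SI model of Eq. (2.11) up to the 1/n scaling.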
Chapter 3
User Influence in the Propagation
of Malicious Attacks
Networks portray a multitude of interactions through which people meet, ideas are
spread, and infectious diseases and malicious rumors propagate within a society.
Recently, researchers have found that unsolicited malicious attacks spread extremely
fast through influential spreaders [42]. For example, on April 23, 2013, the Twitter account of the Associated Press was hacked to spread the rumor that explosions at the White House had injured President Obama. This led both the Dow Jones Industrial Average and the Standard & Poor's 500 Index to plunge about 1% before regaining their losses [143]. Hence, identifying the most efficient 'spreaders' in a network becomes an important step towards restraining the spread of malicious attacks. In this chapter, we
investigate the methods of measuring influence of network nodes.
3.1 Introduction
The propagation of malicious attacks has long been a critical problem in various
forms of networks. For example, rumors spread incredibly fast in online social
networks [42]. Computer viruses spread throughout the Internet and compromise
millions of computers [177]. In smart grids, isolated failures lead to rolling blackouts in cities [194]. Influential users can initiate and conduct the dissemination
of information more efficiently than normal users. Therefore, influential users in
networks are normally more responsible for large cascades of malicious attacks.
Researchers have developed many methods to expose influential users in networks. The simplest measure is the degree of a node, which counts the number of edges incident on the node [154]. Generally, large-degree nodes correspond to popular users in social networks. The eigenvector centrality [20] is an extension of the degree measure. Unlike degree, which weights every neighboring node equally, the eigenvector centrality weights the neighboring nodes according to their importance. The Katz centrality [92] is another extension of the degree measure.
The node degree stands for the number of direct neighbors, while the Katz centrality
counts the number of all reachable nodes and the contributions of distant nodes
are penalized. A more sophisticated centrality measure is closeness [66], which
is the mean geodesic (i.e., shortest-path) distance from the node of interest to all
other reachable nodes. The closeness measures the efficiency of a node distributing
information to any node in networks.
Another important class of centrality measures comprises the betweenness measures. In
1977, Freeman [64] proposed the shortest-path betweenness which is defined as the
fraction of shortest paths between node pairs in a network that pass through the node
of interest. The shortest-path betweenness is the simplest and most widely used
betweenness measure, which is usually regarded as a measure of influence a user
possesses over information spreading between any pair of users. However, in most
networks, information does not spread only along the geodesic paths. To address
this problem, in 1991, Freeman et al. [67] proposed a more complex betweenness
measure, usually known as the flow betweenness. The flow betweenness is based on
the idea of maximum flow, which is defined as the number of flow units through
the node of interest when the maximum flow is transmitted between node pairs.
For these two betweenness measures, the information needs to “know” the ideal
route (shortest or maximum-flow path) from one node to another. However, the
ideal routes between node pairs are normally unknown during the transmission,
and the information wanders around randomly in the network until it reaches the
destination. Accordingly, in 2005, Newman [132] proposed a new betweenness
measure based on random walks. The random-walk betweenness counts how often
a node is traversed by a random walk between node pairs.
In this chapter, we study betweenness measures and their application to capturing influential nodes in epidemics, in which malicious information is disseminated from one node to multiple neighboring nodes and destinations. The traditional betweenness measures have proved of great value in the analysis of node influence in complex networks [34, 136, 187]. However, these measures focus on the dissemination of information from one node to another, rather than from one node to many. This is conceptually not suitable for real-world malicious attacks, in which the malicious information propagates through multiple paths to all reachable nodes. Wen et al. [182] proposed a betweenness measure, epidemic betweenness, to measure the influence of nodes in epidemics.
As epidemic incidents may start from any node in a network, the node of interest may be the epidemic source or an intermediary forwarding received information to its neighboring nodes. As for the influence of a node, we normally consider a
node to be influential to the epidemics if this node can influence a large number
of following nodes after it becomes influenced by the epidemics. Formally, the
epidemic betweenness of an arbitrary node i, bEP (i), is the expected number of
nodes that are influenced directly or indirectly by node i after i becomes influenced
by epidemics. The value bEP (i) is averaged by the epidemic incidents that start
from all possible sources in the network. Hence, the epidemic betweenness reflects
the potential influence of a node to any epidemic in a complex network.
3.2 Problem Statement
The traditional betweenness measures (i.e., shortest-path [64], flow [67] and
random-walk [132] betweenness) have long been employed to locate the influential
nodes in complex networks. However, these measures are conceptually not suitable
for the epidemics in which information spreads from one node to multiple receivers
rather than the transmission from one to another. In this section, we discuss the
difference between epidemic betweenness and traditional betweenness measures in
estimating the influence of network nodes.
Epidemic Betweenness vs. Shortest-Path Betweenness The shortest-path
betweenness centrality is defined as the fraction of the geodesic (i.e., shortest)
paths between node pairs that pass through the node of interest in a network. To be precise, suppose that g_i^{(st)} is the number of geodesic paths from node s to t that pass through node i, and g^{(st)} is the total number of geodesic paths from s to t. Then the shortest-path betweenness centrality of node i is

b_{SP}(i) = \frac{\sum_{s<t} g_i^{(st)} / g^{(st)}}{\frac{1}{2}\, n(n-1)},    (3.1)
where n is the total number of nodes in the network. The shortest-path betweenness
stands for the ability of a node relaying information between an arbitrary pair of
nodes in the network under the consideration of always choosing the shortest paths.
In epidemics, geodesic paths of the same length between a pair of nodes may
carry different propagation probabilities. The nodes on these paths will have the
same shortest-path betweenness. However, because information propagates with
different probabilities along these paths, these nodes may have different epidemic
influences. Therefore, the shortest-path betweenness is not suitable for exposing
the influence of nodes in epidemics.
We introduce a simple example to explain the problem. As shown in Fig. 3.1(I),
two large groups are bridged by connections among just a few nodes. The edge
weights on the path “A − C1 − B” are 0.1 and 0.9, while on the path “A − C2 − B”
they are 0.5 and 0.5. All shortest paths between the two groups must pass through
C1 or C2. As the total weights on “A − C1 − B” and “A − C2 − B” both equal 1,
nodes C1 and C2 get the same shortest-path betweenness values in this case.
However, the influential scales of nodes C1 and C2 will be different. When an
epidemic spreads from group 1 to group 2, or in the reverse direction, the
probability of it traversing the path “A − C1 − B” is 0.1 × 0.9 = 0.09, while the
probability of traversing the alternative path “A − C2 − B” is 0.5 × 0.5 = 0.25,
which is much higher than the former. Therefore, node C2 has larger influence in
epidemics than C1. This example explains why the shortest-path betweenness
cannot reflect the influence of nodes C1 and C2 in the network.
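The comparison above is easy to check numerically. The following minimal sketch (a hypothetical toy, not taken from the chapter's experiments) treats each bridging path of Fig. 3.1(I) as a list of edge propagation probabilities and multiplies them:

```python
# Edge propagation probabilities of the two bridging paths in Fig. 3.1(I).
path_via_c1 = [0.1, 0.9]   # A -> C1 -> B
path_via_c2 = [0.5, 0.5]   # A -> C2 -> B

def end_to_end_probability(edge_probs):
    """Probability that the epidemic traverses every edge of the path."""
    p = 1.0
    for q in edge_probs:
        p *= q
    return p

# Both paths have the same hop count, so C1 and C2 receive identical
# shortest-path betweenness -- yet the epidemic is far more likely to use C2.
print(end_to_end_probability(path_via_c1))   # ~0.09
print(end_to_end_probability(path_via_c2))   # 0.25
```

The same hop count yields the same shortest-path betweenness, while the end-to-end products (0.09 versus 0.25) capture the difference in epidemic influence.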
node C can influence a large number of nodes in group 2 after node C is infected
by A. The same occurs when the epidemic starts from group 2. This example
explains why the flow betweenness measure cannot reflect the influence of node C
in the network.
Epidemic Betweenness vs. Random-Walk Betweenness The random-walk
betweenness of an arbitrary node i counts how often node i is traversed by a
random walk between a node pair (s, t), averaged over all s and t, as in

$$b_{RW}(i) = \frac{\sum_{s<t} I_i^{(st)}}{(1/2)\,n(n-1)}, \qquad (3.3)$$

where $I_i^{(st)}$ is the net flow of a random walk from s to t through i. This
measure is appropriate to a network in which information wanders about essentially
at random until it finds its target.
A node of interest in a network will have a nonzero random-walk betweenness
value if it lies on possible paths between node pairs. However, if the neighbors
of this node always receive the information earlier, this node will not contribute
to the epidemics, as its neighbors have already been influenced. Therefore, the
random-walk betweenness fails to reflect the influence of some nodes in epidemics.
Consider the network sketched in Fig. 3.1(III), which again has two large groups
joined by a few connections. In this case, since node C is one of the nodes
connecting the two groups, it will get a relatively high random-walk betweenness
value. However, the influence of node C will be very low in epidemics. Suppose
an epidemic starts from group 1 and arrives at node A; then nodes B and C will
be influenced by A simultaneously. Node B will then influence the nodes in group
2, while node C cannot continue the epidemic, since its neighbors (nodes A and B)
have already been influenced. As a result, node C has no influence in epidemics. The
same occurs when the epidemic starts from group 2. This example explains why
the random-walk betweenness measure cannot reflect the influence of node C in
the network.
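The example above can be mimicked with a deterministic spreading toy: infect nodes layer by layer from the source, then count, for each node, how many nodes it can still help infect through strictly later layers. The graph below is a hypothetical miniature of Fig. 3.1(III), not the chapter's actual topology:

```python
from collections import deque

def bfs_layers(adj, source):
    """Infection time of each node under deterministic, simultaneous spreading."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def downstream_influence(adj, dist, node):
    """Nodes that `node` can still help infect: reachable via edges that
    always step to a strictly later infection layer."""
    seen, q = set(), deque([node])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist.get(v, -1) == dist[u] + 1 and v not in seen:
                seen.add(v)
                q.append(v)
    return seen

# A bridges to both B and C; only B carries the epidemic onwards.
adj = {
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B"],
    "D": ["B"],
    "E": ["B"],
}
dist = bfs_layers(adj, "A")
print(sorted(downstream_influence(adj, dist, "B")))  # ['D', 'E']
print(downstream_influence(adj, dist, "C"))          # set()
```

Node C lies on paths between the groups (so it collects random-walk betweenness), yet all its neighbors are infected no later than C itself, so its downstream influence is empty.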
In the real world, an arbitrary user can receive information and forward it to the
topological neighbors. Let random variable Xi (t) represent the state of user i at
discrete time t. Borrowing a concept from epidemiology, the values of $X_i(t)$ can be
represented as follows:

$$X_i(t) = \begin{cases} Sus., & \text{susceptible} \\ Con., & \text{contagious} \\ Inf., & \text{infected} \\ Dor., & \text{dormant} \end{cases} \qquad (3.4)$$

Then, the number of susceptible nodes at time t, S(t), is

$$S(t) = \sum_{i=1}^{n} P(X_i(t) = Sus.), \qquad (3.5)$$
where P(·) denotes the probability of a variable. Then, the number of infected nodes
at time t, I(t), can be derived analogously as

$$I(t) = \sum_{i=1}^{n} P(X_i(t) = Inf.). \qquad (3.6)$$
As shown in Fig. 3.2, v(i, t) denotes the probability of user i becoming contagious.
Then, the value of $P(X_i(t) = Sus.)$ can be iterated using a discrete difference
equation as in

$$P(X_i(t) = Sus.) = [1 - v(i, t)] \cdot P(X_i(t - 1) = Sus.). \qquad (3.7)$$
R(i, t) denotes the probability of user i not receiving or accepting the information.
Since the information comes from topological neighbors, the value of R(i, t) can be
derived by assuming all the neighbors cannot successfully forward the information
to user i. Then, according to the multiplication principle, we have

$$R(i, t) = \prod_{j \in N_i} \left[ 1 - \eta_{ji} \cdot P(X_j(t - 1) = Con.) \right], \qquad (3.8)$$
where $N_i$ denotes the set of user i's neighbors. Following the definition of R(i, t),
Wen et al. [182] derived the value of v(i, t) as

$$v(i, t) = 1 - R(i, t). \qquad (3.9)$$
According to the state transition graph in Fig. 3.2, the value of $P(X_i(t) = Con.)$
can be derived as

$$P(X_i(t) = Con.) = v(i, t) \cdot P(X_i(t - 1) = Sus.). \qquad (3.10)$$
Note that the length of each time tick relies on the real environment. It can be 1 min,
1 h or 1 day.
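The iteration described by Eqs. (3.7) and (3.8) can be sketched directly in code. The version below assumes, per the model description, that $v(i,t) = 1 - R(i,t)$ and that a node is contagious for exactly one tick after leaving the susceptible state (forwarding only once); the function name and matrix layout are our own, not the authors' implementation:

```python
def forward_pass(eta, source, T):
    """Iterate the susceptible/contagious probabilities for T ticks.

    eta[j][i] -- probability that a contagious node j infects neighbour i
                 (0.0 where there is no link).
    Returns (p_sus, p_con) with p_con[t][i] = P(X_i(t) = Con.).
    """
    n = len(eta)
    p_sus = [[1.0] * n]
    p_con = [[0.0] * n]
    p_sus[0][source] = 0.0
    p_con[0][source] = 1.0
    for t in range(1, T + 1):
        sus_prev, con_prev = p_sus[-1], p_con[-1]
        sus_t, con_t = [0.0] * n, [0.0] * n
        for i in range(n):
            # Eq. (3.8): probability that no contagious neighbour reaches i.
            R = 1.0
            for j in range(n):
                R *= 1.0 - eta[j][i] * con_prev[j]
            v = 1.0 - R                         # assumed form of Eq. (3.9)
            sus_t[i] = (1.0 - v) * sus_prev[i]  # Eq. (3.7)
            con_t[i] = v * sus_prev[i]          # assumed form of Eq. (3.10)
        p_sus.append(sus_t)
        p_con.append(con_t)
    return p_sus, p_con
```

On a three-node chain with η = 0.5 on each link and source 0, the middle node is contagious with probability 0.5 at t = 1, and the far node with probability 0.25 at t = 2.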
Given a network and an epidemic incident starting at node s, Wen et al. [182]
introduced the influence of an arbitrary node i on this epidemic incident, $A_{i|s}$, as
the expected number of following nodes which can be infected by node i after
i gets infected. Node i may get infected at any time in the epidemic propagation
dynamics. Therefore, $A^t_{i|s}$ denotes the influence of node i if i gets infected at time t.
Based on the mathematical model, Wen et al. [182] estimated the overall influence
of node i in the epidemic incident, $A_{i|s}$, as

$$E(A_{i|s}) = \sum_{t=0}^{\infty} P(X_i(t) = Con.) \cdot E(A^t_{i|s}). \qquad (3.11)$$
Fig. 3.3 Example of the calculation of Eqs. (3.12) and (3.13). In this example, we have already
obtained the influence of node j, $E(A^{t+1}_{j|s})$, in this epidemic incident originating from node s. As
both nodes h and i can infect node j, we need to calculate the contribution of nodes h and i to the
infection of node j. The contribution from node h or i to node j is determined by the ratio of their
infection probabilities, $\delta^t_{ij}$ and $\delta^t_{hj}$. The influence of node i will be the proportional part of node j's influence
In epidemics, nodes receive and send information to their neighbors. Therefore, the
influence of node i at time t, $A^t_{i|s}$, can be derived from the influence of its neighbors
at time t + 1. Then $A^t_{i|s}$ can be computed as in

$$E(A^t_{i|s}) = \sum_{j \in N_i} \delta^t_{ij} \left[ E(A^{t+1}_{j|s}) + P(X_j(t + 1) = Con.) \right], \qquad (3.12)$$
where $\delta^t_{ij}$ denotes the ratio of node i's contribution to the infection of node j at time
t among all the neighboring nodes of node j, and

$$\delta^t_{ij} = \frac{\eta_{ij} \cdot P(X_i(t) = Con.)}{\sum_{h \in N_j} \eta_{hj} \cdot P(X_h(t) = Con.)}. \qquad (3.13)$$
The analysis in Sect. 3.3.2 fixes the position of starting nodes in epidemics. In
fact, epidemics may start from any node in the network. Therefore, the epidemic
influence of an arbitrary node i regardless of the starting node s should be averaged
over all the possible positions of the starting nodes in the network. As the epidemic
betweenness stands for the epidemic influence regardless of the starting nodes, Wen
et al. [182] computed the epidemic betweenness of an arbitrary node i, bEP (i), as in

$$b_{EP}(i) = \frac{1}{n} \sum_{s=1}^{n} E(A_{i|s}), \qquad (3.14)$$

where $E(A_{i|s})$ is calculated from (3.11). Note from (3.14) that bEP (i) relies only on
the structure of the topology, regardless of any particular starting node s. Because
$P(X_i(t) = Con.)$ can be rapidly derived by iterations, the epidemic betweenness of
each node bEP (i) can be calculated efficiently.
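Putting the pieces together, the computation can be sketched as a forward probability pass followed by the backward recursion of Eq. (3.12), averaged over sources as in Eq. (3.14). The normalized-contribution form of $\delta^t_{ij}$ and the one-contagious-tick assumption below are our own reading of the model, not the authors' code:

```python
def epidemic_betweenness(eta, T):
    """Sketch of b_EP via a forward probability pass and a backward
    influence pass, truncating the horizon at T ticks."""
    n = len(eta)
    b = [0.0] * n
    for s in range(n):                 # average over all sources, Eq. (3.14)
        # -- forward pass: P(X_i(t) = Con.), one contagious tick per node --
        sus = [1.0] * n; sus[s] = 0.0
        con = [0.0] * n; con[s] = 1.0
        p_con = [con[:]]
        for _ in range(T):
            new_sus, new_con = [0.0] * n, [0.0] * n
            for i in range(n):
                R = 1.0
                for j in range(n):
                    R *= 1.0 - eta[j][i] * con[j]
                v = 1.0 - R
                new_sus[i] = (1.0 - v) * sus[i]
                new_con[i] = v * sus[i]
            sus, con = new_sus, new_con
            p_con.append(con[:])
        # -- backward pass: E(A^t_{i|s}), Eq. (3.12), from the horizon down --
        A = [0.0] * n                  # E(A^T_{i|s}) = 0 at the horizon
        for t in range(T - 1, -1, -1):
            A_t = [0.0] * n
            for i in range(n):
                for j in range(n):
                    if eta[i][j] == 0.0:
                        continue
                    # assumed normalized-contribution ratio (cf. Fig. 3.3)
                    denom = sum(eta[h][j] * p_con[t][h] for h in range(n))
                    if denom > 0.0:
                        delta = eta[i][j] * p_con[t][i] / denom
                        A_t[i] += delta * (A[j] + p_con[t + 1][j])
            A = A_t
            # accumulate Eq. (3.11) into the average of Eq. (3.14)
            for i in range(n):
                b[i] += p_con[t][i] * A[i] / n
    return b
```

On a symmetric three-node chain, the two end nodes obtain equal values by symmetry, and the middle node obtains a positive influence, as expected.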
In this subsection, a few simple examples are used to illustrate the calculation of
different betweenness measures. The examples are based on the graphs sketched in
Fig. 3.1, with the two groups each consisting of a complete graph of five nodes.
The details of the examples are shown in Fig. 3.4. Note that the unweighted links in
Fig. 3.4 have weight 1 by default. The results of different betweenness measures are
listed in Table 3.1.
Firstly, in Fig. 3.4(I), nodes C1 and C2 have the same shortest-path betweenness
values (bSP (C1) = bSP (C2) = 0.1524), but their epidemic betweenness results
are different (bEP (C1) ≠ bEP (C2)). This confirms the previous analysis that node
C2 has higher influence than node C1. Secondly, in Fig. 3.4(II), node C has a very
low flow betweenness value (bF L (C) = 0.0035), but this node obtains a relatively
high epidemic betweenness result (bEP (C) = 0.0535). This confirms the previous
analysis that node C possesses large epidemic influence. Finally, in Fig. 3.4(III),
node C has a large random-walk betweenness value (bRW (C) = 0.0829), but this
node contributes less to the epidemics (bEP (C) = 0.0624). This result
confirms the previous analysis that node C has low influence.
In Sect. 3.3.3, note that the computation of the epidemic betweenness consists of
two parts: (1) presenting the propagation dynamics and (2) computing the influence
of each node in reverse.
The first part is mainly concerned with Eqs. (3.7), (3.8) and (3.10). At each
time tick t, we need to update the probabilities in Eqs. (3.7) and (3.10) for node
i when node i becomes contagious at time t − 1. Therefore, in the worst case, the
computation of these two equations for all nodes is O(n). In Eq. (3.8), at time t, we
need |Ni | multiplications to calculate the probability R(i, t) for each node i. The
probability R(i, t) will be updated when node i is linked to a contagious neighbor.
Therefore, the average computation of Eq. (3.8) becomes the product of the mean
of degree and the number of contagious nodes at time t.
In the following, we show the details of how to calculate the number of
contagious nodes in the worst case at time t. For convenience, we use c1 to denote
the mean degree of a node, i.e.,

$$c_1 = \langle k \rangle. \qquad (3.15)$$
In the worst case, a source node can infect every reachable node at time t, i.e., the
nodes within distance t from the source node can be contagious at time t. Then, the
number of contagious nodes becomes

$$Q_t = \sum_{j=1}^{t} c_j = \frac{c_2^t - c_1^t}{c_2 - c_1} \cdot \frac{1}{c_1^{t-2}}. \qquad (3.18)$$
Suppose the propagation ends at time T; then the first part of the computation of the
epidemic betweenness becomes

$$\sum_{t=1}^{T} \left[ O(n) + Q_t \cdot \langle k \rangle \right] = O(nT) + \langle k \rangle \cdot \frac{c_2 Q_T - c_1^2 T}{c_2 - c_1}, \qquad (3.20)$$

where $Q_T \le n$. In most real networks, the average degree of nodes, $\langle k \rangle$, is small
[40]. Therefore, the first part of the computation of the epidemic betweenness can be
rewritten as

$$O(nT). \qquad (3.21)$$
Given a finite network, the propagation ends within a limited number of steps T. In
the real world, because the structure of networks usually has small-world and
scale-free features, information propagation throughout real networks is very fast
[207]. Therefore, the value of T is usually small (T ≪ n).
The second part of the computation is mainly concerned with Eqs. (3.11), (3.12)
and (3.13). Given a propagation source s, the computation of Eq. (3.11) for all
nodes is O(nT). To calculate Eq. (3.13), we need an average computation of $\langle k \rangle$.
Therefore, the average computation of $E(A^t_{i|s})$ in Eq. (3.12) is $\langle k \rangle^2$. Besides,
similar to Eq. (3.8), the number of contagious nodes needs to be determined at time
t. Therefore, the computation of Eq. (3.12) at time t for all nodes becomes
$$\langle k \rangle^2 \cdot \sum_{j=1}^{t} c_j. \qquad (3.22)$$
By combining the first and second parts, we have the overall computation of the
epidemic betweenness as in

$$O(n^2 T). \qquad (3.24)$$
3.4 Evaluations
A series of experiments were carried out to evaluate the accuracy of the epidemic
betweenness. The experiments were conducted on both synthetic networks and real-
world networks. The synthetic networks include the Erdős-Rényi (ER) network [52],
the scale-free network [136] and the small-world network [179]. They were generated
with the widely used open-source software Pajek [172], with 1000 nodes and an
average degree of 2. The real networks are the Enron Email network [88], the
protein-protein interaction (PPI) network [82] and the U.S. Power-Grid network
[15]. The attributes of these real-world networks are presented in the Appendix. In the
experiments, we draw the infection probabilities, ηij, from a Gaussian distribution.
Typically, the average infection probability, E(ηij), is set to 0.6. The simulation
results are obtained from 1000 runs of experiments. Each run of the simulation stops
when there are no new contagious nodes in the network.
Fig. 3.5 The scatter plot of the difference between the influence and epidemic betweenness of
nodes. The dots indicate their difference. The dash-line pairs indicate 10% away from the averaged
influence. (a) ER; (b) Scale free; (c) Small world
Fig. 3.6 The scatter plot of the difference between the influence and epidemic betweenness of
nodes. The dots indicate their difference. The dash-line pairs indicate 10% away from the averaged
influence. (a) Enron Email; (b) PPI; (c) Power Grid
this node, averaged over 1000 runs of experiments. The results are considered
as the benchmark to evaluate the accuracy of the epidemic betweenness. We expect
the epidemic betweenness to be close to the simulation result for each node. To
be precise, we use b(i) to represent the influence of node i from simulations, and
use b̃ to denote the overall average of b(i). We consider the epidemic betweenness
to be very close to the simulation result when the error between them satisfies
|b(i) − bEP (i)|/b̃ < 10%.
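The simulation benchmark can be sketched as follows: each run grows an infection tree from a random source, and a node's influence in that run is the number of its descendants in the tree. This is a hypothetical reconstruction of such a benchmark, with our own function names and a deterministic toy graph:

```python
import random

def simulate_influence(adj, p, runs, rng):
    """Average number of nodes each node infects (directly or indirectly)
    per cascade, over `runs` cascades from uniformly random sources."""
    n = len(adj)
    total = [0.0] * n
    for _ in range(runs):
        source = rng.randrange(n)
        parent = {source: None}
        frontier = [source]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in parent and rng.random() < p:
                        parent[v] = u
                        nxt.append(v)
            frontier = nxt
        # credit every ancestor on the infection tree for each infection
        for v, u in parent.items():
            while u is not None:
                total[u] += 1
                u = parent[u]
    return [x / runs for x in total]

rng = random.Random(42)
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
influence = simulate_influence(star, 1.0, 200, rng)
# With p = 1 every cascade infects everyone; the hub is far more influential.
print(influence[0] > max(influence[1:]))
```

With probabilistic edges (e.g. p drawn per edge from the Gaussian used in the experiments) the same routine yields the per-node averages that serve as b(i).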
The experiment results on the synthetic networks are shown in Fig. 3.5. We
introduce a pair of red dashed lines in each subplot to indicate the boundaries of
10% × b̃. It can be seen that the majority of the results fall within the boundaries
in each synthetic network. This indicates that the epidemic betweenness measure
can accurately reflect the influence of nodes in synthetic networks. The results on
the real networks are shown in Fig. 3.6. In this case, we also introduce a pair of
red dashed lines to show the boundaries. Similarly, nodes seldom fall outside the
boundaries in each real network, which indicates that the epidemic betweenness
measure can accurately describe the influence of nodes in real networks.
Fig. 3.7 The intersection percentage of the influential nodes identified from simulations and the
influential nodes captured by different measures (shortest path, flow, random walk, degree,
closeness, eigenvector, Katz and epidemic), as a function of the sampling ratio λ. (a) ER; (b) Scale
free; (c) Small world
Fig. 3.8 The intersection percentage of the influential nodes identified from simulations and the
influential nodes captured by different measures (shortest path, flow, random walk, degree,
closeness, eigenvector, Katz and epidemic), as a function of the sampling ratio λ. (a) Enron Email;
(b) PPI; (c) Power Grid
Fig. 3.9 Scatter plots of the epidemic betweenness of nodes in the Enron Email network, against
the traditional betweenness measures: (a) shortest-path betweenness; (b) flow betweenness; (c)
random-walk betweenness. The dotted lines indicate the best linear fits in each case
Fig. 3.10 Scatter plots of the epidemic betweenness of nodes in the PPI network, against the
traditional betweenness: (a) shortest-path betweenness; (b) flow betweenness; (c) random-walk
betweenness. The dotted lines indicate the best linear fits in each case
Fig. 3.11 Scatter plots of the epidemic betweenness of nodes in the Power Grid network, against
the traditional betweenness: (a) shortest-path betweenness; (b) flow betweenness; (c) random-walk
betweenness. The dotted lines indicate the best linear fits in each case
correlated with the random-walk and flow betweenness. Similar to the results on the
Enron Email network, some nodes that have high (low) epidemic betweenness also
correspond to relatively low (high) traditional betweenness.
In general, some, but not all, of the high traditional betweenness nodes also pos-
sess high epidemic betweenness. Since information may not always spread through
the shortest paths, some high shortest-path betweenness nodes will have relatively
low epidemic betweenness. Similarly, information may not always spread through
the maximum-flow paths. Therefore, some high flow-betweenness nodes show low
epidemic betweenness. Furthermore, information spreads randomly in networks but
also complies with propagation probabilities between nodes. Therefore, random-
walk betweenness also cannot accurately describe the influence of nodes. Epidemic
betweenness considers not only the propagation probabilities between nodes but
also the influential scale of a node when the propagation initiated from an arbitrary
node. Thus, it can accurately describe the influence of a node.
We have observed in the experiment that the shortest-path betweenness is
highly correlated with the epidemic betweenness. Although these two measures
are different, we can prove that the shortest-path betweenness is equivalent to the
Fig. 3.12 Scatter plots of the epidemic betweenness against other centralities in Enron Email. (a)
Degree; (b) closeness; (c) eigenvector; (d) Katz
Fig. 3.13 Scatter plots of the epidemic betweenness against other centralities in PPI. (a) Degree;
(b) closeness; (c) eigenvector; (d) Katz
Fig. 3.14 Scatter plots of the epidemic betweenness against other centralities in Power Grid. (a)
Degree; (b) closeness; (c) eigenvector; (d) Katz
The techniques of exposing the influential nodes can be divided into two distinct
classes: (1) How can we identify a set of k starting nodes, so that once they are
influenced, they will infect the largest number of susceptible nodes in the network?
(2) How can we identify a set of k nodes, so that when they are immunized to the
epidemic, they will have the largest impact in preventing the susceptible nodes from
being infected? We explain these two classes in the following.
For the methods in class (1), the early work came from P. Domingos et al. [43],
who used a Markov random field model to compute the network ‘value’ of each
node. D. Kempe et al. followed the work in [43] and pointed out that maximizing
the influence of nodes is NP-hard [95]. Accordingly, they provided
approximation algorithms for maximizing the influence of nodes with provable
performance guarantees. As the algorithm proposed in [95] is computationally
intensive, several approaches were proposed to address its scalability, such as the
work in [30, 121, 175]. W. Chen et al. proposed a ‘degree discount’ heuristic
to improve the original greedy algorithm [30]. M. Mathioudakis et al. proposed
a pre-processing method to accelerate the process of exposing influential nodes
in networks without compromising accuracy [121]. Y. Wang et al. considered
the identification of influential nodes in mobile networks [175]. They proposed a
two-step method where, in the first step, social communities are detected and in the
3.7 Summary
Restraining the propagation of malicious attacks in complex networks has long been
an important but difficult problem to be addressed. In this chapter, we particularly
use rumor propagation as an example to analyze the methods of restraining
malicious attack propagation. There are mainly two types of methods: (1) blocking
rumors at the most influential users or community bridges, and (2) spreading truths
to clarify the rumors. We first compare all the measures of locating influential
users. The results suggest that the degree and betweenness measures outperform
all the others in real-world networks. Secondly, we analyze the truth clarification
method and find that it has good long-term performance, while the degree measure
performs well only in the early stage. Thirdly, in order to leverage these two
methods, we further explore the strategy of different methods working together and
their equivalence. Given a fixed budget in the real world, our analysis provides a
potential solution for finding a better strategy by integrating both kinds of methods.
4.1 Introduction
The popularity of online social networks (OSNs) such as Facebook [171], Google
Plus [74] and Twitter [102] has greatly increased in recent years. OSNs have
become important platforms for the dissemination of news, ideas, opinions, etc.
Unfortunately, OSNs are a double-edged sword. The openness of OSN platforms also
enables rumors, gossip and other forms of disinformation to spread all around
the Internet. In the real world, rumors have caused great damage to our society.
For example, the rumor “Two explosions in White House and Obama is injured”,
posted on April 23, 2013, led to 10 billion USD in losses before the rumor was
clarified [143].
Currently, there are mainly two kinds of strategies used for restraining rumors in
OSNs, including blocking rumors at important users [41, 49, 83, 96, 122, 129, 190,
193, 206] and clarifying rumors by spreading truths [25, 69, 71, 107, 165, 173]. We
can further categorize the first strategy into two groups according to their measures
in identifying the most important users: the most influential users [34, 70, 80, 96,
110, 159, 185] and the community bridges [31, 33, 106, 137–139, 174].
Every kind of strategy has pros and cons. Each method claims better
performance than the others according to its own considerations and environments.
However, one must stand out from the rest. Because there does not exist a universal
standard to evaluate them all together, the question of which method is the best has
long been important but difficult to answer. Accordingly, previous work mainly
focused on ‘vertical’ comparison (methods inside their own category), such as the
work in [96, 110], but not on ‘horizontal’ comparison (methods from different
categories). All these methods are proposed to restrain the spread of rumors in
OSNs.
To numerically evaluate different methods, we introduce a mathematical model
to represent the spread of rumors and truths. This is a discrete model, so as to easily
locate the most important nodes in the modeling. We can thus implement different
strategies on this mathematical platform in order to evaluate their impacts on the
spread of rumors and truths. Through a series of empirical and theoretical analyses
using real OSNs, we are able to disclose the answer to this unsolved question.
In the real world, blocking rumors at important users may incur criticism, since
it risks violating human rights. On the other hand, the probability of people
believing the truths varies according to many social factors. Therefore, it is very
important to find the optimal strategy for restraining rumors, which possibly
should integrate both strategies. The discussion on which method is the best is a
small but important step towards this solution. Thus, we are further motivated to
explore the numerical relation and equivalence between different methods. Wen et
al. [184] systematically analyzed different strategies for restraining rumors.
The most common and popular method is to monitor a group of influential users and
block their outward communication when rumors are detected on them. According
to the way they choose the influential users, we categorize current methods into
three types: degree, betweenness and core.
Degree The most direct and intuitive methods are to control the popular OSN
users. In social graphs, these users correspond to the nodes with large degrees.
The theoretical basis of these methods is the scale-free, power-law property of
many networks: a few highly connected nodes play a vital role in maintaining the
network’s connectivity [136, 147]. We illustrate this method in Fig. 4.2a. We can
see that when enough popular users are controlled in OSNs, the spread of rumors
will be limited to a small branch of the whole topology.
Betweenness Researchers have found that some nodes which do not have large
degrees also play a vital role in the dissemination of social information. As shown
in Fig. 4.2b, the degree of node E is smaller than that of nodes A, B, C and D.
However, node E is noticeably more important to the spread of rumors as it is the
connector of two large groups of users. To locate this kind of node in OSNs,
scientists introduced the measure of betweenness, which stands for the number of
shortest paths passing through a given node [67]. We can also find some other
variants of betweenness, such as the random-walk betweenness [132]. The work in
[34, 70, 80, 110, 185] argued that controlling the nodes with higher betweenness
values is more efficient than controlling those with higher degrees.
Fig. 4.2 Restraining the rumors by controlling the influential nodes. (a): the influential nodes are
those of large degree; (b): the influential nodes are those of large betweenness; (c): the influential
nodes are those in the innermost core
Core In this case, the network topologies are decomposed using the k-shell
analysis. Some researchers have found that the most efficient rumor spreaders
are those located within the core of the OSNs as identified by the decomposition
analysis [96, 159]. We illustrate this viewpoint in Fig. 4.2c. We can see that
the nodes in the innermost component of the network may possibly have
smaller degrees, but they contribute to the kernel of the network and build the
connectivity between the outside components. Thus, the nodes in the core are
more crucial for restraining the rumors in OSNs.
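The k-shell decomposition these core-based methods rely on can be sketched with the standard peeling procedure: repeatedly remove all nodes of minimum remaining degree, labeling each with the current shell index. This is a generic implementation, not the specific code of [96, 159]:

```python
def k_shell(adj):
    """Assign each node its k-shell index by iteratively peeling nodes
    of minimum remaining degree (standard k-core decomposition)."""
    degree = {u: len(vs) for u, vs in adj.items()}
    alive = set(adj)
    shell = {}
    k = 0
    while alive:
        k = max(k, min(degree[u] for u in alive))
        # peel every node whose remaining degree has dropped to <= k
        while True:
            peel = [u for u in alive if degree[u] <= k]
            if not peel:
                break
            for u in peel:
                shell[u] = k
                alive.discard(u)
                for v in adj[u]:
                    if v in alive:
                        degree[v] -= 1
    return shell

# A triangle with a pendant node: the triangle forms the 2-shell core.
print(k_shell({"A": ["B", "C", "D"], "B": ["A", "C"],
               "C": ["A", "B"], "D": ["A"]}))
```

Nodes in the innermost (highest-k) shell are the candidates the core-based restraining methods would monitor.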
Most real OSNs typically contain parts in which the nodes are more highly
connected to each other than to the rest of the network. The sets of such nodes
are usually called communities in OSNs. The existing methods used to identify
communities mainly fall into two types: finding overlapping communities [138, 139]
and finding separated communities [31, 33, 106, 137, 174].
Overlapped Every OSN user in the real world has numerous roles. For example,
a user who is a student belongs to a schoolmate community. This user may also
belong to a family community and various hobby groups. Therefore, most actual
OSNs are made of highly overlapping cohesive groups of users [138, 140]. The
nodes located in more than one community are the bridges between communities.
The bridges forward information from one community to another. If we control the
bridges and block the spread of rumors on them, the scale of the rumor propagation
will be limited to the local community. We illustrate this kind of method [138, 139]
in Fig. 4.3a.
Fig. 4.3 Restraining the rumors by controlling the bridges between communities. (a): communi-
ties are overlapped; (b): communities are separated
Separated Some researchers [31, 33, 106, 137, 174] extract social relationship
graphs by partitioning the topologies of OSNs into numerous separated
communities. The premise of these methods is that users are more likely to
receive and forward information from their social friends. Thus, these separated
communities represent the most likely propagation paths of the rumors and the
truths. Compared with the overlapped case, the bridges here are the nodes which
have outward connections to nodes of other communities. As shown in Fig. 4.3b,
when the bridges between separated communities are controlled, the spread of
rumors will also be limited to a small scale.
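Given a community assignment for the separated case, the bridges are exactly the nodes with at least one outward edge, which can be sketched in a few lines (the graph and community labels below are hypothetical):

```python
def bridge_nodes(adj, community):
    """Nodes with at least one neighbour assigned to a different community
    (the bridges of the 'separated communities' strategy)."""
    return {u for u, vs in adj.items()
            if any(community[v] != community[u] for v in vs)}

# Hypothetical two-community toy graph; E-F is the only inter-community edge.
adj = {
    "A": ["B", "E"], "B": ["A", "E"], "E": ["A", "B", "F"],
    "F": ["E", "C", "D"], "C": ["D", "F"], "D": ["C", "F"],
}
community = {"A": 1, "B": 1, "E": 1, "F": 2, "C": 2, "D": 2}
print(sorted(bridge_nodes(adj, community)))  # ['E', 'F']
```

Controlling exactly these nodes cuts every inter-community path, confining a rumor to the community where it starts.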
We build up in this section the mathematical model in order to analyze the spread
of rumors and investigate the methods of restraining their propagation.
In the real world, people may believe rumors, truths or have not heard of any
information from OSN. Let random variable Xi (t) represent the state of user i at
discrete time t. We borrow the concepts from epidemics and derive the values of
$X_i(t)$ as follows:

$$X_i(t) = \begin{cases} Sus., & \text{susceptible} \\ Def., & \text{defended} \\ Act., & \text{active} \\ Rec., & \text{recovered} \\ Imm., & \text{immunized} \\ Con., & \text{contagious} \\ Inf., & \text{infected} \\ Mis., & \text{misled} \end{cases} \qquad (4.1)$$
Firstly, every user is presumed to be susceptible (Xi (t) = Sus.) at the beginning.
If a user is proactively controlled and will block the rumors, the node of this user
is in the Def. state. An arbitrary user i believes the rumor if Xi (t) = Inf., or the
truth if Xi (t) = Rec. Secondly, users seldom forward the same rumor or truth
messages multiple times to ‘persuade’ their social friends into accepting what they
themselves believe. Thus, we assume OSN users distribute the rumor or the truth
only once, at the time when they get infected (Xi (t) = Con.) or recovered
(Xi (t) = Act.). After that, they stop spreading the rumor (Xi (t) = Mis.) or the
truth (Xi (t) = Imm.). Thirdly, the origins of true news in the real world usually
have high prestige among the masses. Thus, an infected user can be recovered and
will not be infected again. The user stays immunized after he or she trusts the truth.
We provide the state transition graph for an arbitrary user in Fig. 4.5. We can see
that most users will finally believe the truth, as the Imm. state is an absorbing state.
The nodes and the topology are the basic elements for the propagation of OSN
rumors and truths. Given an OSN, we derive its topology. A node in the topology
denotes a user in the OSN. Here, we propose employing an m × m square matrix
with elements $(\eta^R_{ij}, \eta^T_{ij})$, where $\eta^R_{ij}, \eta^T_{ij} \in [0, 1]$, to describe the topology of an
OSN with m nodes, as in

$$\begin{bmatrix} \eta^R_{11}, \eta^T_{11} & \cdots & \eta^R_{1m}, \eta^T_{1m} \\ \vdots & \eta^R_{ij}, \eta^T_{ij} & \vdots \\ \eta^R_{m1}, \eta^T_{m1} & \cdots & \eta^R_{mm}, \eta^T_{mm} \end{bmatrix}$$

where $\eta^R_{ij}$ and $\eta^T_{ij}$ denote the probability of rumors and truths spreading from user
i to user j, respectively. If user i has contact with user j, we have $\eta^R_{ij} \neq 0$ and
$\eta^T_{ij} \neq 0$; otherwise, $\eta^R_{ij} = 0$ and $\eta^T_{ij} = 0$.
We introduce a widely approved discrete model [9, 29, 109, 185, 186, 195] to
represent the propagation of rumors and truths in OSNs. The discrete model can
locate each influential node and evaluate its impact on the spread. Given the
topology of an OSN with m nodes, we can estimate the number of susceptible and
recovered users at time t, S(t) and R(t), as in

$$S(t) = \sum_{i=1}^{m} P(X_i(t) = Sus.), \qquad R(t) = \sum_{i=1}^{m} P(X_i(t) = Rec.). \qquad (4.2)$$
As shown in Fig. 4.5, a susceptible user may accept the rumor and the node
enters the I nf. state. An infected node may also be recovered if this user accepts
the truth. We use v(i, t) and r(i, t) to denote the probability of user i being
infected or recovered. Then, the values of P (Xi (t) = Sus.), P (Xi (t) = Rec.)
and P (Xi (t) = Def.) can be iterated using the discrete difference equations as in
where Ni denotes the set of user i’s neighbors. We assume the states of nodes in
the topology are independent. Then, according to the state transitions in Fig. 4.5, the
values of P (Xi (t) = Con.) and P (Xi (t) = Act.) can be derived as in
From the above equations, we adopt discrete time to model the propagation
dynamics. Note that the length of each time tick relies on the real environment.
It can be 1 min, 1 h or 1 day.
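The time-tick iteration described above can be sketched in code. The following is a simplified illustration only, not the authors' full model: the rumor-specific states (Con., Act., Mis., Imm.) are collapsed into a plain susceptible/infected/recovered split, and the per-tick probabilities v and r are supplied by the caller instead of being derived from the topology matrix.

```python
# Simplified sketch of one discrete time tick of the propagation model.
# p_sus[i], p_inf[i], p_rec[i] are the marginals P(X_i(t) = Sus./Inf./Rec.);
# v[i] and r[i] are the per-tick infection and recovery probabilities
# (assumed inputs here, derived from the topology in the actual model).

def step(p_sus, p_inf, p_rec, v, r):
    """One time tick: susceptible -> infected w.p. v[i],
    infected -> recovered w.p. r[i]."""
    new_sus, new_inf, new_rec = [], [], []
    for s, i, c, vi, ri in zip(p_sus, p_inf, p_rec, v, r):
        new_sus.append(s * (1 - vi))
        new_inf.append(i * (1 - ri) + s * vi)
        new_rec.append(c + i * ri)
    return new_sus, new_inf, new_rec

def totals(p_sus, p_rec):
    # S(t) and R(t) as in Eq. (4.2): sums of the per-node marginals.
    return sum(p_sus), sum(p_rec)
```

Each tick moves probability mass along the transitions of Fig. 4.5, so the per-node marginals always sum to one.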
According to the ways people believe rumors and truths, we derive different values of v(i, t) and r(i, t). Below, we summarize two major cases based on our analysis of the real world.
Absolute Belief In this case, we optimistically assume OSN users absolutely believe the truth unless they have only received rumors. Then, we can derive the values of v(i, t) and r(i, t) as in

v(i, t) = [1 − Neg(i, t)] · Pos(i, t),
r(i, t) = 1 − Pos(i, t).    (4.10)
In the real world, this case generally happens when the origins of true news have high prestige among the masses. For example, when the rumor "two explosions in White House and Barack Obama is injured" spread fast on Twitter [143], the White House, as an origin with absolute credibility among most people, swiftly stopped the rumor by clarifying and spreading the truth "Obama is fine and no explosion happened".
Minority is Subordinate to Majority In this case, people do not absolutely trust
the origins of the truths. They believe either the rumor or the truth according to
the ratio of believers among their OSN friends. We can estimate the number of
received rumor and truth copies, C_R(i, t) and C_T(i, t), for each user i as in

C_R(i, t) = Σ_{j∈N_i} η^R_ij · P(X_j(t − 1) = Con.),
C_T(i, t) = Σ_{j∈N_i} η^T_ij · P(X_j(t − 1) = Act.),    (4.11)

where the value of Neg(i, t) · Pos(i, t) is the probability of people refuting both kinds of information. In the real world, "minority is subordinate to majority" (M-S-M) is the more general case. When more friends choose to accept one kind of information, the probability of the user believing that kind of information is larger than the probability of choosing the opposite one.
Before we carry out analysis using the mathematical model, we set up simulations to validate its correctness. The experiment topologies are two real OSNs: Facebook [171] and Google Plus [74]. The simulations are implemented on the basis of existing simulation work [192]. We mainly focus on critical rumors (η^R_ij > 0.5).

Fig. 4.6 The accuracy evaluation of the modelling compared with simulations
In this section, we analyze the proactive measures in order to find out the most
efficient one for blocking rumors. The degree measure can be directly derived from
the OSN topology. The betweenness measure is worked out using the standard
algorithm [132]. We also implement the k-shell decomposition algorithm [26] to
identify the core of OSNs. To locate community bridges, we use CFinder [28] to
identify the overlapped communities and NetMiner [130] for the separated ones.
We focus on the Facebook network [171] in this section.
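As an aside, the k-shell decomposition used above to identify the core can be sketched with a simple peeling loop. This is a generic illustration on a toy graph of our own, not the implementation from [26]:

```python
def k_shell(adj):
    """Iteratively peel nodes of degree <= k; returns each node's shell index.
    adj: dict mapping node -> iterable of neighbours (undirected graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    shell, k = {}, 0
    while adj:
        peeled = True
        while peeled:
            peel = [v for v, ns in adj.items() if len(ns) <= k]
            peeled = bool(peel)
            for v in peel:
                shell[v] = k
                for u in adj.pop(v):
                    if u in adj:
                        adj[u].discard(v)
        k += 1
    return shell

# Example: a triangle (a, b, c) with a pendant node d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

Nodes in higher shells sit closer to the network core; in this toy graph the innermost shell is the triangle {a, b, c}.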
We first work out all proactive measures and show the sorted results of influential nodes in Fig. 4.7. For the degree measure (Fig. 4.7a), we can see that the node degrees follow power laws [147]. This means the nodes with large degrees are rare in the topology but contribute significantly to the OSN connectivity. Similar results can also be observed for the betweenness measure (Fig. 4.7b). For the core measure (Fig. 4.7c), we can see that the innermost part is finally left as a quite small group of nodes in the network.

Fig. 4.7 The sorted results of the influential nodes in the Facebook topology
The results of network communities are shown in Fig. 4.8. For the separated
communities (Fig. 4.8a), we find several large communities dominate the majority
of nodes in the network. In Fig. 4.8b, we set k = 5 (refer to CFinder [28]) and obtain
similar results for the overlapped communities.
From the empirical perspective, we examine which proactive measure is more efficient. We use λ to denote the defense ratio of nodes in OSNs, and λ ranges from 1% to 30%. We mainly focus on critical rumors in this chapter (E(η^R_ij) > 0.5). To be typical, we set E(η^R_ij) = E(η^T_ij) = 0.6 or 0.9. In the real world, since critical rumors often originate from the most popular users, we let the rumors in the modelling spread from the node with the largest degree. The results of the rumor spreading scale are shown in Fig. 4.9.
Observation 1 If we set the defense ratio (λ) close to 30%, the degree and betweenness measures will almost stop the spread of rumors. This result is in accordance with the percolation ratio used to stop viruses in Email networks [207]. However, real OSNs generally have large scales; blocking rumors at 30% of the users in an OSN is too many to be realized in the real world.

Fig. 4.9 The final steady amount of infected nodes when we apply proactive measures with different defense ratios

Fig. 4.10 The propagation dynamics of rumors when we carry out defense according to different proactive measures
Observation 2 The betweenness and degree measures outperform all the other
measures, and the betweenness measure performs much better than the degree
measure if λ ≤ 20%. This result is in accordance with the work [110, 185].
Figure 4.9 has presented the final amount of infected users given a rumor spreading in the network. We further investigate the propagation dynamics of those measures (typically setting λ = 10% or 20%). The results are shown in Fig. 4.10.

Observation 3 The degree measure performs better than the betweenness measure in the early stage. The degree and betweenness measures outperform all the others throughout the spreading procedure. However, different from Observation 2, the degree measure has better short-term efficiency than the betweenness measure. The degree measure is also suggested by the work [5].
OSN users receive and send rumors from and to their neighboring users. We use A^t_ij to denote the potential contagious ability caused by the rumor spreading from node i to node j at time t. We also introduce P^t_ij to denote the potential contagious probability of node j contributed by node i at time t. The mean value of A^t_i can then be recursively worked out as in

E(A^t_i) = Σ_{j∈N_i} [E(A^{t+1}_{ij}) + P^{t+1}_{ij}],    (4.14)

where δ^t_ij denotes the ratio of node i's contribution to the infection of node j at time t among all the father nodes of node j, and we have
As shown in Fig. 4.5, the Imm. state is an absorbing state. Given an OSN with a finite number of users, we can predict that the spread of rumors finally becomes steady and the values of A^t_i and P(X_i(t) = Con.) converge to zero as t → ∞. As a result, the contagious ability of each node in OSNs can be recursively and reversely worked out by setting a large final time of the spread.
We further calculate the contagious time in order to numerically evaluate the
temporal efficiency of those measures against the spread of rumors.
Definition 4.2 Given an OSN and an incident of rumor spreading in this network,
the contagious time of an arbitrary node i, Ti , is defined as the mean time of node i
getting infected in the whole propagation.
Conceptually, the contagious time of node i, T_i, can be easily computed as in

T_i = [Σ_{t=0}^{∞} P(X_i(t) = Con.) · t] / [Σ_{t=0}^{∞} P(X_i(t) = Con.)].    (4.17)
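Definition 4.2 and Eq. (4.17) translate directly into a short helper. This is an illustrative sketch; the input series `p_con` is an assumed representation of P(X_i(t) = Con.), truncated at a large final time since the series converges:

```python
def contagious_time(p_con):
    """Mean infection time of a node, Eq. (4.17): the average of t weighted
    by P(X_i(t) = Con.). p_con[t] holds P(X_i(t) = Con.) for t = 0, 1, ...
    (truncated at a large final time of the spread)."""
    total = sum(p_con)
    return sum(t * p for t, p in enumerate(p_con)) / total

# A node equally likely to be contagious at tick 1 or tick 3 has T_i = 2.
assert contagious_time([0.0, 0.5, 0.0, 0.5]) == 2.0
```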
In this section, we analyze the remedial measure using the mathematical model. There are mainly two factors, t_inject and E(η^T_ij), that can greatly affect the efficiency of restraining rumors by spreading truths.

To exclusively investigate the impact of t_inject, we typically set E(η^R_ij) = E(η^T_ij) = 0.75. Based on the spreading dynamics shown in Fig. 4.10, we assign t_inject so that
• the truth starts with the rumor,
• the truth starts in the early stage of the rumor spread,
• the truth starts in the late stage of the rumor spread.
Fig. 4.13 The number of infected users by varying the truth injecting time. Setting: E(η^R_ij) = 0.75

Fig. 4.14 The number of the contagious and the active nodes at any time t in the propagation. Setting: E(η^R_ij) = 0.75

The experiments are executed on both the Facebook and Google Plus networks, and with both the cases of people making absolute choices and making M-S-M choices. The results are shown in Fig. 4.13.
Observation 4 The truth clarification method performs better if the spread of truths starts earlier; if not, this method has weak performance in the early stage since the rumors are distributed incredibly fast. We can see that the propagation scale decreases dramatically after we inject the truth into the network. Both the spread of rumors and truths will finally become steady. The results in Fig. 4.13 indicate that the remedial measure of spreading truths mainly has a long-term effectiveness in restraining rumors.
We further investigate the number of the contagious nodes (Σ^m_i P(X_i(t) = Con.)) and the active nodes (Σ^m_i P(X_i(t) = Act.)) at any time t during the spread. The results are shown in Fig. 4.14. We can see from Fig. 4.14(A1) and (C1) that t_inject has some effect on restraining the number of contagious nodes when people make absolute choices. However, in Fig. 4.14(B1) and (D1), we find t_inject has no obvious effect when people make M-S-M choices. Moreover, we can see from Fig. 4.14(A2–D2) that the number of active nodes varies according to the value of t_inject. The results of Fig. 4.14, both from the number of contagious nodes and of active nodes in the propagation dynamics, well explain the impact of t_inject observed in Fig. 4.13.
We typically set t_inject = 3 and E(η^R_ij) = 0.6, and vary the value of E(η^T_ij) for spreading truths in OSNs. We additionally examine the number of active nodes (Σ^m_i P(X_i(t) = Act.)) at any time t during the spread dynamics. As shown in Fig. 4.16, a smaller value of E(η^T_ij) leads to a smaller number of active nodes. This exactly corresponds to the limited efficiency of the remedial measure shown in Fig. 4.15.
Given a critical rumor spreading in the network (E(η^R_ij) > 0.5), we can summarize the results as follows.

Fig. 4.15 The number of infected nodes and recovered nodes with different values of E(η^T_ij). Setting: t_inject = 3, E(η^R_ij) = 0.75

Fig. 4.16 The number of active nodes with different values of E(η^T_ij). Setting: t_inject = 3, E(η^R_ij) = 0.6

E(η^T_ij) > 0.5 In the real world, through propaganda or other measures, people may become willing to believe and spread truths. According to the previous analysis, the truth holder can achieve acceptable or even better results by spreading truths to restrain rumors when E(η^T_ij) > 0.5.

E(η^T_ij) < 0.5 According to the results of Fig. 4.15, the remedial measure may not be able to counter the spread of rumors in this case. Actually, this is a common phenomenon in the real world.
In this section, we investigate the pros and cons when different measures work
together. We also explore the equivalence of these measures.
To numerically evaluate the effectiveness of these measures, we use the maximal number of infected users (Imax) and the final number of infected users (Ifinal) to represent the damage caused by rumors. In the real world, when either Imax or Ifinal becomes larger, more damage is caused to society.
Firstly, we examine the values of Imax and Ifinal on the basis of the mathematical model. We typically set t_inject = 3 and let E(η^T_ij) range from 0.1 to 0.9. The results are shown in Fig. 4.17. We can see that the values of Imax always stay large, while the values of Ifinal gradually decrease with increasing E(η^T_ij). This indicates the remedial measure cannot alleviate the damage denoted by Imax. On the contrary, the proactive measures are able to reduce Imax.
Secondly, the spread of rumors and truths actually presents a common issue in the psychology field when E(η^T_ij) < 0.5 < E(η^R_ij). That is, "rumor has wings while truth always stays indoors", since people naturally have a 'negativity bias'.

Fig. 4.17 The maximum number of infected users (Imax), the final number of infected users (Ifinal) and the final number of recovered users (Rfinal). Settings: t_inject = 3, E(η^R_ij) = 0.75

We compare three settings: … E(η^T_ij) = 0.6: remedial measures, (3) λ = 5%, E(η^T_ij) = 0.3: the two methods together.
The results are shown in Fig. 4.18. We find that if we set λ = 5% and E(η^T_ij) = 0.3, both Imax and Ifinal decrease compared with the other two extreme settings, each of which can only reduce either Imax or Ifinal.
In the real world, surveillance of influential users needs much financial support, and the propaganda used to promote the spread of truths also costs much money. Given a limited budget, we explore the equivalence between the proactive and remedial measures in order to leverage these two different strategies.
Firstly, we investigate Ifinal when we apply different defense ratios (λ) and values of E(η^T_ij) to the propagation of rumors and truths. On the basis of our mathematical model, this part of the analysis discloses the congruent relationship between the values of λ and E(η^T_ij) in networks. Typically, we set t_inject = 3, E(η^R_ij) = 0.75 and use the Facebook and Google Plus topologies. The results are shown in Fig. 4.19. Given a pair of λ and E(η^T_ij), we can find several equivalent solutions with different values of λ and E(η^T_ij). These different solutions have the same performance as the original pair, which means we can leverage the two strategies against each other.

Fig. 4.19 The final number of infected nodes (Ifinal) when we set a series of different defense ratios (λ) and truth spreading probabilities E(η^T_ij). Setting: t_inject = 3, E(η^R_ij) = 0.75
Fig. 4.20 The numeric equivalence between the degree measure and the remedial measure when we set a series of different defense ratios (λ) and truth spreading probabilities E(η^T_ij). Setting: t_inject = 3, E(η^R_ij) = 0.75
4.7 Summary
In this section, we first discuss the robustness of the contagious ability. Then, we
discuss the fairness to the community bridges when we evaluate the efficiency of
restraining rumors. We finally summarize the work in this chapter.
In this section, we firstly discuss the robustness of the contagious ability. According to the definition of contagious ability, its usage relies on the rumor spreading origins. However, it can be directly used for the numeric evaluation of other measures when the spread of rumors originates from highly connected nodes. To confirm the robustness of this usage, we examine the average degree of contagious nodes, D_t, at each time tick t, as in

D_t = Σ_{i=1}^{m} [P(X_i(t) = Con.) / Σ_{j=1}^{m} P(X_j(t) = Con.)] · d_i.    (4.18)
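Eq. (4.18) is a probability-weighted average of node degrees, which can be sketched as a short helper. The inputs (`p_con`, `degrees`) are assumed representations of P(X_i(t) = Con.) and d_i at a fixed tick:

```python
def avg_contagious_degree(p_con, degrees):
    """Average degree D_t of contagious nodes, Eq. (4.18).
    p_con[i] = P(X_i(t) = Con.) at the current tick; degrees[i] = d_i."""
    total = sum(p_con)
    return sum(p * d for p, d in zip(p_con, degrees)) / total

# Two equally contagious nodes with degrees 2 and 4 average to D_t = 3.
assert avg_contagious_degree([0.5, 0.5], [2, 4]) == 3.0
```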
In the real world, people form various communities according to their interests,
occupations and social relationships. They are more likely to contact the ones within
the same communities. Thus, it would be more precise to consider this premise in
our analysis. However, the algorithms (CFinder [28] and NetMiner [130]) have not
considered the communication bias between community members. This may cause
some unfairness to the community bridges when we evaluate the rumor restraining
efficiency.
In fact, the spread of information in community environments is a more complex process. We plan to incorporate the communication bias within communities from the records of real OSNs. This may help us more accurately evaluate the efficiency of different measures. Due to the page limit, we leave this part to future work.
In summary, we carry out a series of analyses on the methods of restraining rumors. On the basis of our mathematical model, the analysis results suggest that the degree and betweenness measures outperform all the other proactive measures. In addition, we observe that the degree measure has better short-term performance in the early stage. We also investigate the efficiency of spreading truths in order to restrain rumors, and find the truth clarification method mainly has a long-term performance. In order to address the critical case "rumor has wings while truth always stays indoors", we further explore the strategies of different measures working together and the equivalence for leveraging both of them. From both the academic and practical perspectives, our work is of great significance to the work in this field.
Part II
Source Identification of Malicious
Attack Propagation
Chapter 5
Preliminary of Identifying Propagation
Sources
Fig. 5.1 Illustration of three categories of observation in networks. (a) Complete observation; (b) Snapshot; (c) Sensor observation
but cannot distinguish susceptible or recovered nodes; (3) only a set of nodes
were observed at time t when the snapshot was taken; (4) only the nodes who
were infected exactly at time t were observed. An example of the fourth type of snapshots is shown in Fig. 5.1b.
Sensor Observation Sensors are firstly injected into networks, and then the
propagation dynamics over these sensor nodes are collected, including their
states, state transition time and infection directions. In fact, sensors also stand
for users or computers in networks. The difference between sensors and other
nodes in networks is that they are usually monitored by network administrators
in practice. Therefore, the sensors can record all details of the malicious attack propagation over them, and their lifetime can theoretically be assumed to be everlasting during the propagation dynamics. This is different from mobile sensor devices, which may stop working when their batteries run out. As an example, we show the sensor observation in Fig. 5.1c.
The initial methods, such as Rumor Center [160] and Dynamic Age [58], for
propagation source identification require the complete observation of the network
status. Later, researchers proposed source identification methods, such as Jordan
Center and Concentricity based methods, for partial observations like snapshots.
Researchers have also explored source identification methods that inject sensors into the underlying network and identify the propagation sources based on the observations of the sensors. Accordingly, current source identification methods can be categorized into three categories in accordance with these three different types of observations, which we will introduce in the following chapters.
log L(θ|X) = Σ_{i=1}^{n} log f(x_i|θ).    (5.2)

θ̂ = arg max_θ log L(θ|X) = arg max_θ Σ_{i=1}^{n} log f(x_i|θ).    (5.3)
To find the optimal parameter θ̂ which best describes the observed data given the model, and thus provides the largest log-likelihood value, we can solve Eq. (5.3) analytically or computationally search for the best solution in the parameter space. This book mainly adopts MLE to estimate the probability of a node being a candidate propagation source.
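As a toy illustration of Eq. (5.3), the sketch below estimates a Bernoulli parameter by grid search over the log-likelihood. The model, the data, and the grid are our own illustrative choices, not from this book:

```python
import math

def log_likelihood(theta, xs):
    # Sum of log f(x_i | theta) for a Bernoulli model, as in Eq. (5.2).
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

xs = [1, 1, 0, 1]                       # three heads, one tail
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda th: log_likelihood(th, xs))
# theta_hat coincides with the closed-form Bernoulli MLE, mean(xs) = 0.75
```

A grid search stands in here for the "computational search in the parameter space" mentioned above; for simple models the argmax can instead be obtained in closed form.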
6.1 Introduction
In the modern world, the ubiquity of networks has made us vulnerable to various types of malicious attacks. These malicious attacks arise in many different contexts, but share a common structure: an isolated risk is amplified because it is spread by the network. For example, as we have witnessed, computer viruses utilize the Internet to infect millions of computers every day. Malicious rumors or misinformation can rapidly spread through existing social networks and have pernicious effects on individuals and society. In the recent financial crisis, the strong dependencies, or 'network', between institutions led to a situation where the failure of one institution caused global instabilities.
In essence, all of these situations can be modeled as a rumor spreading through
a network, where the goal is to find the source of the rumor in order to control
and prevent these network risks based on limited information about the network
structure and the “rumor infected” nodes. The answer to this problem has many
important applications, and can help us answer the following questions: who is the
rumor source in online social networks? which computer is the first one infected by
a computer virus? and where is the source of an epidemic?
The initial work in addressing the problem of identifying the rumor propagation
source has primarily focused on the complete observations of the networks under
malicious attack. Shah and Zaman [160–162] were the first to provide a systematic study of the problem. They model rumor spreading in a network with the popular Susceptible-Infected (SI) model and then construct an estimator for the rumor source. The estimator is based on a novel topological quantity called rumor centrality. They established a maximum likelihood (ML) estimator for several classes of graphs: regular trees, general trees, and general graphs. They find that on tree graphs the rumor center and the distance center are equivalent, but on general graphs they may differ.
Suppose that a rumor starting at node v∗ at time 0 has spread in the network G. We observe the network at some time and find N infected nodes. These nodes must form a connected subgraph of G, denoted as G_N. Based on the observation G_N and the knowledge of G, the maximum likelihood (ML) estimator v̂ of v∗ minimizes the error probability. By definition, the ML estimator is

v̂ = arg max_{v∈G_N} P(G_N|v),    (6.1)

where P(G_N|v) is the probability of observing G_N under the SI model assuming v is the source v∗. Thus, we need to calculate P(G_N|v) for all v ∈ G_N and then treat the node with the maximal value as the rumor source.
6.4 Rumor Source Estimator: ML for Regular Trees
Note that the calculation of P(G_N|v) is not computationally tractable in general. Shah and Zaman [160] first evaluate P(G_N|v) on a tree graph. Essentially, they need to find the probability of all possible events that result in G_N after N nodes are infected, starting with v as the source, under the SI model. For example, in Fig. 6.1, suppose node 1 was the source, i.e., we need to calculate P(G_4|1). Then there are two disjoint events, or node orders in which the rumor spreads, that will lead to G_4 with node 1 as the source: {1, 2, 3, 4} and {1, 2, 4, 3}. In general, to evaluate P(G_N|v), we need to find all such permitted permutations and their corresponding probabilities. The permitted permutations are defined as follows.
Definition 6.1 (Permitted Permutation) Given a connected tree G(V , E) and a
source node v ∈ V , consider any permutation σ : V → {1, 2, · · · , |V |} of its
nodes where σ (u) denotes the position of node u in the permutation σ . σ is called a
permitted permutation for tree G(V , E) with source node v if
1. σ (v) = 1.
2. For any (u, u ) ∈ E, if d(v, u) < d(v, u ), then σ (u) < σ (u ). Here, d(v, u)
denotes the shortest path distance from v to u.
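Definition 6.1 can be checked mechanically. The sketch below is illustrative: the tree mirrors the shape of the earlier four-node example, and the helper names are our own:

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest-path (hop) distances from src via breadth-first search."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def is_permitted(adj, source, sigma):
    """Check Definition 6.1: sigma (a list of all nodes in infection order)
    must start at the source, and for every edge (u, w) with
    d(source, u) < d(source, w), u must appear before w."""
    if sigma[0] != source:
        return False
    pos = {u: i for i, u in enumerate(sigma)}
    d = bfs_distances(adj, source)
    return all(pos[u] < pos[w]
               for u in adj for w in adj[u] if d[u] < d[w])

# A tree shaped like the earlier example: edges 1-2, 2-3, 2-4.
tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
```

With source 1, exactly the orders {1, 2, 3, 4} and {1, 2, 4, 3} pass the check, matching the two disjoint events discussed above.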
Let Ω(v, G_N) be the set of all permitted permutations starting with node v and resulting in rumor graph G_N. The next step is to determine the probability P(σ|v) for each σ ∈ Ω(v, G_N). Let σ = {v_1 = v, v_2, · · · , v_N}, and define G_k(σ) as the subgraph of G_N containing nodes {v_1 = v, v_2, · · · , v_k} for 1 ≤ k ≤ N. Then,
P(σ|v) = Π_{k=2}^{N} P(k-th infected node = v_k | G_{k−1}(σ), v).    (6.2)
Each term in the product on the right-hand side in Eq. (6.2), can be evaluated as
follows. Given Gk−1 (σ ) and source v, the next infected node could be any of the
neighbors of the nodes in Gk−1 (σ ) which are not yet infected. If Gk−1 (σ ) has
nk−1 (σ ) uninfected neighboring nodes, then each one of them is equally likely to be
the next infected node with probability 1/nk−1 (σ ). Therefore, Eq. (6.2) reduces to
P(σ|v) = Π_{k=2}^{N} 1/n_{k−1}(σ).    (6.3)
Given Eq. (6.3), the problem of computing P(σ|v) becomes evaluating the size of the rumor boundary, n_{k−1}(σ), for 2 ≤ k ≤ N. Suppose the k-th node added to G_{k−1}(σ) is v_k(σ) with degree d_k(σ). Then it contributes d_k(σ) − 2 new edges (and hence nodes in the tree) to the rumor boundary. This is because d_k(σ) − 1 new edges are added, while the edge along which the recent infection happened has to be removed. That is, n_k(σ) = n_{k−1}(σ) + d_k(σ) − 2. Subsequently,

n_k(σ) = d_1(σ) + Σ_{i=2}^{k} (d_i(σ) − 2).    (6.4)
Therefore,

P(σ|v) = Π_{k=2}^{N} 1/[d_1(σ) + Σ_{i=2}^{k−1} (d_i(σ) − 2)].    (6.5)
For a d-regular tree, since all nodes have the same degree d, it follows from Eq. (6.5) that every permitted permutation σ has the same probability, independent of the source. Specifically, for any source v and permitted permutation σ,

P(σ|v) = Π_{k=1}^{N−1} [dk − 2(k − 1)]^{−1} ≡ p(d, N).    (6.6)
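Eqs. (6.4)–(6.6) can be verified numerically with a small sketch that tracks the rumor boundary; the function name and the exact-arithmetic choice (fractions) are ours:

```python
from fractions import Fraction

def permutation_probability(degrees):
    """P(sigma|v) from Eqs. (6.4)-(6.5). degrees[k-1] is d_k(sigma), the
    degree of the k-th infected node in the order sigma (source first)."""
    prob, boundary = Fraction(1), degrees[0]   # n_1 = d_1
    for d in degrees[1:]:
        prob /= boundary          # next node is uniform over the boundary
        boundary += d - 2         # d-1 edges added, infecting edge removed
    return prob

# For a 3-regular tree with N = 3, Eq. (6.6) gives
# p(3, 3) = 1/3 * 1/(3*2 - 2) = 1/12, independent of sigma:
assert permutation_probability([3, 3, 3]) == Fraction(1, 12)
```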
From the above, it follows immediately that for a d-regular tree, for any G_N and candidate source v, P(G_N|v) is proportional to |Ω(v, G_N)|. Formally, they use R(v, G_N) to denote the number of distinct permitted permutations, |Ω(v, G_N)|.
Definition 6.2 Given a graph G(V , E) and vertex v of G, R(v, GN ) is defined as
the total number of distinct permitted permutations of nodes of G that begin with
node v ∈ G and respect the graph structure of G.
In summary, the ML estimator for a regular tree becomes

v̂ = arg max_{v∈G_N} R(v, G_N).    (6.7)

As Eq. (6.7) suggests, the ML estimator for a regular tree can be obtained by simply evaluating R(v, G_N) for all v. However, as indicated by Eq. (6.5), this is not the case for a general tree with heterogeneous degrees. To form an ML estimator for a general tree, it is required to keep track of the probability of every permitted permutation. This is computationally expensive due to the exponential number of terms involved.
6.6 Rumor Source Estimator: ML for General Graphs
Note that the likelihood of a node is a sum of the probabilities of every permitted permutation for which it is the source. In general, these will have different values, but it may be that a majority of them have a common value. To obtain this common value, Shah and Zaman [160] assume the nodes receive the rumor in a breadth-first search (BFS) order. For example, consider the network in Fig. 6.2. If node 2 is the source, then a BFS sequence of nodes would be {2, 1, 3, 4, 5}, and the probability of this permitted permutation is given by Eq. (6.5).

Suppose σ∗_v is the BFS permitted permutation with node v as the source and T_bfs(v) is the corresponding BFS tree; then the rumor source estimator becomes

v̂ = arg max_{v∈G_N} P(σ∗_v|v) · R(v, T_bfs(v)).    (6.8)
Consider a simple example as shown in Fig. 6.3, where the BFS trees for each node are shown. Using the expression for rumor centrality from Eq. (6.12), the general graph estimator values for the nodes are

P(σ∗_1|1)R(1, T_bfs(1)) = [1/(4 · 6 · 8 · 10)] · (5!/20),
P(σ∗_2|2)R(2, T_bfs(2)) = [1/(4 · 6 · 8 · 10)] · (5!/30),
P(σ∗_3|3)R(3, T_bfs(3)) = [1/(4 · 6 · 8 · 10)] · (5!/20),
P(σ∗_4|4)R(4, T_bfs(4)) = [1/(4 · 6 · 8 · 10)] · (5!/10),
P(σ∗_5|5)R(5, T_bfs(5)) = [1/(4 · 6 · 8 · 10)] · (5!/40).

Node 4 maximizes this value and would be the estimate of the rumor source.
Note from Eqs. (6.7), (6.8), and (6.9) that R(v, G_N) plays an important role in each of the rumor source estimators. Recall that R(v, G_N) counts the number of distinct ways a rumor can spread in the network G_N starting from source v. Shah and Zaman [160] call this number, R(v, G_N), the rumor centrality of node v with respect to G_N. The node with maximum rumor centrality is called the rumor center of the network. This section introduces the approach proposed by Shah and Zaman [160] to compute it.
Let GN be a tree graph. Define Tuv as the number of nodes in the subtree rooted at
node u, with node v as the source. Figure 6.4 illustrates this notation in a simple
example. Here, T21 = 3 because there are 3 nodes in the subtree with node 2 as the
root and node 1 as the source. Similarly, T71 = 1 because there is only 1 node in the
subtree with node 7 as the root and node 1 as the source.
To calculate R(v, G_N) is to count the number of permitted permutations of the N nodes of G_N. There are N slots in a given permitted permutation, the first of which must be the source node v. Note that a node u must come before all the nodes in its subtree T_u^v. Given a slot assignment for all nodes in T_u^v subject to this constraint, there are R(u, T_u^v) different ways in which these nodes can be ordered. This suggests a natural recursive relation between the rumor centrality R(v, G_N) and the rumor centralities of the immediate children's subtrees, R(u, T_u^v) with u ∈ child(v). Here child(v) represents the set of all children of v in the tree G_N assuming v as its root. Specifically, there is no constraint between the orderings of the nodes of different subtrees T_u^v with u ∈ child(v). This leads to

R(v, G_N) = (N − 1)! Π_{u∈child(v)} R(u, T_u^v)/T_u^v!.    (6.10)
If we expand this recursion Eq. (6.10) to the next level of depth in G_N, we obtain

R(v, G_N) = (N − 1)! Π_{u∈child(v)} R(u, T_u^v)/T_u^v!
         = (N − 1)! Π_{u∈child(v)} [(T_u^v − 1)!/T_u^v!] Π_{w∈child(u)} R(w, T_w^v)/T_w^v!    (6.11)
         = (N − 1)! Π_{u∈child(v)} (1/T_u^v) Π_{w∈child(u)} R(w, T_w^v)/T_w^v!.

A leaf node l has 1 node and 1 permitted permutation, so R(l, T_l^v) = 1. If we continue this recursion until we reach the leaves of the tree, we find that the number of permitted permutations for a given tree G_N rooted at v is

R(v, G_N) = (N − 1)! Π_{u∈G_N\{v}} 1/T_u^v = N! Π_{u∈G_N} 1/T_u^v.    (6.12)
Thus, Eq. (6.12) gives a simple expression for rumor centrality in a tree graph. As an example of the use of rumor centrality, consider the network in Fig. 6.5. Using the rumor centrality formula in Eq. (6.12), we find that the rumor centrality of node 1 is

R(1, G) = 5!/(5 · 3) = 8.    (6.13)

Indeed, there are 8 permitted permutations of this network with node 1 as the source.
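Eq. (6.12) is easy to implement on a tree once the subtree sizes T_u^v are known. The sketch below uses our own small example trees, not the network of Fig. 6.5:

```python
import math

def subtree_sizes(adj, root):
    """T_u^root for every node u of a tree, via iterative post-order."""
    size, stack = {}, [(root, None, False)]
    while stack:
        u, parent, done = stack.pop()
        if done:
            size[u] = 1 + sum(size[w] for w in adj[u] if w != parent)
        else:
            stack.append((u, parent, True))
            stack.extend((w, u, False) for w in adj[u] if w != parent)
    return size

def rumor_centrality(adj, v):
    """R(v, G) = N! * prod_u 1/T_u^v, Eq. (6.12), on a tree."""
    sizes = subtree_sizes(adj, v)
    return math.factorial(len(adj)) // math.prod(sizes.values())

# Sanity checks on small trees of our own:
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}   # R(0) = 4!/(4*1*1*1) = 6
path = {0: [1], 1: [0, 2], 2: [1]}              # R(1) = 3!/(3*1*1) = 2
```

The division is always exact: the product of subtree sizes divides N!, since R(v, G) counts permutations and is therefore an integer.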
Shah and Zaman [160] further compared rumor centrality with distance centrality. Distance centrality has become popular in the literature as a graph-based score function for various applications. For a graph G, the distance centrality of node v ∈ G, D(v, G), is defined as

D(v, G) = Σ_{j∈G} d(v, j),    (6.14)

where d(v, j) is the shortest path distance from node v to node j. The distance center of a graph is the node with the smallest distance centrality; intuitively, it is the node closest to all other nodes. Shah and Zaman [160] proved the following theorem, showing that the distance center is equivalent to the rumor center on a tree graph.
Theorem 6.1 On an N-node tree, if v_D is the distance center, then, for all v ≠ v_D, R(v_D, G) ≥ R(v, G).
Shah and Zaman [160] also show that the rumor center is not always equivalent to the distance center in a general graph.

Extensive simulations have been performed on both synthetic networks (a small-world network and a scale-free network) and real-world networks (the Internet autonomous system (AS) network [1] and the U.S. electric power grid network [16]). The results show that the rumor center estimator either finds the source exactly or within a few hops of the true source across different network topologies.
Chapter 7
Source Identification Under Snapshots:
A Sample Path Based Source Estimator
7.1 Introduction
The network is modeled as an undirected graph, and each node in the network has three possible states: susceptible (S), infected (I), and recovered (R). Nodes in state S can be infected and change to state I, and nodes in state I can recover and change to state R. Recovered nodes cannot be infected again. Initially, all nodes are assumed to be in the susceptible state except one infected node. The infected node is the propagation source of the malicious attack. The source then infects its neighbors, and the attack starts to spread in the network. Now, given a snapshot of the network in which some nodes are infected and some are healthy (susceptible or recovered), the susceptible nodes and recovered nodes are assumed to be indistinguishable. Zhu and Ying [200] proposed a low-complexity algorithm, called the reverse infection algorithm, for finding the sample path based estimator in the underlying network. In the algorithm, each infected node broadcasts its identity in the network, and the node that first collects all identities of infected nodes declares itself as the information source. They proved that the estimated source node is the node with the minimum infection eccentricity. Since a node with the minimum eccentricity in a graph is called the Jordan center, they call the nodes with the minimum infection eccentricity the Jordan infection centers.
Consider an undirected graph G(V , E), where V is the set of nodes and E is the
set of undirected edges. Each node v ∈ V has three possible states: susceptible (S),
infected (I ), and recovered (R). Zhu and Ying [200] assumed a time slotted system.
Nodes change their states at the beginning of each time slot, and the state of node v at time t is denoted by Xv (t). Initially, all nodes are in state S except source node v ∗ , which is in state I . At the beginning of each time slot, each infected node infects each of its susceptible neighbors with probability q. Each infected node recovers with probability p. Once a node recovers, it cannot be infected again. Then,
the infection process can be modeled as a discrete time Markov chain X(t), where
X(t) = {Xv (t), v ∈ V } is the states of all the nodes at time t. The initial state of this
Markov chain is Xv (0) = S for v = v ∗ and Xv ∗ (0) = I .
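The discrete-time dynamics just described can be sketched as a short simulation. The four-node path, the values of q and p, and the within-slot ordering of infection before recovery are all illustrative assumptions; the text leaves the ordering implicit:

```python
import random

def sir_step(adj, state, q, p, rng):
    """One time slot of the discrete-time SIR chain: every infected node
    first tries to infect each susceptible neighbor with probability q,
    then recovers with probability p (within-slot ordering is an assumed
    modeling choice)."""
    infected = [v for v, s in state.items() if s == "I"]
    nxt = dict(state)
    for u in infected:
        for v in adj[u]:
            if state[v] == "S" and rng.random() < q:
                nxt[v] = "I"
    for u in infected:
        if rng.random() < p:
            nxt[u] = "R"
    return nxt

def simulate(adj, source, q, p, t, seed=0):
    """Run the chain for t slots from a single infected source node."""
    rng = random.Random(seed)
    state = {v: "S" for v in adj}
    state[source] = "I"
    for _ in range(t):
        state = sir_step(adj, state, q, p, rng)
    return state

# Hypothetical path 1-2-3-4; with q = 1 and p = 0 the infection
# advances exactly one hop per slot.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = simulate(path, source=1, q=1.0, p=0.0, t=2)
```

The snapshot Y of Eq. (7.1) is then obtained by mapping state "I" to 1 and states "S" and "R" to 0.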
Suppose at time t we observe Y = {Yv , v ∈ V } such that

    Yv = 1 if v is in state I , and Yv = 0 if v is in state S or R.    (7.1)
In the illustrative snapshot example, some nodes are observed to be infected and the others are susceptible or recovered. The pair of numbers next to each node gives the corresponding infection time and recovery time. For example, node 3 was infected at time slot 2 and recovered at time slot 3; −1 indicates that the infection or recovery has not yet occurred. Note that these two pieces of information are generally not available.
Suppose X[0, t] = {X(τ ) : 0 ≤ τ ≤ t} is a sample path of the infection process from time 0 to t, and define the function F (·) such that

    F (Xv (t)) = 1 if Xv (t) = I , and F (Xv (t)) = 0 otherwise.    (7.2)
Then, F(X(t)) = Y if F (Xv (t)) = Yv for all v. Identifying the propagation source can be formulated as a maximum likelihood detection problem:

    v̂ ∈ arg max_{v∈V} Σ_{X[0,t]: F(X(t))=Y} Pr(X[0, t] | v ∗ = v),    (7.3)
where Pr(X[0, t]|v ∗ = v) is the probability to obtain sample path X[0, t] given the
source node v.
Note that the difficulty of solving the problem in (7.3) is the curse of dimensionality. For each v such that Yv = 0, both its infection time and recovery time are required, i.e., O(t²) possible choices; for each v such that Yv = 1, only the infection time needs to be considered, i.e., O(t) possible choices. Therefore, even for a fixed t, the number of possible sample paths is at least of the order of t^N, where N is the number of nodes in the network. This curse of dimensionality makes the problem computationally expensive. Zhu and Ying [200] introduced a sample path based approach to overcome the difficulty.
Instead of computing the marginal probability, Zhu and Ying [200] proposed to identify the sample path X∗ [0, t ∗ ] that most likely leads to Y, i.e.,

    X∗ [0, t ∗ ] ∈ arg max_{X[0,t]∈X} Pr(X[0, t]),    (7.4)

where X = {X[0, t] | F(X(t)) = Y}. The source node associated with X∗ [0, t ∗ ] is viewed as the information source.
In graph theory [77], the eccentricity e(v) of a vertex v is
the maximum distance between v and any other vertex in the graph. The Jordan
centers of a graph are the nodes which have the minimum eccentricity. For example,
in Fig. 7.2, the eccentricity of node v1 is 4 and the Jordan center is v2 , whose
eccentricity is 3. Following a similar terminology, Zhu and Ying [200] define the
infection eccentricity ẽ(v) given Y as the maximum distance between v and any
infected nodes in the graph. Then, the Jordan infection centers of a graph are the
nodes with the minimum infection eccentricity given Y. In Fig. 7.2, nodes v3 , v10 ,
v13 and v14 are observed to be infected. The infection eccentricities of v1 , v2 , v3 , v4
are 2, 3, 4, 5, respectively, and the Jordan infection center is v1 .
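Infection eccentricity is straightforward to compute with breadth-first search. The sketch below uses a hypothetical five-node path rather than the tree of Fig. 7.2:

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src to all reachable nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def infection_eccentricity(adj, v, infected):
    """Maximum distance from v to any observed infected node."""
    dist = bfs_distances(adj, v)
    return max(dist[u] for u in infected)

def jordan_infection_centers(adj, infected):
    """Nodes minimizing the infection eccentricity."""
    ecc = {v: infection_eccentricity(adj, v, infected) for v in adj}
    best = min(ecc.values())
    return [v for v in adj if ecc[v] == best]

# Hypothetical path 1-2-3-4-5 with nodes 1 and 3 observed infected:
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(jordan_infection_centers(path, {1, 3}))  # [2]
```

Node 2 is at distance 1 from both infected nodes, while every other node is at distance 2 or more from at least one of them, so it is the unique Jordan infection center of this example.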
Zhu and Ying [200] proved that the propagation source associated with the optimal sample path is a node with the minimum infection eccentricity. The proof of this result consists of three steps: first, assuming the information source is vr , they analyze the optimal duration tv∗r of the sample path in which vr is the information source, and prove that tv∗r equals the infection eccentricity of node vr . In the
second step, they consider two neighboring nodes, say nodes v1 and v2 . They proved
that if ẽ(v1 ) < ẽ(v2 ), then the optimal sample path rooted at v1 occurs with a higher
probability than the optimal sample path rooted at v2 . At the third step, they proved
that given any two nodes u and v, if v has the minimum infection eccentricity and u
has a larger infection eccentricity, then there exists a path from u to v along which
the infection eccentricity monotonically decreases, which implies that the source of
the optimal sample path must be a Jordan infection center. For example, in Fig. 7.2,
node v4 has a larger infection eccentricity than v1 and v4 → v3 → v2 → v1 is the
path along which the infection eccentricity monotonically decreases from 5 to 2. In
the next subsection, we briefly explain the techniques involved in these three steps.
Lemma 7.1 Consider a tree network rooted at vr and with infinitely many levels.
Assume the information source is the root, and the observed infection topology is
Y which contains at least one infected node. If ẽ(vr ) ≤ t1 < t2 , then the following inequality holds:

    max_{X[0,t1]∈X(t1)} Pr(X[0, t1] | v ∗ = vr) > max_{X[0,t2]∈X(t2)} Pr(X[0, t2] | v ∗ = vr),

where X(t) denotes the set of sample paths of duration t consistent with Y, d(vr , u) is the length of the shortest path between vr and u (also called the distance between vr and u), I is the set of infected nodes, and ẽ(vr ) = max_{u∈I} d(vr , u).
This lemma states that the optimal time is equal to the infection eccentricity. The next lemma states that the optimal sample path rooted at a node with a smaller infection eccentricity is more likely to occur.
Lemma 7.2 Consider a tree network with infinitely many levels. Assume the
information source is the root, and the observed infection topology is Y which
contains at least one infected node. For u, v ∈ V such that (u, v) ∈ E, if tu∗ > tv∗ ,
then
P r(X∗u ([0, tu∗ ])) < P r(X∗v ([0, tv∗ ])), (7.8)
where X∗u ([0, tu∗ ]) is the optimal sample path starting from node u.
Proof Denote by Tv the tree rooted at v, and by Tu−v the tree rooted at u but without the branch from v (see Tv1−v9 and Tv2−v7 in Fig. 7.2). Furthermore, denote by C (v) the set of children of v. The sample path X[0, t] restricted to Tu−v is defined to be X([0, t], Tu−v ).
The first step is to show tu∗ = tv∗ + 1. Note that Tv−u ∩ I ≠ ∅; otherwise, all infected nodes would be on Tu−v and, as T is a tree, v could only reach nodes in Tu−v through edge (u, v), so that tv∗ = tu∗ + 1, which contradicts tu∗ > tv∗ .
and ∀b ∈ Tv−u ∩ I ,
Hence,
According to the definition of tu∗ and tvI , within tu∗ − tvI time slots, node v can infect all infected nodes on Tv−u . Since tu∗ = tv∗ + 1, the infected node farthest from node u must be on Tv−u , which implies that there exists a node a ∈ Tv−u such that d(u, a) = tu∗ = tv∗ + 1 and d(v, a) = tv∗ . So node v cannot reach a within tu∗ − tvI time slots, which contradicts the fact that the infection spreads from node v to a within tu∗ − tvI time slots along the sample path X∗u [0, tu∗ ]. Therefore, tvI = 1.
Now given sample path X∗u ([0, tu∗ ]), the third step is to construct X∗v ([0, tv∗ ])
which occurs with a higher probability. The sample path X∗u ([0, tu∗ ]) can be divided
into two parts along subtrees Tu−v and Tv−u . Since tvI = 1, then
P r(X∗u ([0, tu∗ ])) = q · P r(X∗u ([0, tu∗ ], Tv−u )|tvI = 1) · P r(X∗u ([0, tu∗ ], Tu−v )).
(7.14)
Suppose in X∗v ([0, tv∗ ]), node u was infected at the first time slot, then
P r(X∗v ([0, tv∗ ])) = q · P r(X∗v ([0, tv∗ ], Tv−u )|tvI = 1) · P r(X∗v ([0, tv∗ ], Tu−v )|tuI = 1).
(7.15)
For the subtree Tv−u , given X∗u ([0, tu∗ ], Tv−u ), in which tvI = 1, the partial sample
path X∗v ([0, tv∗ ], Tv−u ) can be constructed to be identical to X∗u ([0, tu∗ ], Tv−u ) except
that all events occur one time slot earlier, i.e.,
Then,
Pr(X∗u ([0, tu∗ ], Tv−u ) | tvI = 1) = Pr(X∗v ([0, tv∗ ], Tv−u )). (7.17)
For the subtree Tu−v , X∗v ([0, tv∗ ], Tu−v ) can be constructed such that

    X∗v ([0, tv∗ ], Tu−v ) ∈ arg max_{X̃([0,tv∗ ],Tu−v ) ∈ X (tv∗ ,Tu−v )} Pr(X̃([0, tv∗ ], Tu−v ) | tuI = 1).    (7.18)
According to Lemma 7.1, the following inequality is satisfied:
Therefore, given the optimal sample path rooted at u, a sample path rooted at v can
be constructed, which occurs with a higher probability. The lemma holds.
The following lemma gives a useful property of the Jordan infection centers.
Lemma 7.3 On a tree network with at least one infected node, there exist at most
two Jordan infection centers. When the network has two Jordan infection centers,
the two must be neighbors.
The following theorem states that the sample path based estimator is one of the
Jordan infection centers.
Theorem 7.1 Consider a tree network with infinitely many levels. Assume that the
observed infection topology Y contains at least one infected node. Then the source
node associated with X∗ [0, t ∗ ] (the solution to the optimization problem (7.4)) is a
Jordan infection center.
Proof Assume the network has two Jordan infection centers: w and u, and assume
ẽ(w) = ẽ(u) = λ. Based on Lemma 7.3, w and u must be adjacent. The following
steps show that, for any a ∈ V \{w, u}, there exists a path from a to u (or w) along
which the infection eccentricity strictly decreases.
First, it is easy to see from Fig. 7.3 that d(γ , w) ≤ λ − 1 for all γ ∈ Tw−u ∩ I , and there exists a node ξ for which the equality holds. To see this, suppose instead that d(γ , w) ≤ λ − 2 for every γ ∈ Tw−u ∩ I , which implies

    d(γ , u) ≤ d(γ , w) + 1 ≤ λ − 1, ∀γ ∈ Tw−u ∩ I ; moreover, d(γ , w) ≤ λ implies d(γ , u) ≤ λ − 1, ∀γ ∈ Tu−w ∩ I .    (7.22)

In summary, ∀ γ ∈ I ,

    d(γ , u) ≤ λ − 1.    (7.23)
This contradicts the fact that ẽ(w) = ẽ(u) = λ. Therefore, there exists ξ ∈ Tw−u ∩ I
such that
d(ξ, w) = λ − 1. (7.24)
Similarly, ∀ γ ∈ Tu−w ∩ I ,

    d(γ , u) ≤ λ − 1, (7.25)

and there exists ξ ′ ∈ Tu−w ∩ I such that the equality holds. On the other hand, ∀ γ ∈ Tu−w ∩ I ,
Therefore,
ẽ(a) = λ + β, (7.28)
Repeatedly applying Lemma 7.2 along the path from node a to u shows that the optimal sample path rooted at node u is more likely to occur than the optimal sample path rooted at node a. Therefore, the root node associated with the optimal sample path X∗ [0, t ∗ ] must be a Jordan infection center. The theorem holds.
Zhu and Ying [200] further proposed a reverse infection algorithm to identify the
Jordan infection centers. The key idea of the algorithm is to let every infected node
broadcast a message containing its identity (ID) to its neighbors. Each node, after
receiving messages from its neighbors, checks whether the ID in the message has
been received. If not, the node records the ID (say v), the time at which the message
is received (say tv ), and then broadcasts the ID to its neighbors. When a node has received the IDs of all infected nodes, it claims itself as the propagation source and the algorithm terminates. If multiple nodes receive all the IDs at the same time, the tie is broken by selecting the node with the smallest tv . The details of the algorithm are presented in Algorithm 7.1.
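A centralized sketch of the reverse infection idea follows. Message receipt times are emulated with BFS distances, since the ID of infected node v reaches node u after d(u, v) synchronous rounds; breaking ties by the smallest total receipt time is our reading of the tie-break rule and should be treated as an assumption:

```python
from collections import deque

def bfs_distances(adj, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def reverse_infection(adj, infected):
    """The ID broadcast by infected node v reaches node u after d(u, v)
    rounds, so u has collected every ID after max_v d(u, v) rounds; the
    first node to finish (a Jordan infection center) is declared the source."""
    receipt = {u: [] for u in adj}        # receipt times of infected IDs at u
    for v in infected:
        for u, d in bfs_distances(adj, v).items():
            receipt[u].append(d)
    finish = {u: max(receipt[u]) for u in adj}
    first = min(finish.values())
    tied = [u for u in adj if finish[u] == first]
    return min(tied, key=lambda u: sum(receipt[u]))  # assumed tie-break rule

path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(reverse_infection(path, {1, 3}))  # 2
```

A node finishes exactly at its infection eccentricity, so the winner is a Jordan infection center, consistent with Theorem 7.1.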
Simulations on g-regular trees were conducted. The infection probability q was chosen uniformly from (0, 1), and the recovery probability p was chosen uniformly from (0, q). The duration t of the infection process was chosen uniformly from [3, 20]. The experimental results show that the detection rates of both the reverse infection and closeness centrality algorithms increase as the degree increases and are higher than 60% when g > 6.
Chapter 8
Source Identification Under Sensor Observations: A Gaussian Source Estimator
8.1 Introduction
We first introduce the network model used in [146]. The underlying network where
a malicious attack takes place is modeled by a finite, undirected graph G(V , E),
where the vertex/node set V has N nodes, and the edge set E has L edges (see
Fig. 8.1 for illustration). Assume that the graph G is known a priori. The propagation source, s ∗ ∈ G, is the vertex that initiates the propagation. Suppose s ∗ is a random variable with uniform distribution over the set V , so that any node in the network is equally likely to be the source.
The propagation process is modeled as follows. At time t, each vertex u ∈ G
presents one of two possible statuses: (a) infected, if it has already been attacked
from any of its neighbors; or (b) susceptible, if it has not been infected/attacked so
far. Let V (u) denote the set of vertices directly connected to u, i.e., the neighborhood
or vicinity of u. Suppose u is in the susceptible state and, at time tu , receives the
attack for the first time from one neighbor, say s, thus becoming infected. Then, u
will re-transmit the malicious attack to all its other neighbors, so that each neighbor
v ∈ V (u)\s receives the attack at time tu + θuv , where θuv denotes the random
propagation delay associated with edge (u, v). The random variables {θuv } for
different edges (u, v) have a known, arbitrary joint distribution. The propagation
process is initiated by the source s ∗ at an unknown time t = t ∗ .
Let O := {o1 , . . . , oK } ⊆ G denote the set of K observers/sensors, whose locations in the network G are known. Each observer measures from which neighbor and at what time it received the attack. Specifically, if tv,o denotes the absolute time at
which observer o receives the attack from its neighbor v, then the observation set is
composed of tuples of direction and time measurements, i.e., O := {(o, v, tv,o )}, for
all o ∈ O and v ∈ V (o).
Pinto et al. [146] recovered the source location from measurements taken at the observers by adopting a maximum localization probability criterion, which corresponds to designing an estimator ŝ(·) such that the localization probability Ploc := P(ŝ(O) = s ∗ ) is maximized. Since s ∗ is assumed to be uniformly distributed over the network G, the optimal estimator is the maximum likelihood (ML) estimator,

    ŝ ∈ arg max_{s∈G} P(O | s ∗ = s).    (8.1)
Consider the case of an underlying tree T . Because a tree does not contain cycles, only a subset Oa ⊆ O of the observers will receive the attack emitted by the unknown source; Oa = {o1 , . . . , oKa } is called the set of Ka active observers. The observations made by the nodes in Oa provide two types of information. (a) The first is the direction from which the attack arrives at the active observers, which uniquely determines a subtree Ta ⊆ T of regular nodes, called the active subtree (see the left figure in Fig. 8.2). (b) The second is the timing at which the attack arrives at the active observers, denoted by {tk }, k = 1, . . . , Ka , which is used to localize the source within Ta . It is also convenient to label the edges of Ta as E(Ta ) = {1, 2, · · · , Ea }, so that the propagation delay associated with edge i ∈ E(Ta ) is denoted by θi (see the left figure in Fig. 8.2). Assume that the propagation delays associated with the edges of T are independent, identically distributed random variables with Gaussian distribution N(μ, σ 2 ), where the mean μ and variance σ 2 are known. Based on these definitions, the following result can be concluded.
92 8 Source Identification Under Sensor Observations: A Gaussian Source Estimator
Fig. 8.2 (a) Active tree Ta . The vector next to each candidate source s is the normalized
deterministic delay µ̃s := µs /μ. The normalized delay covariance for this tree is à := A/σ 2 =
[5, 2; 2, 4]. (b) Equiprobability contours of the probability density function P(d|s ∗ = s) for all
s ∈ Ta , and the corresponding decision regions. For a given observation d, the optimal estimator
chooses the source s that maximizes P(d|s ∗ = s)
Proposition 8.1 The optimal estimator for the tree case is given by

    ŝ = arg max_{s∈Ta} µs T A−1 (d − (1/2) µs ),    (8.2)
where d is the observed delay, µs is the deterministic delay, and A is the delay covariance, given by

    [d]k = tk+1 − t1 ,    (8.3)
    [µs ]k = μ (|P (s, ok+1 )| − |P (s, o1 )|),    (8.4)
    [A]k,i = σ 2 |P (o1 , ok+1 ) ∩ P (o1 , oi+1 )|,    (8.5)

for k, i = 1, · · · , Ka − 1, with |P (u, v)| denoting the number of edges (i.e., the length) of the path connecting vertices u and v. Readers can refer to [146] for the detailed proof of this proposition.
Note that, when node s is chosen as the source, µs and A represent, respectively,
the mean and covariance of the observed delay d (see Fig. 8.1 for illustration).
Proposition 8.1 reduces the estimator in (8.1) to a tractable expression, where the
parameters can be simply obtained from path lengths in the tree T . Furthermore,
the complexity of Eqs. (8.2)–(8.5) scales as O(N ) with the number of nodes N in
the tree. The full proof of the complexity is given in the Supplemental Material in
[146]. The sparsity of the tree implies that the distance between observers is large, and so is the number of random variables in the sum

    dk = tk+1 − t1 = Σ_{i∈P (s ∗ ,ok+1 )} θi − Σ_{i∈P (s ∗ ,o1 )} θi .    (8.6)
Based on the central limit theorem, the observer delay vector d can be closely
approximated by a Gaussian random vector.
8.5 Source Estimator on a General Graph
Now consider the most general case of source estimation on a general graph
G. When the malicious attack is propagated on the network, there is a tree
corresponding to the first time each node gets informed, which spans all nodes
in G. Note that, the number of spanning trees can be exponentially large. Pinto
et al. [146] assume that the actual propagation tree is a breadth-first search (BFS)
tree. This corresponds to assuming that the attack travels from the source to each
observer along a minimum-length path, which is intuitively satisfying. Then, the resulting estimator can be written as

    ŝ = arg max_{s∈G} µs T As −1 (d − (1/2) µs ),    (8.7)

where µs and As are computed with respect to the BFS tree Tbf s,s rooted at s. It can be easily shown that the complexity of Eq. (8.7) scales polynomially with N , as O(N 3 ).
Therefore, Eqs. (8.2) and (8.7) give the propagation source estimators for a tree graph and a general graph, respectively. The computational complexity of these two estimators is O(N ) and O(N 3 ), respectively. We call these estimators Gaussian source estimators.
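The tree estimator can be sketched compactly. Everything in the example below is hypothetical (the tree, the observers, the arrival times, and the delay parameters); path lengths come from BFS, the path-intersection length |P(o1, a) ∩ P(o1, b)| on a tree is computed with the identity (d(o1,a) + d(o1,b) − d(a,b))/2, and the score µsᵀA⁻¹(d − µs/2) is evaluated with a small Gaussian-elimination solver:

```python
from collections import deque

def bfs_distances(adj, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gaussian_estimator(adj, observers, times, mu, sigma2):
    """Score every candidate s by mu_s^T A^{-1} (d - mu_s / 2) and return
    the maximizer (tree topology assumed)."""
    o1, rest = observers[0], observers[1:]
    d = [t - times[0] for t in times[1:]]
    dist = {o: bfs_distances(adj, o) for o in observers}
    # On a tree, |P(o1,a) ∩ P(o1,b)| = (d(o1,a) + d(o1,b) - d(a,b)) / 2.
    A = [[sigma2 * (dist[o1][a] + dist[o1][b] - dist[a][b]) / 2.0
          for b in rest] for a in rest]
    best, best_score = None, float("-inf")
    for s in adj:
        mus = [mu * (dist[o][s] - dist[o1][s]) for o in rest]
        x = solve(A, [dk - m / 2.0 for dk, m in zip(d, mus)])
        score = sum(m * xi for m, xi in zip(mus, x))
        if score > best_score:
            best, best_score = s, score
    return best

# Hypothetical tree 1-2, 2-3, 2-4, 4-5 with observers 1, 3, 5 reporting
# arrival times 1, 1, 2 (consistent with source 2 and unit mean delay).
tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
print(gaussian_estimator(tree, [1, 3, 5], [1.0, 1.0, 2.0], mu=1.0, sigma2=1.0))  # 2
```

With three observers the delay-difference vector is two-dimensional and the covariance is a 2×2 matrix, so the whole computation reduces to a handful of path-length lookups per candidate, in line with the O(N) complexity claim for trees.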
To test the effectiveness of the proposed approach, Pinto et al. [146] used the well-documented cholera outbreak that occurred in the KwaZulu-Natal province, South Africa, in 2000. Propagation source identification was performed
by monitoring the daily cholera cases reported in K communities (the observers).
The experimental results show that by monitoring only 20% of the communities,
the Gaussian estimator achieves an average error of less than four hops between
the estimated source and the first infected community. This small distance error
may enable a faster emergency response from the authorities in order to contain an
outbreak.
Chapter 9
Comparative Study and Numerical
Analysis
Shah and Zaman [160, 161] introduced rumor centrality for source identification.
They assume that information spreads in tree-like networks and the information
propagation follows the SI model. They also assume each node receives information
from only one of its neighbors. Since complete observations of the network are considered, the source node must be one of the infected nodes. This method was proposed for the propagation of rumors originating from a single source. Assuming an infected node is the source, its rumor centrality is defined as the number of distinct propagation paths originating from that node. The node with the maximum rumor centrality is called the rumor center. For regular trees, the rumor center is considered the propagation origin. For generic networks, researchers employ BFS trees to represent the original networks. Each BFS tree is associated with the probability ρ that a rumor chooses this tree as its propagation path. In this case, the estimated source is the node that maximizes the product of its rumor centrality and ρ.
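For a tree, rumor centrality admits the closed form R(v) = N!/∏u T_u^v, where T_u^v is the size of the subtree rooted at u when the tree is rooted at v (the root's own subtree has size N). A minimal sketch on a hypothetical five-node tree:

```python
from math import factorial

def subtree_sizes(adj, root):
    """Sizes of all subtrees when the tree is rooted at `root` (iterative DFS)."""
    order, parent, stack = [], {root: None}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)
    size = {u: 1 for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    return size

def rumor_centrality(adj, v):
    """Number of distinct propagation orders starting from v:
    R(v) = N! divided by the product of all subtree sizes."""
    size = subtree_sizes(adj, v)
    prod = 1
    for u in adj:
        prod *= size[u]
    return factorial(len(adj)) // prod

def rumor_center(adj):
    return max(adj, key=lambda v: rumor_centrality(adj, v))

# Hypothetical tree with edges 1-2, 2-3, 2-4, 4-5:
tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
print(rumor_center(tree), rumor_centrality(tree, 2))  # 2 12
```

On this tree the rumor center coincides with the distance center, as Theorem 6.1 guarantees for trees.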
In essence, the method seeks the node from which the propagation best matches the complete observation. As proven in [160, 161], the rumor center is equivalent to the closeness center for a tree-like network. However, for a generic network, the closeness center may not equal the rumor center. The effectiveness of the method
is further examined by the work in [162]. The authors proved that the rumor center
method can still provide guaranteed accuracy when relaxing two assumptions: the
exponential spreading time and the regular trees. This method was further explored
in the snapshot scenario, in which each node reveals whether it has been infected with probability μ [89]. When μ is large enough, the authors proved that the accuracy of the rumor center method can still be guaranteed. Wang et al. [178] extended the
discussion of the single rumor center into a more complex scenario with multiple
snapshots. Although a snapshot only provides partial knowledge of rumor spreading, the authors proved that multiple independent snapshots could dramatically improve detection accuracy compared with temporally sequential snapshots. The analysis in [178] suggested that the complete observation of rumor propagation could be approximated by multiple independent snapshots.
The rumor center method rests on several strong assumptions that are far from reality. First, it considers a very special class of networks: infinite trees. Generic networks have to be reconstructed into BFS trees before seeking propagation origins. Second, rumors are implicitly assumed to spread in a unicast way (i.e., an infectious node can only infect one of its neighbors at each time step). Third, the infection probability between neighboring nodes is equal to 1. In the real world, however, networks are far more complex than trees, rumors often spread in multicast or broadcast ways, and the infection probabilities between neighboring nodes differ from each other.
Following the assumptions of the rumor center method, Dong et al. [44] proposed a local rumor center method to identify rumor sources. This method designates a set of nodes as suspicious sources, thereby reducing the scale of the search for origins. They
extended the approaches and results in [160] and [161] to identify the source of
propagation in networks. Following the definition of the rumor center, they defined
the local rumor center as the node with the highest rumor centrality compared to
other suspicious infected nodes. The local rumor center is considered as the rumor
source.
For regular trees with every node having degree d, the authors analyze the
accuracy γ of the local rumor center method. To construct a regular tree, the degree
d of each node should be at least 2. For regular trees, Dong et al. [44] derived the
following conclusions. (1) When d = 2, the accuracy of the local rumor center method follows O(1/√n), where n is the number of infected nodes. Therefore,
when n is sufficiently large, the accuracy is close to 0. (2) When the suspicious set
degenerates into the entire network, the accuracy γ grows from 0.25 to 0.307 as
d increases from 3 to +∞. This means that the minimum accuracy γ is 25% and
the maximum accuracy is 30.7%. (3) When the suspicious nodes form a connected
subgraph of the network, the accuracy γ significantly exceeds 1/k when d = 3,
where k is the number of suspicious nodes. (4) When there are only two suspect
nodes, the accuracy γ is at least 0.75 if d = 3, and γ increases with the distance
between the two suspects. (5) When multiple suspicious nodes form a connected
subgraph, the accuracy γ is lower than when these nodes form several disconnected
subgraphs.
The local rumor center is actually the node with the highest rumor centrality in the a priori set of suspects. The advantage of the local rumor center method is
that it dramatically reduces the source-searching scale. However, it has the same
drawbacks as the single rumor center method.
Luo et al. [115] extended the single rumor center method to identify multiple
sources. In addition to the basic assumptions, they further assumed the number of
sources was known for the method of identifying multiple rumor centers. Based on
the definition of rumor centrality for a single node, Luo et al. [115] extended rumor
centrality to a set of nodes, which is defined as the number of distinct propagation
paths originating from the set. They proposed a two-source estimator to compute
the rumor centrality when there were only two sources. For multiple sources, they
proposed a two-step method. In the first step, they assumed a set of infected nodes
as sources. All infected nodes were divided into different partitions by using the
Voronoi partition algorithm [76] on these sources. The single rumor center method
was then employed to identify the source in each partition. In the second step,
estimated sources were calibrated by the two-source estimator between any two
neighboring partitions. These two steps were iterated until the estimated sources became stable.
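The partition step of the two-step method can be sketched as follows. The path graph is a hypothetical example, and breaking distance ties toward the earlier source in the list is an assumption rather than part of the published algorithm:

```python
from collections import deque

def bfs_distances(adj, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def voronoi_partition(adj, infected, sources):
    """Assign each infected node to the closest current source estimate."""
    dist = {s: bfs_distances(adj, s) for s in sources}
    parts = {s: [] for s in sources}
    for v in infected:
        nearest = min(sources, key=lambda s: dist[s].get(v, float("inf")))
        parts[nearest].append(v)
    return parts

# Hypothetical path 1-2-3-4-5 with current source estimates 1 and 5:
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(voronoi_partition(path, [1, 2, 3, 4, 5], [1, 5]))
# {1: [1, 2, 3], 5: [4, 5]}
```

In the full method, the single-source estimator is rerun inside each partition and the partition is recomputed, iterating until the estimates stop changing.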
Luo et al. [115] were the first to employ the rumor center method to identify multiple rumor sources. They further investigated the performance of the two-source estimator on geometric trees [161]: the accuracy approaches 1 as the infection graph becomes large. This method has also been extended to identify
multiple sources with snapshot observations. Because snapshots only provide partial
knowledge about the spreading dynamics of rumors in networks, Zang et al. [197] introduced a score-based method to assess the states of unobserved nodes, which indirectly forms a complete observation of the network.
According to the definition of the rumor centrality of a set of nodes, we need to calculate the number of distinct propagation paths originating from the node set, which is too computationally expensive. Even though Luo et al. proposed a two-step method to reduce the complexity, it still needs O(N^k) computations, where k is the number of rumor sources. This method can hardly be used in the real world, especially for large-scale networks.
Prakash et al. [147, 148] proposed a minimum description length (MDL) method
for source identification. This method is designed for generic networks. They assumed rumor propagation follows the SI model. Given an arbitrary infected node as the source node, the minimum description length corresponds to the probability of obtaining the infection graph. For generic networks, it is too computationally
expensive to obtain the probability. Instead, Prakash et al. [148] introduced an
upper bound of the probability and detected the origin by maximizing the upper
bound. They claimed that to maximize the upper bound is to find the smallest
eigenvalue λmin and the corresponding eigenvector umin of the Laplacian matrix
of the infection graph. The Laplacian matrix is widely used in spectral graph theory
and has many applications in various fields. This matrix is mathematically defined
as L = D −A, where D is the diagonal degree matrix and A is the adjacency matrix.
According to Prakash et al.'s work in [147, 148], the node with the largest score in the eigenvector umin corresponds to the propagation source.
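Constructing the Laplacian that the method operates on is straightforward. The sketch below builds L = D − A for the subgraph induced by the infected set; the triangle graph is a hypothetical example:

```python
def infection_laplacian(nodes, adj):
    """L = D - A for the subgraph induced by `nodes` (e.g., the infected set)."""
    nodes = list(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    L = [[0] * len(nodes) for _ in nodes]
    for v in nodes:
        for u in adj[v]:
            if u in idx:                 # keep only edges inside the subgraph
                L[idx[v]][idx[v]] += 1   # degree on the diagonal
                L[idx[v]][idx[u]] -= 1   # -1 for each adjacency
    return L

# Hypothetical triangle 1-2-3:
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(infection_laplacian([1, 2, 3], adj))
# [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]
```

The eigen-analysis itself (finding λmin and umin) then runs on this matrix; in practice a numerical library would be used for that step.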
This method can also be used to seek multiple sources. The authors adopted the
minimum description length (MDL) cost function [75]. This was used to evaluate
the ‘goodness’ of a node being in the source set. To search the next source node, they
first removed the previous source nodes from the infected set. Then, they replayed
the process of searching the single source in the remaining infection graph. These
two steps were iterated until the MDL cost function stopped decreasing.
Due to the high complexity of computing matrix eigenvalues, generally O(N 3 ), the MDL method is not suitable for identifying sources in large-scale networks. Moreover, the number of true sources is generally unknown. Further, the gap between the upper bound and the real value of the probability has not been studied, and therefore the accuracy of this method is not guaranteed.
Fioriti et al. [58] introduced the dynamic age method for source identification in
generic networks. The assumption for this method is the same as the MDL method.
Fioriti et al. took advantage of the correlation between the eigenvalues and the 'age' of a node in a network: the 'oldest' nodes, associated with the largest eigenvalues, were considered the sources of a propagation [199]. Meanwhile, they utilized the dynamical importance of a node in [150], which essentially calculates the reduction of the largest eigenvalue of the adjacency matrix after a node has been removed. A large reduction after the removal of a node implies that the node is relevant to the 'aging' of a propagation. By combining these two techniques, Fioriti et al. proposed the concept of the dynamical age of an arbitrary node i as follows:

    DA(i) = (λm − λim )/λm ,

where λm is the maximum eigenvalue of the adjacency matrix, and λim is the maximum eigenvalue of the adjacency matrix after node i is removed. The nodes with the highest dynamic age are considered the sources.
This method is essentially different from the MDL method: the MDL method finds the smallest eigenvalues and corresponding eigenvectors of Laplacian matrices, while the dynamic age method finds the largest eigenvalues of the adjacency matrix.
Similar to the MDL method, the dynamic age method is not suitable for
identifying sources in large-scale networks due to the complexity of calculating
eigenvectors. Moreover, since there is no threshold to determine the oldest nodes,
the number of source nodes is uncertain.
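The dynamic age computation can be sketched with power iteration. The formula DA(i) = (λm − λim)/λm is the normalized eigenvalue drop described above; the star graph is a hypothetical example, and iterating on A + I is an implementation choice to avoid oscillation on bipartite graphs:

```python
def largest_eigenvalue(A, iters=300):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix via power
    iteration on A + I (the shift avoids oscillation on bipartite graphs)."""
    n = len(A)
    if n == 0:
        return 0.0
    x = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        y = [xi + sum(a * xj for a, xj in zip(row, x)) for row, xi in zip(A, x)]
        lam = max(y)
        x = [v / lam for v in y]
    return lam - 1.0

def dynamic_age(nodes, adj):
    """DA(i) = (lam_m - lam_m^i) / lam_m: relative drop of the largest
    adjacency eigenvalue when node i is removed."""
    def adjacency(keep):
        idx = {v: k for k, v in enumerate(keep)}
        A = [[0.0] * len(keep) for _ in keep]
        for v in keep:
            for u in adj[v]:
                if u in idx:
                    A[idx[v]][idx[u]] = 1.0
        return A
    lam = largest_eigenvalue(adjacency(list(nodes)))
    return {i: (lam - largest_eigenvalue(
                adjacency([v for v in nodes if v != i]))) / lam
            for i in nodes}

# Hypothetical star graph: removing the hub destroys every edge, so the
# hub has the largest dynamic age.
star = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}
ages = dynamic_age([1, 2, 3, 4], star)
```

For the star, λm = √3; removing the hub drops it to 0 (dynamic age 1), while removing a leaf only drops it to √2, illustrating why the hub is flagged as the 'oldest' node.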
Zhu and Ying [201] proposed the Jordan center method for rumor source identification. They assumed rumors propagate in tree-like networks and the propagation follows the SIR model. All infected nodes are given, but the susceptible and recovered nodes are indistinguishable. This method was proposed for single-source propagation. Zhu and Ying [201] proposed a sample path based approach to identify the propagation source. An optimal sample path is the one that most likely leads to the observed snapshot of a network. The source associated with the optimal sample path was proven to be the Jordan center of the infection graph, which is then considered the rumor source.
Zhu and Ying [202] further extended the sample path based approach to the heterogeneous SIR model, in which the infection probabilities between any two neighboring nodes are different and the recovery probabilities of infected nodes differ from each other. They proved that on infinite trees, the source
node associated with the optimal sample path was also the Jordan center. Moreover,
Luo et al. [114, 116] investigated the sample path based approach in the SI and SIS
models. They obtained the same conclusion as in the SIR model.
Similar to rumor center based methods, the Jordan center method is considered on infinite tree-like networks, which are far from real-world networks.
Lokhov et al. [111] proposed a dynamic message-passing (DMP) method for source identification and claimed that it dramatically outperformed the previous centrality based methods (e.g., rumor center and Jordan center based methods).
An important prerequisite of the DMP method is that the propagation time t must be known; however, it is generally unknown. Besides, the computational complexity of this method is O(tN²d), where N is the number of nodes in the network and d is the average degree. If the underlying network is strongly connected, it will be computationally expensive to use the DMP method to identify the propagation source.
Fig. 9.2 Illustration of wavefronts in the shortest path tree of node v. Readers can refer to the work "The Hidden Geometry of Complex, Network-driven Contagion Phenomena" [24] for the details of the wavefronts
Distinct from the DMP method, which adopts the message-passing propagation model (see Sect. 9.1.2.2), Altarelli et al. [7] proposed using the Bayesian belief propagation model to compute the probability of each node being in any state. This method can work with different types of observations and in different propagation scenarios; however, guaranteed accuracy is obtained only in tree-like networks. The method consists of three steps. First, the propagation of rumors is represented by the SI, SIR, or other isomorphic models [176]. Second, given an observation of the infection
of a network, either through a group of sensors or a snapshot at an unknown time, the
belief propagation equations are derived for the posterior distribution of past states
on all network nodes. By constructing a factor graph based on the original network,
these equations provide the exact computation of the posterior marginals in the models.
Third, belief propagation equations are iterated with time until they converge. Nodes
are then ranked according to the posterior probability of being the source.
This method provides exact identification of the source in tree-like networks. It is also effective for synthetic and real networks with cycles, both in a static and a dynamic context, and for more general networks, such as DTNs [204]. The method relies on the belief propagation model, which allows it to be used with different observations and in various scenarios.
Agaskar and Lu [2] proposed a fast Monte Carlo method for source identification in
generic networks. They assume propagation follows the heterogeneous SI model in
which the infection probabilities between any two neighboring nodes are different.
In addition, the observation of sensors is obtained in a fixed time window. This
method consists of two steps. In the first step, assuming an arbitrary node as the
source, they introduce an alternate representation for the infection process initiated
from the source. The alternate representation is derived in terms of the infection
time of each edge. Based on the alternate representation, they sample the infection
time for each sensor. In the second step, they compute the gap between the observed
infection time and the sampled infection time of sensors. They further use the Monte
Carlo approach to approximate the gap. The node which can minimize the gap is
considered as the propagation origin.
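The two-step procedure above can be sketched as follows. This is a simplified discrete-time caricature of the heterogeneous SI sampling, not Agaskar and Lu's actual algorithm: each directed edge (u, v) carries its own infection probability p[(u, v)], sensor infection times are averaged over Monte Carlo trials, and the candidate minimizing the squared gap to the observed times is returned. All names and the graph representation are assumptions for illustration.

```python
import random
from collections import deque

def sample_sensor_times(adj, p, source, sensors, trials=200, rng=random):
    """Monte Carlo estimate of each sensor's mean infection time under a
    simplified discrete-time heterogeneous SI spread from `source`."""
    totals = {s: 0.0 for s in sensors}
    for _ in range(trials):
        t_inf = {source: 0}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                # each directed edge (u, v) has its own infection probability
                if v not in t_inf and rng.random() < p[(u, v)]:
                    t_inf[v] = t_inf[u] + 1
                    frontier.append(v)
        for s in sensors:
            totals[s] += t_inf.get(s, len(adj))  # penalty if never reached
    return {s: totals[s] / trials for s in sensors}

def estimate_source(adj, p, observed, trials=200):
    """Return the candidate node minimizing the squared gap between the
    observed and the sampled sensor infection times."""
    best, best_gap = None, float("inf")
    for u in adj:
        sampled = sample_sensor_times(adj, p, u, list(observed), trials)
        gap = sum((observed[s] - sampled[s]) ** 2 for s in observed)
        if gap < best_gap:
            best, best_gap = u, gap
    return best
```

With all edge probabilities set to 1 the sketch degenerates to matching BFS distances, which makes its behavior easy to check on a small path graph.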
The computational complexity of this method is O(LN log(N)/ε), where L is the number of sensor nodes, and ε is the assumed error. This complexity is lower than that of other source identification methods, which are normally O(N²) or even O(N³).
When sampling the infection time for each edge, Agaskar and Lu [2] assume that information always spreads along the shortest paths to other nodes. However, in the real world, information generally reaches other nodes by a random walk. Therefore, this method may not be suitable for other propagation schemes, such as random spreading or multicast spreading.
Xie et al. proposed a post-mortem technique on traffic logs to seek the origin of a worm (a kind of computer virus) [191]. There are four assumptions for this technique. First, it focuses on scanning worms [181]. This kind of worm spreads on the Internet by exploiting OS vulnerabilities, and victims proceed to scan the whole IP space for further vulnerable hosts. Famous examples include Code Red [206] and Slammer [126]. Second, the infection logs from sensors cover the majority of the propagation process. Third, the worm propagation forms a tree-like structure from its origin. Last, the attack flows of a worm do not use spoofed source IP addresses. Based on the traffic logs, the network communication between end-hosts is modeled as a directed host contact graph. Propagation paths are then created by sampling edges from the graph according to the timestamps of the corresponding logs. The creation of each path stops when there is no contiguous edge within t seconds to continue the path. As the sampling is performed, a count is kept of how many times each edge of the contact graph is traversed. If the worm propagation follows a tree-like structure, the edge with the maximum count will most likely be at the top of the tree. The start of this directed edge is considered the propagation source.
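The edge-sampling idea can be made concrete with a toy sketch of the random moonwalk procedure: walks start at random flows in the log and repeatedly hop backwards in time to a random earlier flow that arrived at the current source host; the most-traversed flow is likely near the top of the causal tree. The (time, src, dst) log format and all names are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import defaultdict

def random_moonwalks(flows, delta_t, walks=1000, rng=random):
    """Toy random moonwalk sampling over a host contact graph.
    `flows` is a list of (time, src, dst) log records.  Each walk starts
    at a random flow and repeatedly hops to a random earlier flow that
    arrived at the current source host within `delta_t` seconds."""
    incoming = defaultdict(list)     # destination host -> flows into it
    for f in flows:
        incoming[f[2]].append(f)
    counts = defaultdict(int)        # how often each flow is traversed
    for _ in range(walks):
        cur = rng.choice(flows)
        while True:
            counts[cur] += 1
            t, src, _dst = cur
            prev = [f for f in incoming[src] if t - delta_t <= f[0] < t]
            if not prev:             # no contiguous earlier flow: stop
                break
            cur = rng.choice(prev)
    # the most-traversed flow is likely near the top of the causal tree
    return max(counts, key=counts.get)
```

On a toy log where host A infects B, then B and A's other contacts spread further, the walks concentrate on the earliest A→B flow, whose start host A is reported as the origin.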
There are issues with this technique that need further analysis. First, it is reasonable to assume worms do not use IP spoofing: in the real world, the overwhelming majority of worm traffic involved in propagation is initiated by victims instead of the original attacker, so spoofed IP addresses would only decrease the number of successful attacks without providing further anonymity to the attacker. Second, IP traceback techniques [157] are related to Moonwalk and the other methods discussed in this article. However, traceback on its own is not sufficient to track worms to their origin, as it only determines the true source of the IP packets received by a destination. In an epidemic attack, the source of these packets is almost never the origin of the attack, but just one infected victim. The methods introduced in this article are still needed to find the hosts higher up in the propagation causal trees. Third, this method relies only on traffic logs, which gives it the ability to work without any a priori knowledge about the worm attack.
Nowadays, the number of scanning worms has largely decreased due to advances in OS development and security techniques [189]. Therefore, the usage of Moonwalk, which can only seek the propagation origin of scanning worms, is largely limited. Moreover, a full collection of infection logs is hardly achievable in the real world. Finally, current computer viruses are normally distributed by botnets [205]. Moonwalk, which can only seek a single origin, may not be helpful in this scenario.
9.2 Numerical Analysis
Seo et al. [158] proposed a four-metric source estimator to identify a single source node in directed networks. They assume propagation follows the SI model. The sensor nodes that have transitioned from the susceptible state to the infected state are regarded as positive sensors; the others are considered negative sensors. Seo et al. [158] use the intuition that the source node must be close to the positive sensor nodes, but far away from the negative sensor nodes. They propose four metrics to locate the source. First, they find the set of nodes that can reach all positive sensors. Second, they filter this set by choosing the nodes with the minimum sum of distances to all positive sensor nodes. Third, they further choose the nodes that can reach the minimum number of negative sensor nodes. Finally, the node that satisfies all of the above three metrics and has the maximum sum of distances to all negative sensor nodes is considered the source node.
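The four successive filters might be sketched as follows on a toy directed graph; the graph representation and helper names are assumptions for illustration, and distances are plain BFS hop counts rather than weighted shortest paths.

```python
from collections import deque

def bfs(adj, u):
    """Hop distances from u along directed edges."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def four_metric_source(adj, positive, negative):
    dists = {u: bfs(adj, u) for u in adj}
    # Metric 1: keep nodes that can reach every positive sensor.
    cand = [u for u in adj if all(s in dists[u] for s in positive)]
    # Metric 2: minimum total distance to the positive sensors.
    key2 = lambda u: sum(dists[u][s] for s in positive)
    m2 = min(map(key2, cand))
    cand = [u for u in cand if key2(u) == m2]
    # Metric 3: reach the fewest negative sensors.
    key3 = lambda u: sum(1 for s in negative if s in dists[u])
    m3 = min(map(key3, cand))
    cand = [u for u in cand if key3(u) == m3]
    # Metric 4: maximum total distance to the reachable negative sensors.
    return max(cand, key=lambda u: sum(dists[u].get(s, 0) for s in negative))
```

Precomputing one BFS per node, as above, is what gives the estimator its roughly cubic overall cost on dense networks.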
Seo et al. [158] studied and compared different methods of choosing sensors,
such as randomly choosing (Random), choosing the nodes with high betweenness
centrality values (BC), choosing the nodes with a large number of incoming edges
(NI), and choosing the nodes which are at least d hops away from each other (Dist).
Different sensor selection methods produce different sets of sensor nodes, and have
different accuracies in source identification. They show that the NI and BC sensor
selection methods outperform the others.
The four-metric source estimator needs to compute the shortest paths between the sensors and every potential source. Generally, the computational complexity is O(N³), which is too expensive for large networks.
From Sect. 9.1, we see that some existing methods of source identification are designed for tree-like networks.
Fig. 9.4 Crosswise comparison of existing methods on two synthetic networks. (a) 4-regular tree,
(b) Small-world network
From Sect. 9.1, we note that some existing methods of source identification are based on the assumption that information propagates along the BFS trees in networks, i.e., that propagation follows the broadcast scheme. However, in the real world, propagation may follow various schemes. We focus on the three most common propagation schemes: snowball, random walk and contact process [34]. Their definitions are given below.
Fig. 9.5 The impact of network topologies. (a) Random tree, (b) Small-world network
Fig. 9.6 Illustration of different propagation schemes. The black node stands for the source.
The numbers indicate the hierarchical sequence of nodes getting infected. (a) Random walk, (b)
Contact process, (c) Snowball
• Random Walk: A node can deliver a message randomly to one of its neighbors.
• Contact Process: A node can deliver a message to a group of its neighbors that
have expressed interest in receiving the message.
• Snowball Spreading: A node can deliver a message to all of its neighbors.
An illustration of these three propagation schemes is shown in Fig. 9.6. We examine
different propagation schemes on both regular trees and small-world networks.
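The three schemes can be sketched as variants of one discrete-time simulation loop. This toy implementation is only meant to make the definitions concrete; the function name and the fan-out parameter k used for the contact process are assumptions.

```python
import random

def spread(adj, source, steps, scheme, k=2, rng=random):
    """One discrete-time run of a propagation scheme from `source`:
    'snowball' infects all susceptible neighbors, 'random_walk' one
    random neighbor, 'contact' a random subset of up to k neighbors."""
    infected = {source}
    for _ in range(steps):
        new = set()
        for u in infected:
            nbrs = [v for v in adj[u] if v not in infected]
            if not nbrs:
                continue
            if scheme == "snowball":
                new.update(nbrs)
            elif scheme == "random_walk":
                new.add(rng.choice(nbrs))
            elif scheme == "contact":
                new.update(rng.sample(nbrs, min(k, len(nbrs))))
        infected |= new
    return infected
```

On a star graph, one snowball step infects every leaf, while one random-walk step infects exactly one, which is why the estimators behave so differently under the two schemes.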
Figure 9.7a shows the experiment results of the methods when propagation follows the random-walk propagation scheme on a 4-regular tree. It is clear that the Gaussian source estimator outperforms the others, with estimated sources around 1–2 hops away from the true sources. The performances of the rumor center method, the dynamic age method and the Jordan center method are similar to each other, with estimated sources around 5 hops away from the true sources. The DMP method has the worst performance. Figure 9.8a shows the experiment results of the methods when propagation follows the contact-process propagation scheme on a 4-regular tree. The results in Figs. 9.7a and 9.8a are clearly similar, which means the methods perform similarly under both the random-walk and contact-process
Fig. 9.7 The impact of propagation schemes: random-walk scheme. (a) 4-regular tree, (b) Small-
world network
Fig. 9.8 The impact of propagation schemes: contact-process scheme. (a) 4-regular tree, (b)
Small-world network
propagation schemes. Figure 9.9a shows the experiment results of the methods when propagation follows the snowball propagation scheme on a 4-regular tree. The results differ greatly from those of the previous two propagation schemes. The DMP method and the Jordan center method outperformed the others, with estimated sources around 1–2 hops away from the true sources. The rumor center method and the Gaussian method also showed good performance, with estimated sources around 2–3 hops away from the true sources. The dynamic age method had the worst performance.
The experiment results of the methods under different propagation schemes on a small-world network are shown in Figs. 9.7b, 9.8b, and 9.9b. The results are dramatically different from those on the 4-regular
Fig. 9.9 The impact of propagation schemes: snowball scheme. (a) 4-regular tree, (b) Small-world
network
tree. From Fig. 9.7b we can see that the Gaussian source estimator obtains the best performance, followed by the DMP method. The rumor center method, the dynamic age method and the Jordan center method perform no better than choosing sources at random. From Fig. 9.8b, it is clear that the Jordan center method, the DMP method and the Gaussian method show similar performances; these three methods outperform the others. From Fig. 9.9b we can see that the Jordan center method outperforms the others, with estimated sources around 1 hop away from the true sources. The sources estimated using the DMP method are around 1–2 hops away from the true sources. The Gaussian source estimator has the worst performance.
From the experiment results, we see the source identification methods are also
sensitive to propagation schemes. The methods of source identification show better
performance when propagation follows the snowball propagation scheme rather
than the random-walk or contact-process propagation schemes.
Fig. 9.10 The impact of infection probability. (a) q = 0.5, (b) q = 0.95
Fig. 9.11 Sample topologies of two real-world networks. (a) Enron email network, (b) Power grid
network
The DMP method outperforms the other methods, with estimated sources around 1 hop away from the true sources. The dynamic age method and the Gaussian method have the worst performance.
From the experiment results, we can see that only the DMP method is sensitive to the infection probability; it performs better when the infection probability is lower. The other methods show only slight differences in performance under various infection probabilities.
Fig. 9.12 Source identification methods applied on real networks. (a) Enron email, (b) Power grid
9.3 Summary
10.1 Introduction
Rumor spreading in social networks has long been a critical threat to our society
[143]. Nowadays, with the development of mobile devices and wireless techniques,
the temporal characteristic of social networks (time-varying social networks) has
deeply influenced the dynamic information diffusion process occurring on top of
them [151]. The ubiquity and easy access of time-varying social networks not only
promote the efficiency of information diffusion but also dramatically accelerate the
speed of rumor spreading [91, 170].
For either forensic or defensive purposes, it has always been an important task to identify the source of rumors in time-varying social networks [42]. However,
the existing techniques for rumor source identification generally require firm
connections between individuals (i.e., static networks), so that administrators can
trace back along the determined connections to reach the diffusion sources. For
example, many methods rely on identifying spanning trees in networks [160, 178],
then the roots of the spanning trees are regarded as the rumor sources. The firm
connections between users are the premise of constructing spanning trees in these
methods. Some other methods detect rumor sources by measuring node centralities,
such as degree, betweenness, closeness, and eigenvector centralities [146, 201]. The
individual who has the maximum centrality value is considered as the rumor source.
All of these centrality measures are based on static networks. Time-varying social
networks, where the involved users and interactions always change, have led to great
challenges to the traditional rumor source identification techniques.
In this chapter, a novel source identification method is proposed to overcome these challenges. It consists of the following three steps: (1) To represent a time-varying social network, we reduce it to a sequence of static networks, each aggregating all edges and nodes present in a time-integrating window. This is the case, for instance, for rumors spreading in Bluetooth networks, for which a fine-grained temporal resolution is not available and whose spreading can be studied through different integrating windows t (e.g., t could be minutes, hours, days or even months). In each integrating window, if users did not activate the Bluetooth on their devices (i.e., offline), or if they moved out of the Bluetooth coverage of their communities (i.e., physical mobility), they would not receive or spread the rumors. (2) Similar to the detective routine in criminology, a
small set of suspects will be identified by adopting a reverse dissemination process
to narrow down the scale of the source seeking area. The reverse dissemination
process distributes copies of rumors reversely from the users whose states have been
determined based on various observations upon the networks. The ones who can
simultaneously receive all copies of rumors from the infected users are supposed
to be the suspects of the real sources. (3) To determine the real source from the
suspects, we employ a microscopic rumor spreading model to analytically estimate
the probabilities of each user being in different states in each time window. Since
this model allows the time-varying connections among users, it can feature the
dynamics of each user. More specifically, assuming any suspect as the rumor source,
we can obtain the probabilities of the observed users to be in their observed states.
Then, for any suspect, we can calculate the maximum likelihood (ML) of obtaining
the observation. The one who can provide the maximum ML will be considered as
the real rumor source.
In this section, we introduce the primer for rumor source identification in time-
varying social networks, including the features of time-varying social networks, the
state transition of users when they hear a rumor, and the categorization of partial
observations in time-varying social networks.
The essence of social networks lies in their time-varying nature. For example, the neighborhood of individuals moving over a geographic space evolves over time
Fig. 10.1 Example of a rumor spreading in a time-varying network. The rumor starts at the black node, and can travel over the links depicted as line arrows in the time windows. Dashed lines represent links that are present in the system in each time window
(i.e., physical mobility), and the interaction between the individuals appears and
disappears in online social networks (i.e., online/offline) [151]. Time-varying social
networks are defined by an ordered stream of interactions between individuals. In
other words, as time progresses, the interaction structure keeps changing. Examples
can be found in both face-to-face interaction networks [27], and online social
networks [170]. The temporal nature of such networks has a deep influence on
information spreading on top of them. Indeed, the spreading of rumors is affected
by duration, sequence, and concurrency of contacts among people.
Here, we reduce time-varying networks to a series of static networks by
introducing a time-integrating window. Each integrating window aggregates all
edges and nodes present in the corresponding time duration. In Fig. 10.1, we show
an example to illustrate the time-integrating windows. In the time window t − 1 (or,
at time t − 1), a rumor started to spread from node S who had interaction with five
neighbors in this time window. In the next time window t, nodes B, D and F were
successfully infected. In this time window, we notice that node O moved next to
B (i.e., physical mobility), and node G had no interaction with its neighbors (i.e.,
offline). Other examples of physical mobility or online/offline status of nodes can be
found in the time window t + 1.
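The reduction to time-integrating windows can be sketched as follows, assuming contacts arrive as (timestamp, u, v) tuples; the function name and the fixed-width windowing are illustrative assumptions.

```python
from collections import defaultdict

def integrate_windows(events, t_start, width, n_windows):
    """Reduce a stream of timestamped contacts (t, u, v) to a sequence of
    static undirected networks, one adjacency dict per integrating window."""
    windows = [defaultdict(set) for _ in range(n_windows)]
    for t, u, v in events:
        idx = int((t - t_start) // width)
        if 0 <= idx < n_windows:
            windows[idx][u].add(v)
            windows[idx][v].add(u)
    return windows

# Contacts over two hours aggregated into hourly windows (width = 3600 s).
events = [(10, "S", "B"), (50, "S", "D"), (3700, "B", "O")]
wins = integrate_windows(events, 0, 3600, 2)
print(sorted(wins[0]["S"]))  # → ['B', 'D']
```

A node with no contacts in a window (an offline user, or one who moved away) simply has an empty neighborhood there and can neither receive nor spread the rumor in that window.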
Figure 10.2 shows the state transition graph of an arbitrary user in this model. Every user is initially susceptible (Sus.). They can be infected (Inf.) by their neighbors with probability v(i, t), and then recover (Rec.) with probability q(i). Rumors are spread from infected users to their social neighbors until the users recover. There are also many other models of rumor propagation, including the SI, SIS and SIRS models [117, 128]. In the present work, we adopt the SIR model because it reflects the state transition of users when they hear a rumor, from being susceptible to being recovered. Generally, people will not believe a rumor again after they learn the truth; therefore, recovered users will not change their states any more. For other propagation models, readers can refer to Sect. 10.6 for further discussion.
To more precisely describe node states under different types of observations, we introduce two sub-states of infected nodes: 'contagious' (Con.) and 'misled' (Mis.), see Fig. 10.2. An infected node first becomes contagious and then transits to being misled. The Con. state describes newly infected nodes: a node being Con. at time t means this node was susceptible at time t − 1 but becomes infected at time t. A misled node stays infected until it recovers. For instance, sensors can record the time at which they get infected, and the infection time is crucial in detecting rumor sources because it reflects the infection trend and speed of a rumor. Hence, the introduction of the contagious and misled states is intrinsic to the rumor spreading framework.
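The resulting per-node state machine (Sus. → Con. → Mis. → Rec.) can be sketched as a single transition function; letting a contagious node recover with the same probability q as a misled one is a simplifying assumption of this sketch, and the names are illustrative.

```python
import random

def step_state(state, infect_prob, q, rng=random):
    """One window's transition for a single user: Sus -> Con on infection,
    Con -> Mis in the next window, Mis -> Rec with probability q.
    Letting a contagious node recover with the same q is a simplification."""
    if state == "Sus":
        return "Con" if rng.random() < infect_prob else "Sus"
    if state == "Con":
        return "Rec" if rng.random() < q else "Mis"
    if state == "Mis":
        return "Rec" if rng.random() < q else "Mis"
    return "Rec"  # recovered users never change state again
```

The one-window lifetime of the Con. sub-state is what lets sensor observations pin down infection times.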
Fig. 10.3 Three types of observations in regards to the rumor spreading in Fig. 10.1. (a)
Wavefront; (b) Snapshot; (c) Sensor observation
Current methods of source identification need to scan every node in the underlying network. This creates a bottleneck in identifying rumor sources: scalability. It is necessary to narrow down a set of suspects, especially in large-scale networks. In this section, we develop a reverse dissemination method to identify a small set of suspects. The details of the method are presented in Sect. 10.3.1, and its efficiency will be evaluated in Sect. 10.3.2.
In this subsection, we first present the rationale of the reverse dissemination method. Then, we show how to apply the reverse dissemination method to different types of partial observations on networks.
10.3.1.1 Rationale
Fig. 10.4 Illustration of the reverse dissemination process in regards to the wavefront observation
in Fig. 10.3a. (a) The observed nodes broadcast labeled copies of rumors to their neighbors in time
window t; (b) The neighbors who received labeled copies will relay them to their own neighbors
in time window t − 1
We illustrate the process using the rumor spreading in Fig. 10.1 and the wavefront observation in Fig. 10.3a. All wavefront nodes OI = {E, C, I, K, O} observed in time window t + 1 are labeled black in Fig. 10.4a. The whole process is composed of two rounds of reverse
dissemination. In round 1 (Fig. 10.4a), all observed nodes broadcast labeled copies
reversely to their neighbors in time window t. For example, nodes S and O received
copies of node C (S, O ← C), and node D received copies of three observed
nodes C, I and K (D ← C, I, K). In round 2 (Fig. 10.4b), the neighbors who
have received labeled copies will relay them to other neighbors in time window
t − 1. In each round, the labels will be recorded in each relay node. We can see
from Fig. 10.4b that node S has received all copies from all the observed nodes
(S ← C, E, K, I, O). Then, node S is chosen to be a suspect.
We notice that the time at which each observed node starts its reverse dissemination process varies with the type of observation. For a wavefront, since all the observed nodes are supposed to be contagious in the latest time window, all the observed nodes need to start their reverse dissemination processes simultaneously. For a snapshot, the observed nodes stay in their states in the latest time window; therefore, the reverse dissemination processes also start simultaneously from all the observed nodes. However, for a sensor observation, because the infected sensors record their infection times, the starting time of reverse dissemination for each sensor is determined by ti. More specifically, the latest infected sensors start their reverse dissemination processes first, then the sensors infected in the previous time window, and so on until the very first infected sensors.
10.3.1.2 Wavefront
10.3.1.3 Snapshot
where PS (u, t|oi ), PI (u, t|oi ) and PR (u, t|oi ) denote the probabilities of u to be
susceptible, infected or recovered after time t, respectively, given that the reverse
dissemination started from oi .
10.3.1.4 Sensor
For sensor observations, according to our previous discussion, we let each infected sensor oi ∈ OI start to reversely disseminate copies of the rumor at time tˆi = T − ti, where T = max{ti | oi ∈ OI}. We also let the susceptible sensors oj ∈ OS start to reversely disseminate copies of rumors at time t = 0. To match a sensor observation, a suspect u needs to satisfy the following two principles at time t. First, copies of rumors disseminated from the susceptible sensors oj ∈ OS cannot reach node u at time t (i.e., node u is still susceptible). Second, copies of rumors disseminated from all infected sensors oi ∈ OI can be received by node u at time t (i.e., node u becomes contagious). Mathematically, we determine the suspects by computing their maximum likelihood, as in
L(u, t) = \sum_{o_i \in O_I} \ln P_C(u, t + \hat{t}_i \mid o_i) + \sum_{o_j \in O_S} \ln P_S(u, t \mid o_j). \qquad (10.3)
The values of PS (u, t|oi ), PC (u, t|oi ), PI (u, t|oi ) and PR (u, t|oi ) will be
calculated by the model introduced in Sect. 10.4.2. We summarize the reverse
dissemination method in Algorithm 10.1.
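Given callable estimates of the state probabilities, Eq. (10.3) itself is a direct sum of log-terms. The following sketch assumes the probability functions are supplied by the spreading model; their signatures here are illustrative assumptions.

```python
import math

def sensor_likelihood(u, t, infected, susceptible, t_hat, P_C, P_S):
    """Eq. (10.3): a suspect u should be reachable (contagious) for every
    infected sensor's reverse copy, yet unreachable (still susceptible)
    for every susceptible sensor's copy.  P_C and P_S are callables
    supplied by the spreading model; their signatures are assumed here."""
    L = 0.0
    for o_i in infected:       # infected sensors start at offset t_hat[o_i]
        L += math.log(P_C(u, t + t_hat[o_i], o_i))
    for o_j in susceptible:    # susceptible sensors start at time 0
        L += math.log(P_S(u, t, o_j))
    return L
```

Because the terms are log-probabilities, a suspect that any sensor contradicts (probability near zero) is driven toward an arbitrarily low likelihood.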
Fig. 10.5 Accuracy of the reverse dissemination method in networks. (a) MIT; (b) Sigcom09; (c)
Enron Email; (d) Facebook
communications from 45,813 users between December 29th, 2008 and January 3rd, 2009. All of these datasets reflect the physical mobility and online/offline features of time-varying social networks. According to the study in [151], an appropriate temporal resolution t is important to correctly characterize the dynamical processes on time-varying networks; therefore, we need to be cautious when choosing the time interval of size t. Furthermore, many social networks have been shown to be small-world, i.e., the average distance l between any two nodes is small, generally l ≤ 6. Previous extensive works show that rumors can spread quickly in social networks, generally within 6–10 time ticks of propagation (see [42]). Hence, for the datasets used in this chapter, we uniformly divide each into 6–10 discrete time windows [151]. For other divisions of temporal resolution, readers could refer to [151] for further discussion.
Figure 10.5 shows the experiment results on the four real datasets. We find the proposed method works quite well in reducing the number of suspects. Especially for snapshots, the search scale can be narrowed to 5% of all users for the MIT dataset, 15% for the Sigcom09 dataset, and 20% for the Enron Email and Facebook datasets. The number of suspects can be reduced to 45% of all users in the MIT reality dataset under sensor and wavefront observations. For the Enron Email and Facebook datasets, the number of suspects can be reduced to 20% of all users. The worst case occurred in the Sigcom09 dataset with wavefronts, but our method still achieved a reduction of 35% in the total number of users.
The experiment results on real time-varying social networks show that the
proposed method is efficient in narrowing down the suspects. Real-world social
networks usually have a large number of users. Our proposed method addresses
the scalability in source identification, and therefore is of great significance.
Many existing methods consider the BFS trees instead of the original networks, which violates the real rumor spreading processes. In this section, we adopt an innovative maximum-likelihood based method to identify the real source from the suspects. A novel rumor spreading model will also be introduced to model rumor spreading in time-varying social networks.
The key idea of the ML-based method is to expose the suspect in set U that provides the largest maximum likelihood of matching the observation. It is expected that the real source will produce a rumor propagation that matches the observation, both temporally and spatially, better than the other suspects. Given an observation O = {o1, o2, . . . , on} in a time-varying network, we let the spread of rumors start from an arbitrary suspect u ∈ U in the time window that is tu before the latest time window. For an arbitrary observed node oi, we use PS(oi, tu |u) to denote the probability of oi being susceptible at time tu, given that the spread of rumors starts from suspect u. Similarly, we have PC(oi, tu |u), PI(oi, tu |u) and PR(oi, tu |u) representing the probabilities of oi being contagious, infected and recovered at time tu, respectively. We use L̃(tu, u) to denote the maximum likelihood of obtaining the observation when the rumor started from suspect u. Among all the suspects in U, we estimate the real source by choosing the one with the maximum value of the ML, as in
The result of Eq. (10.4) suggests that suspect u∗ can provide a rumor propagation that not only temporally but also spatially matches the observation better than the other suspects.
We also have an estimation of the infection scale I(t∗, u∗) as a byproduct, as in

I(t^*, u^*) = \sum_{i=1}^{N} P_I(i, t^* \mid u^*). \qquad (10.5)
Later, we can justify the effectiveness of the ML-based method by examining the
accuracy of t ∗ and I (t ∗ , u∗ ).
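The selection rule of Eq. (10.4) amounts to an argmax over suspects and candidate elapsed times. A sketch, with the likelihood supplied as a callable whose signature is assumed for illustration:

```python
def estimate_real_source(suspects, times, likelihood):
    """Eq. (10.4) as an argmax: try every suspect u and every candidate
    elapsed time t, and keep the pair with the largest likelihood.
    `likelihood(t, u)` is supplied by the spreading model (signature
    assumed for this sketch)."""
    return max(((u, t) for u in suspects for t in times),
               key=lambda pair: likelihood(pair[1], pair[0]))
```

Restricting the outer loop to the suspect set produced by reverse dissemination is what keeps this search tractable on large networks.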
10.4.1.2 Wavefront
In a wavefront, all observed nodes are contagious in the time window when the wavefront is captured. Supposing suspect u is the rumor source, the maximum likelihood L̃(tu, u) is computed from the probabilities of all observed nodes being contagious in that window.
10.4.1.3 Snapshot
10.4.1.4 Sensor
Note that PS(u, t|oi), PC(u, t|oi), PI(u, t|oi), and PR(u, t|oi) can be calculated with the rumor spreading model in Sect. 10.4.2. We summarize the method of determining rumor sources in Algorithm 10.2.
where Ni denotes the set of neighbors of user i. Then, we can compute the probability of an arbitrary user being susceptible at time t as in
Once a user gets infected, he/she becomes contagious. We then have the probability
that an arbitrary user is contagious at time t as in
Since an infected user can be either contagious or misled, we can obtain the value
of PI (i, t) as in
This model analytically derives the probabilities of each user being in various states at an arbitrary time t. This constitutes the maximum likelihood L(u, t) of an arbitrary user u being a suspect in time window t in Sect. 10.3.1, and it also supports the calculation of the maximum likelihood L̃(t, u) of matching the observation in time window t, given that the rumor source is the suspicious user u, in Sect. 10.4.1.
10.5 Evaluation
In this section, we evaluate the efficiency of our source identification method. The
experiment settings are the same as those presented in Sect. 10.3.2. Specifically,
we let the sampling ratio α range from 10% to 30%, as the reverse dissemination
method has already achieved a good performance with α dropping in this range.
Fig. 10.6 The distribution of error distance (δ) in the MIT Reality dataset. (a) Sensor; (b)
Snapshot; (c) Wavefront
Fig. 10.7 The distribution of error distance (δ) in the Sigcom09 dataset. (a) Sensor; (b) Snapshot;
(c) Wavefront
Fig. 10.8 The distribution of error distance (δ) in the Enron Email dataset. (a) Sensor; (b)
Snapshot; (c) Wavefront
For the other two categories of observations, although our method cannot identify
real sources with very high accuracy, the estimated sources are very close to the
real sources, with an average of 1–2 hops away in the sensor observations, and 1–3
hops away for the wavefronts. Figure 10.8 shows the performance of our method
in the Enron Email dataset. When the sampling ratio α ≥ 20%, our method can
Fig. 10.9 The distribution of error distance (δ) in the Facebook dataset. (a) Sensor; (b) Snapshot;
(c) Wavefront
identify the real sources with an accuracy of 80% for the snapshots, and more than
45% for the wavefronts. The estimated sources are very close to the real sources, with an average of 1–3 hops away in the sensor observations. Figure 10.9 shows the
performance of our method in the Facebook dataset. Similarly, when the sampling
ratio α ≥ 20%, the proposed method can identify the real sources with an accuracy
of around 40% for the snapshots. The estimated sources are very close to the real
sources, with an average of 1–3 hops away from the real sources under the sensor
and wavefront observations.
Compared with previous work, our proposed method is superior because it can work in time-varying social networks rather than only static networks. In around 80% of all experiment runs, our method accurately identifies the real source or an individual very close to the real source. In contrast, the previous work of [178] theoretically proved an accuracy of at most 25% or 50% in tree-like networks, with an average error distance of 3–4 hops.
We justify the effectiveness of our ML-based method from three aspects: the
correlation between the ML of the real sources and that of the estimated sources,
the accuracy of estimating rumor spreading time, and the accuracy of estimating
rumor infection scale.
We investigate the correlation between the real sources and the estimated sources by
examining the correlation between their maximum likelihood values. For different
types of observation, the maximum likelihood of an estimated source can be
obtained from Eqs. (10.6), (10.7) or Eq. (10.8), i.e., L̃(t ∗ , u∗ ). The maximum
Fig. 10.10 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the MIT reality dataset. (a) Sensor observation; (b) Snapshot observation; (c)
Wavefront observation
Fig. 10.11 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the Sigcom09 dataset. (a) Sensor observation; (b) Snapshot observation; (c)
Wavefront observation
Fig. 10.12 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the Enron Email dataset. (a) Sensor observation; (b) Snapshot observation;
(c) Wavefront observation
Fig. 10.13 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the Facebook dataset. (a) Sensor observation; (b) Snapshot observation;
(c) Wavefront observation
The majority of the correlation results still tend to cluster along a line, which reflects the accuracy of identifying rumor sources in Fig. 10.7. The results for the Enron Email dataset are shown in Fig. 10.12. We see that the maximum likelihood values are highly correlated under both snapshot and wavefront observations, and slightly correlated under sensor observations, which again reflects the accuracy of identifying rumor sources in Fig. 10.8. Similar results are found for the Facebook dataset in Fig. 10.13, which precisely reflects the accuracy of identifying rumor sources in Fig. 10.9.
The strong correlation between the ML values of the real sources and that of the
estimated sources in time-varying social networks reflects the effectiveness of our
ML-based method.
As a byproduct, our ML-based method can also estimate the spreading time (in
Eq. (10.4)) of rumors. In order to justify the effectiveness of our proposed method,
Fig. 10.14 The accuracy of estimating infection scale in real networks. (a) MIT; (b) Sigcom09;
(c) Enron Email; (d) Facebook
different types of observations. As shown in Fig. 10.14c, the worst result occurs in the Enron Email dataset after time tick 4. According to our investigation, this is caused by a large number of infected nodes entering the recovered state in the SIR scheme, which leads to a fairly large uncertainty in the estimate.
To summarize, all of the above evaluations reflect the effectiveness of our method
from different aspects: the high correlation between the ML values of the real
sources and that of the estimated sources, the high accuracy in estimating spreading
time of rumors, and the high accuracy of the infection scale.
10.6 Summary
the overall infection status of rumor propagation, but they can also estimate the probability of an arbitrary node being in an arbitrary state [146]. In the field of identifying propagation sources, researchers generally choose microscopic models, because the task requires estimating which specific node is the first to be infected.
To the best of our knowledge, there is so far no work based on macroscopic models for identifying rumor sources in social networks. Future work may investigate combining microscopic and macroscopic models, or even adopting mesoscopic models [118, 124], to estimate both the rumor sources and the trend of the propagation. There are also many microscopic models other than the SIR model adopted in this chapter, such as the SI, SIS, and SIRS models [146, 201]. As discussed in Sect. 10.2.2, people generally will not believe a rumor again after they know the truth; i.e., once they are recovered, they do not transition to other states. Thus, the SIR model reflects the state transitions of people when they hear a rumor. We also evaluated the performance of the proposed method on the SI model. Since its performance on the SI model is similar to that on the SIR model, we only present the SIR results in this chapter.
Chapter 11
Identifying Multiple Propagation Sources
The global diffusion of epidemics, computer viruses, and rumors causes great damage to our society. One critical issue is to identify the multiple diffusion sources so as to quarantine them in a timely manner. However, most methods proposed so far are unsuitable for diffusion with multiple sources because of their high computational cost and the complex spatiotemporal diffusion processes involved. In this chapter, we introduce an effective method to identify multiple diffusion sources, which addresses three main questions in this area: (1) How many sources are there? (2) Where did the diffusion emerge? (3) When did the diffusion break out? For simplicity, we use rumor source identification to present the approach.
11.1 Introduction
11.2 Preliminaries
where N_i denotes the set of neighbors of node i. This model analytically derives the probability of each node being in each state at an arbitrary time. For real problems, the length of each time tick depends on the real environment: it can be 1 min, 1 h, or 1 day. We also need to set the propagation probability η_ij between nodes properly.
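A discrete-time simulation of this per-edge infection model can be sketched as follows; a minimal SI-style run without the recovery step, with all names and the toy graph being illustrative rather than from this chapter:

```python
import random

def simulate_spread(neighbors, eta, source, ticks, seed=0):
    """Discrete-time SI-style run: at every tick, each previously infected
    node i tries to infect each susceptible neighbor j independently with
    probability eta[(i, j)]. Returns a node -> infection-tick mapping."""
    rng = random.Random(seed)
    infected_at = {source: 0}
    for t in range(1, ticks + 1):
        newly = []
        # Only nodes infected strictly before tick t spread at tick t.
        for i in [n for n, t0 in infected_at.items() if t0 < t]:
            for j in neighbors[i]:
                if j not in infected_at and rng.random() < eta[(i, j)]:
                    newly.append(j)
        for j in newly:
            infected_at.setdefault(j, t)
    return infected_at

# Toy line graph S - A - B with one strong link and one weak link.
neighbors = {"S": ["A"], "A": ["S", "B"], "B": ["A"]}
eta = {("S", "A"): 0.9, ("A", "S"): 0.9, ("A", "B"): 0.2, ("B", "A"): 0.2}
arrival = simulate_spread(neighbors, eta, "S", ticks=10)
print(arrival)
```

Averaging many such runs with different seeds approximates the per-node state probabilities that the analytical model derives in closed form.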
Brockmann and Helbing [24] recently proposed a new measure, effective distance, which can disclose the hidden geometry of complex diffusion. The effective distance from a node i to a neighboring node j is defined as

e_{ij} = 1 - \ln \eta_{ij},

where η_ij is again the propagation probability from i to j. This concept reflects the idea that a small propagation probability from i to j is effectively equivalent to a large distance between them, and vice versa. To illustrate this measure, a simple example is shown in Fig. 11.2. For instance, the propagation probability is 0.8 between nodes S and A, but only 0.1 between S and B (see Fig. 11.2a). Correspondingly, the effective distance between S and A is 1.22, which is much less than that between S and B (see Fig. 11.2b).
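The conversion can be checked numerically. Assuming the natural logarithm in the definition, 1 − ln 0.8 ≈ 1.22, matching the value quoted for the edge between S and A:

```python
import math

def effective_distance(eta_ij):
    # Brockmann-Helbing effective distance between neighboring nodes:
    # a small propagation probability maps to a large effective distance.
    return 1.0 - math.log(eta_ij)

print(f"{effective_distance(0.8):.2f}")  # 1.22, as in Fig. 11.2
print(f"{effective_distance(0.1):.2f}")  # 3.30, much farther despite one hop
```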
Based on the effective distances between neighboring nodes, the length λ(Γ) of a path Γ = {u_1, ..., u_L} is defined as the sum of the effective distances along the edges of the path. Moreover, the effective distance from an arbitrary node i to another node j is defined as the length of the shortest path, in terms of effective distance, from node i to node j, i.e.,

d(i, j) = \min_{\Gamma: i \to j} \lambda(\Gamma).
From the perspective of a diffusion source s, the set of shortest paths, in terms of effective distance, to all the other nodes constitutes a shortest-path tree T_s. Brockmann and Helbing show that the diffusion process initiated from node s on the original network can be represented as wave patterns on the shortest-path tree T_s. In addition, they conclude that the relative arrival time of the diffusion at a node is independent of the diffusion parameters and is linear with the effective distance from the source to the node of interest.
11.3 Problem Formulation
Fig. 11.2 An example of altering an infection graph using effective distance. (a) An example infection graph with source S. The weight on each edge is the propagation probability. The two dotted circles represent the first-order and second-order neighbors of source S. The colors indicate the infection order of the nodes; e.g., nodes A, C, D, and F are infected after the first time tick. Notice that the diffusion process is spatiotemporally complex. (b) The altered infection graph. The weight on each edge is the effective distance between the corresponding end nodes. Notice that the effective distances from source S to the infected nodes accurately reflect their infection order
In this chapter, we will alter the original network by utilizing effective distance
through converting the propagation probability on each edge to the corresponding
effective distance. Then, by using the linear relationship between the relative arrival
time and the effective distance of any infected node, we derive a novel method to
identify multiple diffusion sources.
Before presenting the problem formulation derived in this chapter, we first show an alternative expression of an arbitrary infection graph using effective distance (see again Fig. 11.2). Figure 11.2a shows an example of an infection graph with diffusion source S. The colors indicate the infection order of the nodes (e.g., nodes A, C, D, and F were infected after the first time tick T = 1, and similarly for the other nodes). Notice that the diffusion process is spatiotemporally complex, because first-order neighbors of source S may be infected only after the second time tick (e.g., node E) or even the third time tick (e.g., node B), and similarly for the second-order and third-order neighbors. We then alter the infection graph by replacing the weight on each edge with the effective distance between the corresponding end nodes (see Fig. 11.2b). We notice that the effective distances from source S to all the infected nodes accurately reflect their infection order. This exactly shows that the
relative arrival time of an arbitrary node getting infected is linear with the effective
distance between the source and the node of interest.
Suppose that at time T = 0, there are k (≥ 1) sources, S* = {s_1, ..., s_k}, starting the diffusion simultaneously [58, 115]. Several time ticks after the diffusion started, we observe n infected nodes. These nodes form a connected infection graph G_n, and each source s_i has its infection region C_i (⊆ G_n). Let C* = ∪_{i=1}^{k} C_i be a partition of the infection graph such that C_i ∩ C_j = ∅ for i ≠ j. Each partition C_i is a connected subgraph of G_n and consists of the nodes whose infection can be traced back to the source node s_i. For an arbitrary infected node v_j ∈ C_i, suppose it is infected in the shortest time; then, according to our previous analysis, it has a shorter effective distance to source s_i than to any other source. Therefore, we need to divide the infection graph G_n into k partitions so that each infected node belongs to the partition whose center is at the shortest effective distance. The final partition centers are considered to be the diffusion sources.
Given an infection graph Gn , from the above analysis, we know that our goal is
to identify a set of diffusion sources S ∗ and the corresponding partition C ∗ of the
infection graph Gn . To be precise, we aim to find a partition C ∗ of Gn , to minimize
the following objective function,
\min_{C^*} f = \sum_{i=1}^{k} \sum_{v_j \in C_i} d(v_j, s_i),    (11.6)
where node vj belongs to partition Ci associated with source si , and d(vj , si ) is the
shortest-path distance in terms of effective distance between vj and si .
Equation (11.6) is the proposed formulation of the multi-source identification problem. Since we need to find the k centers of the diffusion from Eq. (11.6), we name the proposed method of solving the multi-source identification problem the K-center method, which we detail in the following section.
C^{(l)} = \bigcup_{i=1}^{k} C_i^{(l)}.    (11.9)

Find the new center of each partition C_i^{(l)} as follows:

s_i^{(l)} = \arg\min_{v_x \in C_i^{(l)}} \sum_{v_j \in C_i^{(l)}} d(v_j, v_x), \quad i = 1, \ldots, k.    (11.10)
distances between each infected node and its corresponding partition center. From the previous subsection, if we randomly choose a set of sources S, the Voronoi partition splits the network into subnets such that each node is associated with its nearest source. Thus, the Voronoi partition finds a locally optimal partition of G_n for a fixed set of sources S. However, to optimize the partition C*, we need to adjust the center of each partition so as to minimize the objective function in Eq. (11.6). In this chapter, we adjust the center of each partition by choosing as the new center the node that has the minimum sum of effective distances to all the other nodes in the partition. We therefore call this method the K-center method. It is similar in spirit to the rumor-center method and the Jordan-center method, which consider rumor centers or Jordan centers as the diffusion sources; as the name suggests, however, the K-center method is specifically designed for multi-source identification. The detailed process of the K-center method is shown in Algorithm 11.2.
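Algorithm 11.2 is not reproduced here, but the alternating loop it implements, Voronoi partitioning (Eq. (11.9)) followed by the center update (Eq. (11.10)), can be sketched as follows. This is a simplified reading, not the book's pseudocode: it assumes a connected graph with symmetric effective-distance edge weights, externally supplied initial centers, and plain Dijkstra instead of the Pettie-Ramachandran algorithm:

```python
import heapq
from collections import defaultdict

def dijkstra(adj, src):
    """Shortest distances from src; adj[u] = [(neighbor, eff_dist), ...]."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def k_center(adj, nodes, init_centers, iters=20):
    """Alternate Voronoi partitioning and center updates until stable."""
    centers = list(init_centers)
    # All-pairs shortest effective distances (fine for small graphs).
    dist = {u: dijkstra(adj, u) for u in nodes}
    for _ in range(iters):
        parts = defaultdict(list)
        for v in nodes:  # assign each node to its nearest center
            parts[min(centers, key=lambda s: dist[s][v])].append(v)
        # New center: the node minimizing the sum of distances in its part.
        new_centers = [min(part, key=lambda x: sum(dist[x][v] for v in part))
                       for part in parts.values()]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return centers

# Toy graph: two tight clusters {0,1,2} and {3,4,5} joined by a weak bridge.
adj = {0: [(1, 1.0), (2, 1.0)], 1: [(0, 1.0), (2, 1.0)],
       2: [(0, 1.0), (1, 1.0), (3, 10.0)],
       3: [(2, 10.0), (4, 1.0), (5, 1.0)],
       4: [(3, 1.0), (5, 1.0)], 5: [(3, 1.0), (4, 1.0)]}
print(k_center(adj, list(range(6)), init_centers=[1, 4]))
```

On this toy graph the loop settles on one center in each cluster, mirroring the intended behavior when each true source sits inside its own infection region.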
The following two theorems show the convergence of the proposed K-center
method and its computational complexity.
Theorem 11.1 The objective function in Eq. (11.6) is monotonically decreasing over the iterations. Therefore, the K-center method is convergent.

Proof Suppose that at iteration t, S^t = {s_1^t, ..., s_k^t} are the estimated sources. We then use Algorithm 11.1 to partition the infection graph G_n as C^t = ∪_{i=1}^{k} C_i^t. Thus, the objective function at iteration t becomes

f^t = \sum_{i=1}^{k} \sum_{v_j \in C_i^t} d(v_j, s_i^t).    (11.11)

Next, we update the center of each partition C_i^t according to Eq. (11.10), obtaining new centers

s_i^{t+1} = \arg\min_{v_x \in C_i^t} \sum_{v_j \in C_i^t} d(v_j, v_x), \quad i = 1, \ldots, k.    (11.12)

Keeping the partition C^t fixed but measuring distances to the new centers gives

\tilde{f}^t = \sum_{i=1}^{k} \sum_{v_j \in C_i^t} d(v_j, s_i^{t+1}).    (11.13)

Since s_i^{t+1} minimizes the sum of distances within C_i^t, we have

\tilde{f}^t \le f^t.    (11.14)

We then re-partition the infection graph G_n with centers S^{t+1} = {s_1^{t+1}, ..., s_k^{t+1}} such that each infected node v_j ∈ G_n is associated with its nearest center s_i^{t+1}, obtaining a new partition C^{t+1} = ∪_{i=1}^{k} C_i^{t+1} of G_n. Thus, the objective function at iteration t + 1 becomes

f^{t+1} = \sum_{i=1}^{k} \sum_{v_j \in C_i^{t+1}} d(v_j, s_i^{t+1}).    (11.15)

Since each node is now assigned to its nearest center, f^{t+1} ≤ \tilde{f}^t ≤ f^t. Therefore, the objective function in Eq. (11.6) is monotonically decreasing; i.e., the K-center method is convergent.
Theorem 11.2 Given an infection graph G_n with n nodes and m edges, the computational complexity of the K-center method is O(mn log α(m, n)), where α(m, n) is the very slowly growing inverse Ackermann function [144].

Proof From Algorithm 11.2, we know that the main cost of the K-center method stems from the calculation of the shortest paths between node pairs in the altered infection graph G_n; the other computation in this algorithm can be treated as constant. In this chapter, we adopt the Pettie-Ramachandran algorithm [144] to compute all-pairs shortest paths in G_n. The computational complexity of the Pettie-Ramachandran algorithm is O(mn log α(m, n)), which therefore bounds the complexity of the K-center method.
The spreading time T based on hops simplifies the modeling process. In the real world, the spreading times of different paths with the same number of hops may differ from each other. We have addressed this temporal problem of the SI model in another chapter [186]. In this field, the majority of current modeling is based on spreading hops [176]. To be consistent with previous work, we adopt the simplified hop-based SI model to study the source identification problem.
11.5 Evaluation
In this section, we evaluate the proposed K-center method in three real network
topologies: the North American Power Grid [180], the Yeast protein-protein inter-
action network [82], and the Facebook network [171]. The Facebook network
topology is crawled from December 29th, 2008 to January 3rd, 2009. The basic
statistics of these networks are shown in Table 11.1, and their degree distributions
Fig. 11.3 Degree distribution. (a) Power grid; (b) Yeast; (c) Facebook
are shown in Fig. 11.3. We adopt the classic SI model and suppose all infections are independent of each other. In the simulations, we set the propagation probability on each edge, η_ij, uniformly distributed in (0, 1). As previous work [186, 207] has shown that the distribution of propagation probabilities does not affect the accuracy of the SI model, a uniform distribution is sufficient to evaluate the performance of the proposed method. Similar propagation probability settings can be found in [116, 203] and [32]. We randomly choose a set of sources S*, and let the number of diffusion sources |S*| range from 2 to 5. For each type of network and each number of diffusion sources, we perform 100 runs; this choice follows the discussion in the previous work of [207]. The implementation is in C++ and MATLAB 2012b.
We first show the convergence of the proposed method. Figure 11.4 shows the objective function values over the iterations when the number of sources is 2 in the three real network topologies. It can be seen that the objective function decreases monotonically over the iterations. Similar results are found for other numbers of sources. This justifies Theorem 11.1 in Sect. 11.4.2.
Fig. 11.4 The monotonic decrease of the objective function. (a) Power grid; (b) Yeast; (c) Facebook
We compare the performance of the proposed K-center method with two competing
methods: the dynamic age method [58] and the multi-rumor-center method [115].
To quantify the performance of each method, we firstly match the estimated sources
Ŝ = {ŝ1 , . . . , ŝk } with the real sources S ∗ = {s1 , . . . , sk } so that the sum of the
error distances between each estimated source and its match is minimized [160].
The average error distance is then given by
\Delta = \frac{1}{|S^*|} \sum_{i=1}^{|S^*|} h(s_i, \hat{s}_i).    (11.20)

We expect that our method can accurately capture the real sources, or at least a set of sources very close to the real sources (i.e., Δ is as small as possible).
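The matching and averaging step can be sketched as follows; a brute-force search over assignments, adequate for the small numbers of sources (2–5) used in these experiments. The function name and the hop-distance callback are illustrative, not from this chapter; for larger k, an optimal-assignment routine such as the Hungarian algorithm would replace the permutation search:

```python
from itertools import permutations

def average_error_distance(real, estimated, hop):
    """Match each estimated source to a distinct real source so that the
    total hop distance is minimal, then average over the sources.
    Assumes len(real) == len(estimated)."""
    best = min(sum(hop(r, e) for r, e in zip(real, perm))
               for perm in permutations(estimated))
    return best / len(real)

# Toy example on a path graph, where the hop distance is |a - b|.
hop = lambda a, b: abs(a - b)
print(average_error_distance([2, 9], [8, 3], hop))  # 1.0: match 2-3 and 9-8
```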
The average error distances for the three real network topologies are provided in Table 11.2. From this table, we can see that the proposed method outperforms the other two methods, in that the estimated sources are closer to the real sources. For a clearer comparison between our proposed method and the other two methods, we show histograms of the average error distances (Δ) in Figs. 11.5 and 11.6 for |S*| = 2 and 3, respectively. We can see that the proposed K-center method outperforms the others. When |S*| = 2, the estimated sources are very close to the real sources in the Power Grid, with average error distances of generally 1–2 hops, whereas the average error distances are around 3–4 hops when using the multi-rumor-center method and around 3–5 hops when using the dynamic age method. For the Yeast network, the diffusion sources estimated by the proposed method are on average 2–3 hops away from the real sources. However, the
Fig. 11.5 Histogram of the average error distances (Δ) in various networks when |S*| = 2. (a) Power grid; (b) Yeast; (c) Facebook
Fig. 11.6 Histogram of the average error distances (Δ) in various networks when |S*| = 3. (a) Power grid; (b) Yeast; (c) Facebook
sources estimated by the multi-rumor-center method are on average 2–4 hops away from the real sources, and on average 3–4 hops away when using the dynamic age method. For the Facebook network, the proposed method can estimate the diffusion sources with an average of 2–3 hops away from the real sources, whereas the estimated sources are on average 3–4 hops away when using the other two methods. Similarly, when |S*| = 3, the diffusion sources estimated by the proposed method are much closer to the real sources in these real networks.
We have compared the performance of our method with two competing methods. From the experiment results (Figs. 11.5 and 11.6, and Table 11.2), we see that our proposed method is superior to previous work. Around 80% of all experiment runs identify nodes on average 2–3 hops away from the real sources when there are two diffusion sources. Moreover, when there are three diffusion sources, around 80% of all experiment runs identify nodes on average 3 hops away from the real sources.
Fig. 11.7 Estimate of the number of sources. (a) Yeast; (b) Power grid; (c) Facebook
vertical axis indicates the percentage of experiment runs estimating the corresponding number of sources. For the Yeast network, we see that 70% of experiment runs accurately estimate the number of sources when |S*| = 1, more than 80% when |S*| = 2, and around 60% when |S*| = 3. For the Power Grid network, around 50% of the total experiment runs accurately detect the number of sources when |S*| ranges from 1 to 3. The accuracy is about 68% on Facebook when |S*| ranges from 1 to 3.
The high accuracy in estimating both the spreading time and the number of
diffusion sources reflects the efficiency of our method from different angles.
We justify the effectiveness of the proposed K-center method from two different
aspects. Firstly, we examine the correlation between the objective function values
in Eq. (11.6) of the estimated sources and those of the real sources. If they
are highly correlated with each other, the objective function in Eq. (11.6) will
accurately describe the multi-source identification problem. Secondly, at each time
tick, we examine the average effective distances from the newly infected nodes to
their corresponding diffusion sources. The linear correlation between the average
effective distances and the spreading time will justify the effectiveness of using
effective distance in estimating multiple diffusion sources.
11.5 Evaluation 155
We investigate the correlation between the estimated sources and the real sources by examining the correlation of their objective function values in Eq. (11.6). If the estimated sources are exactly the real sources, their objective function values f should be highly correlated.
Figures 11.8 and 11.9 show the correlation results of the objective function values when |S*| is 2 or 3, respectively. We can see that the objective function values approximately form linear relationships, which means that the real sources and the estimated sources are highly correlated. The worst results occur in Figs. 11.8a and 11.9a, for the Power Grid network; however, the majority of the correlation results in these two figures still tend to cluster along a line. The strong correlation between the real sources and the estimated sources reflects the effectiveness of the proposed method.
We further investigate the correlation between the relative arrival time of nodes
getting infected and the average effective distance from them to their corresponding
sources. The experiment results in different networks when |S ∗ | is 2 or 3 are shown
in Figs. 11.10 and 11.11, respectively.
As shown in Figs. 11.10 and 11.11, at each time tick, the effective distances from the nodes infected at that tick to their corresponding sources are indicated as blue circles, and their average effective distance at each tick is indicated as a red square. It can be seen that the average effective distance is linear with the relative arrival time. This justifies the design of the proposed K-center method.
Fig. 11.8 The correlation between the objective function of the estimated sources and that of the real sources when |S*| = 2. (a) Power grid; (b) Yeast; (c) Facebook
Fig. 11.9 The correlation between the objective function of the estimated sources and that of the real sources when |S*| = 3. (a) Power grid; (b) Yeast; (c) Facebook
Fig. 11.10 The effective distances between the nodes infected at each time tick and their corresponding sources when |S*| = 2. (a) Power grid; (b) Yeast; (c) Facebook
Fig. 11.11 The effective distances between the nodes infected at each time tick and their corresponding sources when |S*| = 3. (a) Power grid; (b) Yeast; (c) Facebook
11.6 Summary
The global diffusion of epidemics, rumors, and computer viruses causes great damage to our society. It is critical to identify the diffusion sources and promptly quarantine them. However, one critical issue is that current methods are unsuitable for large-scale networks due to their computational cost and the complex spatiotemporal diffusion processes. In this chapter, we introduce a community structure based approach to efficiently identify diffusion sources in large networks.
Chapter 12
Identifying Propagation Source in Large-Scale Networks
12.1 Introduction
quickly and accurately identify rumor sources and therefore eliminate the socio-
economic impact of dangerous rumors.
In the past few years, researchers have proposed a series of methods to identify diffusion sources in networks. However, those methods either have high computational complexity or have relatively lower complexity but apply only to particularly structured networks (e.g., trees and regular networks). For example, the initial methods of rumor source identification were designed for tree networks, including the rumor center method [161] and the Jordan center method [201]. Even in tree networks, the problem of identifying diffusion sources is proven to be #P-complete [161]. Later work relaxed the constraints on trees through heuristic strategies, but relied on complete or snapshot observations, including Bayesian inference [7, 146], spectral techniques [58], and centrality methods [34]. Most of these are based on scanning the whole network. However, real networks are far more complex than tree networks, and it is impractical to scan the whole network to locate the diffusion source, especially for large networks. Recently, Pinto et al. [146] proposed to identify rumor sources based on sensor observations. Their Gaussian method chooses sensors randomly or places sensors on high-degree nodes. In fact, the selection of sensors is crucial in identifying rumor sources, since well-chosen sensors can reflect the spreading direction and speed of the diffusion. Seo et al. [158] compared different strategies for choosing sensors and concluded that high-betweenness or high-degree sensors are more efficient in identifying rumor sources. They proposed a four-metric source estimator, which is also based on scanning the whole network and views the diffusion source as the node that not only reaches the infected sensors with the minimum sum of distances but also is the furthest away from the non-infected sensors. In a nutshell, current methods are not suitable for large-scale networks due to their computational complexity and the large scale of real networks. Readers may refer to [87] for a detailed survey of this area.
In this chapter, we propose a community structure based approach to identify diffusion sources in large-scale networks. It not only addresses the scalability issue in this area but also shows significant advantages. Firstly, to effectively set up sparse sensors, we detect the community structure of a network and choose the community bridge nodes as sensors. According to the earliest infected bridge sensors, we can easily determine the very first infected community, where the diffusion started and spread out to the rest of the network. Consequently, this narrows the suspicious sources down to the very first infected community and thereby overcomes the scalability issue of current methods. Owing to a fundamental property of communities, that links inside a community are much denser than those connecting outside nodes, bridge sensors are very sparse. Secondly, to accurately locate the diffusion source within the first infected community, we use the intrinsic property of the diffusion source that the relative infection time of any node is linear with its effective distance from the source. The effective distance between any pair of nodes is based not only on the number of hops but also on the propagation probabilities along the paths between them [24]. It reflects the idea that a small propagation probability between nodes is effectively equivalent to a large distance between them, and vice versa.
Finally, we use the correlation coefficient to measure the degree of linear dependence between the relative infection times and the effective distances for each suspect, and consider the suspect with the largest correlation coefficient as the diffusion source.
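The selection rule just described can be sketched as follows, with a hand-rolled Pearson coefficient to keep the snippet dependency-free; the suspect and sensor names, and the `eff_dist` callback, are hypothetical placeholders rather than an API from this chapter:

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def pick_source(suspects, sensor_times, eff_dist):
    """Return the suspect whose effective distances to the infected sensors
    are most linearly correlated with the sensors' relative infection times."""
    sensors = sorted(sensor_times)
    times = [sensor_times[o] for o in sensors]
    return max(suspects,
               key=lambda s: pearson([eff_dist(s, o) for o in sensors], times))

# Toy example: distances from suspect A align with the observed times.
d = {("A", "X"): 1, ("A", "Y"): 2, ("A", "Z"): 3,
     ("B", "X"): 3, ("B", "Y"): 1, ("B", "Z"): 2}
print(pick_source(["A", "B"], {"X": 1.0, "Y": 2.0, "Z": 3.0},
                  lambda s, o: d[(s, o)]))  # A
```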
The main contribution of this chapter is threefold.
• We address the scalability issue in source identification problems. Instead of randomly choosing sensors or setting up high-centrality nodes as sensors, as in previous methods, we assign sensors to community bridges. According to the infection times of the bridge sensors, we can easily narrow the suspicious sources down to the very first infected community.
• We propose a novel method that can efficiently locate diffusion sources among the suspects. Here, we use the intrinsic property of the real diffusion source that its effective distance to any node is linear with the relative infection time of that node. The effective distance makes full use of the propagation probability and the number of hops between node pairs, which dramatically enhances the effectiveness of our method.
• We evaluate our method on two large networks collected from Twitter. The experiment results show significant advantages of our method in identifying diffusion sources in large networks. In particular, when the average size of communities shrinks, the accuracy of our method increases dramatically.
Fig. 12.1 Illustration of network communities and community bridges. (a) Separated communi-
ties. Community bridges are the nodes associated with between-community edges, e.g., nodes A
and D connecting the blue community and the green community. (b) Overlapping communities.
Community bridges are not only the nodes associated with between-community edges but also the
nodes shared by different communities, e.g., nodes H , I and J shared by the green community and
the yellow community
will apply both separated and overlapping community detection methods, Infomap [153] and Link Clustering [3], on various real networks.
In this chapter, we will use community structures of networks to effectively
assign sensors. More specifically, we set sensors on community bridges. Community
bridges are nodes shared by two or more different communities or associated with
inter-community links (See Fig. 12.1). This is fundamentally different from previous
methods which choose high centrality nodes as sensors or even randomly set up
sensors.
see that community bridges play a crucial role in transmitting information from
one community to another. They can reflect the spreading direction and speed of
diffusion. Thus, we choose community bridges as sensors.
To assign sensors to community bridges, we first need to detect the community structure of a network. According to Sect. 12.2, community structures can generally be divided into two categories: separated communities and overlapping communities. For separated communities, community bridges are the nodes associated with inter-community edges. For example, in Fig. 12.1a, the green community and the red community are connected by bridges E and F, and the bridges B, H, C, and G connect the blue community and the red community. For overlapping communities, community bridges correspond not only to the nodes associated with inter-community edges but also to the nodes shared by different communities. For example, in Fig. 12.1b, the green community and the yellow community are connected by the shared bridges H, I, and J, and the bridges G, C, and D connect the green community and the purple community.
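Under these definitions, extracting bridge sensors from an already-detected community structure is straightforward. A minimal sketch, assuming the community assignments come from a detector such as Infomap or Link Clustering; the node names and toy graph are illustrative only:

```python
def bridge_nodes(edges, community_of):
    """Separated communities: bridges are the endpoints of edges whose
    two endpoints lie in different communities."""
    bridges = set()
    for u, v in edges:
        if community_of[u] != community_of[v]:
            bridges.update((u, v))
    return bridges

def overlapping_bridges(memberships):
    """Overlapping communities: bridges also include any node that belongs
    to two or more communities."""
    return {v for v, comms in memberships.items() if len(comms) > 1}

# Toy graph in the spirit of Fig. 12.1a: two communities joined by edge A-D.
community_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("A", "D")]
print(sorted(bridge_nodes(edges, community_of)))  # ['A', 'D']
```

For overlapping detectors, the union of the two sets would give the full sensor set.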
When we assign sensors on community bridges, we need to pay attention to the
number of sensors. The more sensors we set up, the more information we will collect
from them. However, in the real world, setting up more sensors will require more
money to buy equipment and more labor to maintain them. Generally, we can control
the number of bridges by regulating the average size of communities. The larger
the average size of communities, the smaller the number of community bridges,
and vice versa. Two extreme examples illustrate this. (1) If we divide a
network into two communities, placing one node in the first community and all
the remaining nodes in the second, the number of bridge nodes will be
d + 1, where d is the degree of the node in the first community. (2) If we set every
single node as a community, the number of bridges will be the number of nodes in
the whole network. In general, however, the number of bridges remains small because
of the intrinsic property of communities that the links between communities are much
sparser than those within communities. In Sect. 12.4.2, we will analyze in detail the
influence of the average size of communities in detecting diffusion sources.
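The two extreme partitions can be verified on a toy path graph. The helper below is our own illustrative code, not from the chapter.

```python
# Counting bridge nodes under the two extreme partitions discussed above.

def count_bridges(edges, community):
    """Bridge nodes of a separated partition: endpoints of inter-community edges."""
    bridges = set()
    for u, v in edges:
        if community[u] != community[v]:
            bridges.update((u, v))
    return len(bridges)

# Path graph 1-2-3-4-5.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
nodes = range(1, 6)

# Extreme (1): node 3 (degree d = 2) alone versus the rest -> d + 1 = 3 bridges.
one_vs_rest = {n: ("A" if n == 3 else "B") for n in nodes}
print(count_bridges(edges, one_vs_rest))  # 3

# Extreme (2): every node is its own community -> all 5 nodes are bridges.
singletons = {n: n for n in nodes}
print(count_bridges(edges, singletons))  # 5
```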
Compared with the existing sensor selection methods, which randomly choose
sensors or select high centrality nodes as sensors, the proposed community structure
based sensor selection method can additionally reflect the diffusion direction and
speed. We will compare different sensor-selection methods in various real networks
in Sect. 12.4.4.
The proposed community structure based approach consists of two steps. In the first
step, we determine the very first infected communities. Given a diffusion process
running for some time in a network, we obtain sparse observations from the sensors
assigned by the scheme in the previous subsection. Assume that k sensors have
been infected, denoted as O = {o1, . . . , ok}, and that {t1, . . . , tk} represents the
164 12 Identifying Propagation Source in Large-Scale Networks
time at which the infection arrives at these sensors. Then, according to the first
infected sensor(s), we can determine which community started the diffusion since
the diffusion has to go through community bridges to infect other communities. For
example in Fig. 12.1a, if sensors {H, F, E, B} are observed as infected and node
H is the first infected one, we can determine that the diffusion started from the red
community. In Fig. 12.1b, if sensors {K, F, H, J, G} are observed as infected and node
K is the first infected one, we can determine that the diffusion could have started
from the blue community or the yellow community. We denote the set of nodes in
the first infected communities as
U = {u1, u2, . . . , um}.   (12.1)
Since we do not have an absolute time reference, we have knowledge only about
the relative infection time. Choosing an arbitrary infected sensor, say o1 , as the
reference node, we can obtain the relative infection time of all the infected sensors
as in
τ = {0, t2 − t1, . . . , tk − t1}.   (12.2)
In the second step, we investigate each suspect in the set U and identify the real
diffusion source. According to the properties of the effective distance introduced in Chap. 11, we know that
the relative infection time of any infected node is linear with its effective distance
from the real diffusion source. Therefore, to identify the diffusion source, we aim to
find the suspect with the best linear correlation between sensors’ relative infection
time and their effective distances from this suspect. Here, we use the correlation
coefficient, which is widely used as a measure of the degree of linear dependence
between two variables [104]. The correlation coefficient between two vectors x =
{x1, x2, . . . , xn} and y = {y1, y2, . . . , yn} is defined as

e = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² ),   (12.3)

where x̄ and ȳ are the means of the xᵢ and yᵢ, respectively. The correlation coefficient
ranges from -1 to 1. A value of 1 implies that a linear equation describes the
relationship between x and y perfectly, with all data points lying on a line for which
y increases as x increases. A value of -1 implies that all data points lie on a line
for which y decreases as x increases. A value of 0 implies that there is no linear
correlation between the variables. Therefore, we need to find a suspect with the
maximum correlation coefficient. More precisely, we aim to find a suspect in U
to maximize Eq. (12.3) in terms of the relative infection time of sensors and their
effective distance from the suspect. The detailed process of the proposed approach
is given in Algorithm 12.1.

Algorithm 12.1 Community structure based source identification
Step 1: Compute the relative infection times of the infected sensors,
τ = {0, t2 − t1, t3 − t1, . . . , tk − t1}.
Find the communities that contain sensor o1, and combine the nodes in these communities,
denoted as
U = {u1, u2, . . . , um}.
Step 2: Calculate the correlation coefficient for each node in U and find the one which has
the largest correlation coefficient as follows.
for (each ui in U) do
Compute the effective distance between ui and each infected sensor oj, denoted as
[D(ui, o1), D(ui, o2), . . . , D(ui, ok)];
Compute the correlation coefficient e between τ and these effective distances by Eq. (12.3);
end for
Return the suspect in U with the largest e as the estimated diffusion source.

Compared with current methods of identifying diffusion sources, the proposed
approach is superior, as many of the existing methods ignore the propagation
probabilities [87]. The proposed method utilizes the effective distance between
nodes, which precisely reflects not only the propagation probability but also the
number of hops between nodes. This makes our algorithm more accurate and
effective. The comparison of our method to competing methods is shown
in Sect. 12.4.4.
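The two-step procedure can be sketched in code. This is a minimal Python sketch under our own naming, not the chapter's implementation: `pearson` implements Eq. (12.3), and `identify_source` ranks the suspects in U by the correlation between the sensors' relative infection times (Eq. (12.2)) and their effective distances from each suspect; the effective-distance function `eff_dist` is supplied by the caller.

```python
from math import sqrt

def pearson(x, y):
    """Correlation coefficient of Eq. (12.3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def identify_source(suspects, sensor_times, eff_dist):
    """suspects: candidate nodes U; sensor_times: {sensor: infection time};
    eff_dist(u, o): effective distance from suspect u to sensor o.
    Returns the suspect whose distances correlate best with the relative times."""
    sensors = sorted(sensor_times)               # fix an order of the k sensors
    t1 = sensor_times[sensors[0]]
    tau = [sensor_times[o] - t1 for o in sensors]  # relative times, Eq. (12.2)
    best, best_e = None, float("-inf")
    for u in suspects:
        d = [eff_dist(u, o) for o in sensors]
        e = pearson(tau, d)
        if e > best_e:
            best, best_e = u, e
    return best

# Toy usage with hand-made effective distances: suspect "s" is perfectly
# linear with the relative times, suspect "v" is anti-correlated.
times = {"a": 10, "b": 12, "c": 14}
dists = {"s": {"a": 1.0, "b": 3.0, "c": 5.0},
         "v": {"a": 5.0, "b": 3.0, "c": 1.0}}
print(identify_source(["s", "v"], times, lambda u, o: dists[u][o]))  # s
```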
From Algorithm 12.1, we see that the computation of our method is dominated
by Step 2 of calculating the correlation coefficient e for each suspect ui in the
very first infected community U . More specifically, the majority of computation
is in the calculation of effective distance between ui and any infected sensor oj
(∈ {o1, o2, . . . , ok}). Here, we use Dijkstra's algorithm [63] to compute the shortest
paths (i.e., the effective distances) from each ui to all infected sensors. Dijkstra's
algorithm requires O(M + N log N) computations to find the shortest paths from
one node to every other node in a network, where M is the number of edges and N
is the number of nodes in the network. However, in Algorithm 12.1, we only need to
calculate the effective distance between each suspect ui and the infected sensors oj,
i.e., [D(ui, o1), D(ui, o2), . . . , D(ui, ok)] in Eq. (12.4). Therefore, the complexity
will be far less than O(M + N log N). Suppose the average size of communities in
the network is m and the number of infected sensors is k. Then, the computational
complexity of the proposed method is far less than O(L(M + N log N)), where
L = min{k, m}. Thus, if the average community size is smaller, it requires less time
to identify the diffusion source.
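The per-suspect shortest-path step can be sketched as below. We take the effective length of an edge with propagation probability p to be 1 − log p, in the spirit of the effective distance of Chap. 11 (Eq. (11.5)); this form, the graph encoding, and the names are our own assumptions. The search stops early once every infected sensor has been settled, which is why far fewer than N nodes are usually explored.

```python
# Dijkstra from one suspect over effective-distance edge lengths, stopping
# as soon as all infected sensors have been reached (illustrative sketch).
import heapq
from math import log

def effective_dists_to_sensors(graph, src, sensors):
    """graph: {node: {neighbor: p}} with propagation probabilities p in (0, 1].
    Returns {sensor: effective distance from src}."""
    remaining = set(sensors)
    dist = {src: 0.0}
    pq = [(0.0, src)]
    out = {}
    while pq and remaining:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                        # stale queue entry
        if u in remaining:
            out[u] = d
            remaining.discard(u)            # one more sensor settled
        for v, p in graph[u].items():
            nd = d + (1.0 - log(p))         # assumed effective edge length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return out

# Chain 1-2-3 with p = 0.5 on each edge: distance 1 -> 3 is 2 * (1 - log 0.5).
graph = {1: {2: 0.5}, 2: {1: 0.5, 3: 0.5}, 3: {2: 0.5}}
print(round(effective_dists_to_sensors(graph, 1, [3])[3], 3))  # 3.386
```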
Current methods are far more complex than the proposed method. They need to
scan the whole network, and calculate the shortest path from each sensor to any
other node. For example, the computational complexity of the Gaussian method
[146] is O(N³) since it requires constructing a BFS tree rooted at each node in
a network, and it also needs to calculate the inverse of the covariance matrix for
each BFS tree. The computational complexity of the Monte Carlo method [2] is
O(k(M + N log N)/ε²), where k is the number of infected sensors. The majority
of the computation is in calculating the shortest paths from an arbitrary node i to
all the sensors in order to sample the infection time of all sensors assuming that
node i is the diffusion source. By the central limit theorem, O(1/ε²) samples are
needed to achieve an error of o(ε). For the Four-metric method [158], the majority of
the computation is also in computing the lengths of the shortest paths from each
node to all the sensors, both infected and non-infected. Thus, the computational
complexity of this method is O(n(M + N log N)), where n is the number of
sensors.
Compared with these existing methods of identifying diffusion sources based
on sensor observations, we see that the computational complexity of the proposed
method is much less than that of current methods. Furthermore, the proposed
method takes advantage of the relative infection time of sensors, the propagation
probabilities and the number of hops between nodes. However, the existing methods
either require the generation of the infection time of each sensor (e.g., the Gaussian
and Monte Carlo methods) or ignore the propagation probabilities (e.g., the Four-
metric method) between nodes. Thus, the proposed method is superior and is able
to work in large networks.
12.4 Evaluation
where z is the number of contacts between node i and j , and μ is the median of
contacts between neighbors.
To demonstrate the robustness of the proposed method across different types
of community structure, we apply separated (Infomap [153]) and overlapping
(Link Clustering [139]) community detection methods on these two networks. The
Infomap method shows communities of a network in a hierarchical structure from
which we can choose different levels of communities. In each level, the number
of communities will be different. The deeper the level is, the more communities
we will obtain and the smaller the average size of each community will be. In our
experiments, we typically choose the second-, third- and fourth-level communities,
denoted by β = 2, 3 and 4. On the other hand, we can adjust the parameter α in the
Link Clustering method to regulate the number of communities of a network. The
larger α is, the more communities we obtain, and similar to the previous method, the
smaller the average size of the communities will be. We typically set α = 0.10, 0.15
and 0.20 in our experiments. In each experiment, we perform 100 runs, randomly
choosing a diffusion source in each run; the choice of 100 runs follows the discussion
in previous work [208]. The implementation is conducted in C++.
Fig. 12.2 Degree distribution of the two large networks. (a) The mention network; (b) The retweet
network
Fig. 12.3 The accuracy of the proposed method in identifying diffusion sources. (a) and (c) show
the accuracy of our method in the mention network and the retweet network having overlapping-
community structure with parameter α ∈ {0.10, 0.15, 0.20}. (b) and (d) show the accuracy of our
method in these networks having separated-community structure with parameter β ∈ {2, 3, 4}
Figure 12.3a shows the experiment results in the Mention network with overlapping-community structure. When α is 0.10, around 48% of the experiment runs can accurately
identify the diffusion sources. When α increases to 0.15 (equivalently, the average
size of communities becomes small), the accuracy of our method increases to
about 57%. When α increases to 0.20, more than 83% of the experiment runs can
accurately identify the real sources. Figure 12.3b shows the experiment results in
the Mention network with separated-community structure. Similar to the results in
the overlapping-community structure, when β is 2, around 52% of the experiment
runs can precisely identify the real diffusion sources. When β increases to 3 (i.e., the
average size of communities becomes small), the accuracy of our method increases
to around 70%. When β increases to 4, our method achieves an accuracy of around
98%, which means only a few runs could not identify the real sources. Similar
results can be found in the Retweet network in Figs. 12.3c, d.
Furthermore, we notice that the average distance between the estimated sources
and the real sources is very small. For both networks, from Fig. 12.3 we see that the
average error distance is within 1–2 hops. That is to say, even when the proposed
method does not accurately identify the real source, the estimated source is on
average within 1–2 hops of the real source. In addition, from Fig. 12.3 we see that
the maximum error distance is also very small (on average 5 hops). Compared
with the existing methods, which have low accuracy and expensive computational
complexity [87], the proposed method shows significantly higher performance in
identifying diffusion sources in large networks.
To summarize, our method performs very well in large networks associated with
either overlapping or separated community structures. Especially, when a network is
associated with a small average community size, our method can accurately identify
diffusion sources.
From the previous subsection, we notice that the accuracy of the proposed method
increases when the parameter α or β becomes large. Equivalently, the performance
of the proposed method improves when the average size of communities becomes
small. In order to analyze the influence of the average size of communities in
the accuracy of our method, we investigate the number of communities, bridges
and suspects when we change the parameters in the separated-community and
overlapping-community detection methods. More specifically, we let the parameter
β range from 2 to 4 for the Infomap method of detecting separated community
structure, and we let the parameter α range from 0.10 to 0.20 for the Link Clustering
method of detecting overlapping community structure.
The distribution of the community sizes of the previous two networks under
different parameter settings is shown in Fig. 12.4. Overall, we can see that the
community sizes follow a power-law distribution, i.e., a few communities are of a
significantly larger size but the majority of the communities are of a smaller size.
Furthermore, the number of communities decreases when the parameter α or β
Fig. 12.4 Community size distribution under different parameter settings. (a) and (b) show the
community size distribution in the mention network and the retweet network having separated
community structure with β ∈ {2, 3, 4}; (c) and (d) show the community size distribution
in the mention network and the retweet network having overlapping community structure with
α ∈ {0.10, 0.15, 0.20}
becomes smaller (compare the density of blue and green dots in Fig. 12.4). The
detailed statistics of the community structures of these two networks derived by
setting different parameters are shown in Table 12.2. For the Retweet network, when
β = 2, there are 852 communities, 8422 bridge nodes, an average of 2158 suspects,
and the average error distance between the estimated sources and the real sources
is 1.77. When β increases to 3, the average error distance decreases to 1.36 and the
number of suspects shrinks to 925, while the number of communities and bridges
increases. When β increases to 4, the average error distance decreases to 0.48 and
the number of suspects shrinks to 153, while the number of communities and bridges
becomes larger. We notice that when the parameter β becomes large, the number
of communities rises, which leads to a decrease in the average size of communities.
Consequently, more bridges are needed to connect communities. We then can obtain
more information from bridge sensors. Thus, we see that the average error distance
between the real sources and the estimated sources becomes smaller. Similar results
can be found in both networks with overlapping-community structure. When the
parameter α increases, the number of bridges increases, and therefore, the average
error distance decreases.
In the real world, it requires a lot of money and energy to set up sensors and
maintain them. Hence, we need to choose as few sensors as possible and start
to identify diffusion sources when only some of the sensors have been infected. Here, we select a
moderate-size set of sensors and then analyze the accuracy of our method when
only a small ratio of sensors are infected (see Fig. 12.5). More specifically, we
choose β = 2 for the Infomap method and α = 0.10 for the Link Clustering
method. Figure 12.5 shows the average error distance between the real sources and
the estimated sources when the ratio of infected sensors ranges from 10% to 100%.
We see that when more than 30% of sensors are infected, our method can identify a
node on average less than 2 hops away from the real source. When more than
50% of the sensors are infected, the average error distance between the real source
and the estimated source is approximately 1. Therefore, the proposed method can
identify diffusion sources with high accuracy even if only a small ratio of sensors
are infected.
From Figs. 12.4, 12.5, and Table 12.2, we see that the performance of the
proposed method improves when the average community size becomes smaller.
Even if the average community size is large and only a small ratio of sensors are
infected, our method can still accurately identify the real diffusion source or a node
very close to the real diffusion source.
In the second step of the proposed method, we utilize the linear correlation between
the relative infection time of any sensor and its effective distance from the diffusion
source. The suspect with the highest correlation coefficient is considered as the
diffusion source. In order to justify the effectiveness of the proposed method, we
examine the relationship between the relative infection time of any infected node
and its effective distance from the diffusion source, especially when the diffusion
starts from sources of different degrees.
In the previous two networks, we let the diffusion start from a small, moderate
and large degree source respectively, and compare the correlation coefficient of the
Fig. 12.6 Justification of our method on the mention network. (a) Linear correlation between the
relative infection time of sensors and their average effective distance from the diffusion source.
Specifically, we let the diffusion start from sources with different degrees: small degree, moderate
degree and large degree. (b), (c) and (d) show the correlation coefficient value for each suspect
Fig. 12.7 Justification of our method on the retweet network. (a) Linear correlation between the
relative infection time of sensors and their average effective distance from the diffusion source.
(b), (c) and (d) show the correlation coefficient value for each suspect
real source and that of all the suspects. Figure 12.6 shows the experiment results on
the Mention network. From Fig. 12.6a, we can see that, with the diffusion starting
from sources of different degrees, the relative infection time of infected nodes is
linear with their average effective distance from the diffusion source. We notice that
when the diffusion starts from a large degree source, the scatter plot begins to curve
at time tick 15. According to our investigation, almost all of the nodes
have been infected by time tick 15. Then, in the remaining time, only the nodes
that resisted infection before time tick 15 can get infected. However, their
smallest number of hops from the diffusion source is fixed. Therefore, according to
Eq. (11.5), their effective distance from the diffusion source will be relatively short.
In Fig. 12.6b, c, and d, we show the correlation coefficient of all suspects. It can
be seen that the real sources (see the red dots) have a high correlation coefficient
whenever the diffusion starts from a source of small, moderate, or large degree.
Figure 12.7 shows the experiment results on the Retweet network. Similar to the
results on the Mention network, the relative infection time of any infected node is
linear with its effective distance from diffusion sources of different degrees. When
the diffusion starts from a large degree source, their relation starts to curve towards
the end. This is because almost all of the nodes have been infected by time tick 13.
Figure 12.7b, c, and d show the correlation coefficients of all suspects. We can see
that the real sources all have high correlation coefficients.
To summarize, the linear correlation between relative infection time of nodes and
their average effective distance from diffusion source justifies the effectiveness of
our proposed method.
In this section, we compare the proposed community structure based approach with
three competing methods of identifying diffusion sources in networks based on
sensor observations. They include:
• The Gaussian method [146],
• The Monte Carlo method [2], and
• The Four-metric method [158].
According to Sect. 12.1, these methods are all susceptible to the scalability issue
because they need to scan every node in a network, which leads to very high
computational complexity (see Sect. 12.3.3). In particular, the Gaussian method,
which is designed for tree networks, needs to construct a BFS tree rooted at
each node in a general network and the inverse of the covariance matrix for each
BFS tree. Therefore, these methods are too computationally expensive to be applied
in large networks. In addition, among these methods only the Four-metric method
investigated and compared different sensor selection methods. Both the Gaussian
method and the Monte Carlo method set up sensors on high degree nodes or even
randomly choose nodes as sensors.
In the following, we first choose four relatively small networks to compare
the performance of the proposed method to that of the three methods. Then, we
introduce two well studied methods to select sensors for the three methods. Finally,
we present the detailed comparison results.
In order to compare with the three competing methods, we choose four relatively
small networks:
• The Western U.S. Power Grid network [180],
• The Yeast protein interaction (PPI) network [82],
• Mention: the mention network of political communication between Twitter users [35], and
• Retweet: the retweet network of political communication between Twitter users [35].
Fig. 12.8 Degree distribution of the four networks. (a) Political mention; (b) Political retweet; (c)
Power grid; (d) Yeast PPI network
Fig. 12.9 Betweenness distribution of the four networks. (a) Political mention; (b) Political
retweet; (c) Power grid; (d) Yeast PPI network
The betweenness distributions of these networks are shown in Fig. 12.9. Correspondingly,
these distributions also exhibit the scale-free or exponential character of
the networks.
We use the above high degree or betweenness strategies to set up sensors for the
existing methods. We let the number of sensors account for no more than 50% of the
total number of nodes in each network. For the proposed community structure based
method, in order to select fewer sensors, we typically set α = 0.10 for the Link
Clustering method in detecting overlapping community structures, and set β = 2
for the Infomap method in detecting separated community structures. The number of
communities and bridges of these four networks under different experiment settings
are shown in Table 12.3. We see that the number of communities is very small and
the number of sensors accounts for less than 30% of the number of nodes in each
network.
In the experiments, the diffusion probability is chosen uniformly from (0, 1), and the
diffusion process propagates for t time steps, where t is uniformly chosen from [8, 10].
We use detection rate to measure the accuracy of identifying diffusion sources. The
detection rate is defined as the fraction of experiments that accurately identify the
real diffusion sources. The higher the detection rate, the better the performance.
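The evaluation protocol above can be sketched as follows. The spread model here is a simple discrete-time SI process, which is our own assumption (the chapter's simulator is not shown), and `estimate_source` is a placeholder for any identification method; graph encoding and names are illustrative.

```python
# Sketch of the evaluation loop: random source, p ~ Uniform(0, 1),
# t ~ Uniform{8, 9, 10}, detection rate = fraction of exact hits.
import random

def spread(graph, source, p, t):
    """Discrete-time SI spread: each step, every infected node infects each
    susceptible neighbor with probability p. Returns {node: infection time}."""
    times = {source: 0}
    for step in range(1, t + 1):
        for u in list(times):               # snapshot: new infections wait a step
            for v in graph[u]:
                if v not in times and random.random() < p:
                    times[v] = step
    return times

def detection_rate(graph, estimate_source, runs=100):
    """Fraction of runs in which estimate_source(graph, times) is the true source."""
    hits = 0
    nodes = list(graph)
    for _ in range(runs):
        src = random.choice(nodes)
        p = random.uniform(0.0, 1.0)
        t = random.randint(8, 10)
        times = spread(graph, src, p, t)
        if estimate_source(graph, times) == src:
            hits += 1
    return hits / runs

# Sanity check on a 6-node ring with an oracle estimator (earliest infection).
random.seed(0)
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(detection_rate(ring, lambda g, ts: min(ts, key=ts.get), runs=20))  # 1.0
```

Any real estimator (e.g. the community structure based method) can be dropped in for the oracle lambda to reproduce this style of comparison.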
We first compare the proposed method to the existing methods associated with
high degree sensors. More specifically, we utilized the Infomap method [153] to
detect separated community structure of each network in this group of experiments.
The experiment results are shown in Fig. 12.10. We can see that the detection rate
of the proposed method is higher than that of the existing methods in each of the
networks. For the two political networks (see Fig. 12.10a, b), 30% of the experiment
runs can accurately identify the diffusion sources, more than 90% of the experiment
runs can identify a node within 2 hops of the real source, and in nearly 100% of the
runs the real source is within 3 hops of the estimated source. Furthermore, the
average error distance from the real sources to the estimated sources is very small.
However, for the existing methods, only a few experiment runs can accurately identify
Fig. 12.10 Comparison of the proposed method with other methods in the accuracy of identifying
diffusion sources when setting sensors at high-degree nodes in four moderate-scale networks. (a)
Political mention; (b) Political retweet; (c) Power grid; (d) Yeast
Fig. 12.11 Comparison of the proposed method with other methods in the accuracy of identifying
diffusion sources when setting sensors at high-betweenness nodes in four moderate-scale networks.
(a) Political mention; (b) Political retweet; (c) Power grid; (d) Yeast
the real diffusion sources. Similar results can be found in the Yeast PPI network (see
Fig. 12.10d). More than 90% of the experiment runs can accurately identify the real
sources by using the proposed method, while few experiment runs can accurately
identify the real sources by using the existing methods. The average error distance
is much larger compared with that of the proposed method. The proposed method
also outperforms the existing methods in the Power Grid network (see Fig. 12.10c).
We then compare the proposed method to the existing methods associated with
high-betweenness sensors. In this group of comparisons, we utilized the Link
Clustering method [140] in detecting the overlapping community structure of each
network. The experiment results on the four networks are shown in Fig. 12.11.
Similar to the results in Fig. 12.10, the detection rate of the proposed method is
higher than that of the existing methods in each of the four networks. For the
two political networks, more than 50% of experiment runs accurately identified
the real diffusion sources in the political Mention network, and more than 70%
of experiment runs accurately identified the real diffusion sources in the political
Retweet network. However, for the existing methods, few of the experiment runs
accurately identified the diffusion sources. Furthermore, the average error distance
is larger compared with that of the proposed method. Similar results can be found in
the Power Grid network and the Yeast PPI network. By using the proposed method,
more than 45% of experiment runs accurately identified the diffusion sources in the
Power Grid network, and more than 60% for the Yeast PPI network. However, for
the existing methods, few of the experiment runs accurately identified the diffusion
sources. Furthermore, the average error distance between the estimated sources and
the real sources is larger compared with that of the proposed method.
From Figs. 12.10 and 12.11, we see that the existing methods show different
performances in the two different sensor selection methods. For example in
Fig. 12.10d, the Monte Carlo method outperforms the Four-metric method under the high-degree
sensor selection method. However, in Fig. 12.11d, the Four-metric method
outperforms the Monte Carlo method under the high-betweenness sensor selection
method, and the Gaussian method shows similar performance under both. In order to see the
impact of using different sensor selection methods on the existing methods, we show
in Fig. 12.12 the correlation between nodes’ degree and their average betweenness
of the four networks. As we can see, nodes with high degree tend to have
high betweenness. However, this is not always the case, as there are also some high-degree
nodes with low betweenness, especially in the political Retweet network and the Power
Grid network. This explains why the existing methods show different performances
in different sensor selection methods in Figs. 12.10 and 12.11.
Figure 12.13 shows the linear relation between relative infection time of nodes
and their average effective distance from the diffusion sources in the four relatively
Fig. 12.12 The relationship between degree and the average betweenness at each degree of the
four networks. (a) Political mention; (b) Political retweet; (c) Power grid; (d) Yeast PPI network
small networks. We can see that relative infection time is linear with the average
effective distance in all these networks. Similar to the results in Figs. 12.6a
and 12.7a, the scatter plot curves towards the end because almost all of the nodes
have been infected by then. The linear correlation in these networks again justifies
the effectiveness of the proposed method.
To summarize, we see that the proposed community structure based method
outperforms the existing methods in identifying diffusion sources based on sen-
sor observations in various networks. The majority of the experiment runs can
accurately identify the real diffusion source or a node that is close to the real
source. However, the existing methods show low performance, and the average error
distance between the estimated sources and the real diffusion sources is very large.
12.5 Summary
13.4 Conclusion
methods for measuring the influence of network hosts and different approaches for
restraining the propagation of malicious attacks. For identifying the propagation
source of malicious attacks, we discussed current methods in regard to three
different categories of observations on the propagation, and analyzed their pros and
cons based on real-world datasets. Furthermore, we discussed three critical
research issues about propagation source identification: identifying propagation
source in time-varying networks, identifying multiple propagation sources, and
identifying propagation source in large-scale networks. For each research issue, we
introduced one representative state-of-the-art method.
Malicious attack propagation and source identification still hold many open
problems, and the literature summarized in this book can be a starting point for
exploring new challenges in the future. Our goal is to give an overview of existing
work on malicious attack propagation and source identification to show its
usefulness to newcomers as well as practitioners in various fields. We also hope
the overview can help avoid redundant, ad hoc effort, both from researchers
and from industry.
References
15. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science,
286(5439):509–512, 1999.
16. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science,
286(5439):509–512, 1999.
17. A. Beuhring and K. Salous. Beyond blacklisting: Cyberdefense in the era of advanced
persistent threats. Security & Privacy, IEEE, 12(5):90–93, 2014.
18. S. Bhagat, A. Goyal, and L. V. Lakshmanan. Maximizing product adoption in social networks.
In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining,
WSDM ’12, pages 603–612. ACM, 2012.
19. V. Blue. Cryptolocker's crimewave: A trail of millions in laundered bitcoin. [Online]
December 22, 2013. [Cited: January 22, 2014.]
20. P. Bonacich. Factoring and weighting approaches to status scores and clique identification.
Journal of Mathematical Sociology, 2(1):113–120, 1972.
21. P. Bonacich. Power and centrality: A family of measures. American journal of sociology,
pages 1170–1182, 1987.
22. Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu. Design and analysis of a social
botnet. Computer Networks, 57(2):556–578, 2013.
23. A. Braunstein and A. Ingrosso. Inference of causality in epidemics on temporal contact
networks. Scientific reports, 6:27538, 2016.
24. D. Brockmann and D. Helbing. The hidden geometry of complex, network-driven contagion
phenomena. Science, 342(6164):1337–1342, 2013.
25. C. Budak, D. Agrawal, and A. El Abbadi. Limiting the spread of misinformation in social
networks. In Proceedings of the 20th international conference on World wide web, WWW
’11, pages 665–674. ACM, 2011.
26. S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir. A model of
internet topology using k-shell decomposition. Proceedings of the National Academy
of Sciences, 104(27):11150–11154, 2007.
27. C. Cattuto, W. Van den Broeck, A. Barrat, V. Colizza, J.-F. Pinton, and A. Vespignani.
Dynamics of person-to-person interactions from distributed rfid sensor networks. PloS one,
5(7):e11596, 2010.
28. CFinder. Clusters and communities, 2013.
29. D. Chakrabarti, J. Leskovec, C. Faloutsos, S. Madden, C. Guestrin, and M. Faloutsos.
Information survival threshold in sensor and p2p networks. In INFOCOM 2007. 26th IEEE
International Conference on Computer Communications. IEEE, pages 1316–1324, 2007.
30. W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’09, pages 199–208. ACM, 2009.
31. Y. Chen, G. Paul, S. Havlin, F. Liljeros, and H. E. Stanley. Finding a better immunization
strategy. Phys. Rev. Lett., 101:058701, Jul 2008.
32. Z. Chen, K. Zhu, and L. Ying. Detecting multiple information sources in networks under the SIR model. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–4. IEEE, 2014.
33. A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large
networks. Phys. Rev. E, 70:066111, Dec 2004.
34. C. H. Comin and L. da Fontoura Costa. Identifying the starting point of a spreading process
in complex networks. Phys. Rev. E, 84:056105, Nov 2011.
35. M. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, F. Menczer, and A. Flammini. Political polarization on Twitter. In ICWSM, 2011.
36. K. L. Cooke and P. Van Den Driessche. Analysis of an SEIRS epidemic model with two delays. Journal of Mathematical Biology, 35(2):240–260, 1996.
37. G. Cowan. Statistical data analysis. Oxford university press, 1998.
38. D. Dagon, C. C. Zou, and W. Lee. Modeling botnet propagation using time zones. In NDSS,
volume 6, pages 2–13, 2006.
39. D. J. Daley and D. G. Kendall. Epidemics and rumours. Nature, 204:1118, 1964.
40. C. I. Del Genio, T. Gross, and K. E. Bassler. All scale-free networks are sparse. Phys. Rev.
Lett., 107:178701, Oct 2011.
41. Z. Dezső and A.-L. Barabási. Halting viruses in scale-free networks. Phys. Rev. E, 65:055103,
May 2002.
42. B. Doerr, M. Fouz, and T. Friedrich. Why rumors spread so quickly in social networks.
Commun. ACM, 55(6):70–75, June 2012.
43. P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings
of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD ’01, pages 57–66. ACM, 2001.
44. W. Dong, W. Zhang, and C. W. Tan. Rooting out the rumor culprit from suspects. In
Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages
2671–2675. IEEE, 2013.
45. S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Structure of growing networks with
preferential linking. Physical review letters, 85(21):4633, 2000.
46. N. Eagle and A. Pentland. Reality mining: sensing complex social systems. Personal and
ubiquitous computing, 10(4):255–268, 2006.
47. D. Easley and J. Kleinberg. Networks, crowds, and markets: Reasoning about a highly
connected world. Cambridge University Press, 2010.
48. H. Ebel, L.-I. Mielsch, and S. Bornholdt. Scale-free topology of e-mail networks. Phys. Rev.
E, 66:035103, Sep 2002.
49. H. Ebel, L.-I. Mielsch, and S. Bornholdt. Scale-free topology of e-mail networks. Phys. Rev.
E, 66:035103, Sep 2002.
50. Computer Economics. Malware report: The economic impact of viruses, spyware, adware, botnets, and other malicious code. Irvine, CA: Computer Economics, 2007.
51. Economist. A thing of threads and patches. Economist, August 25, 2012.
52. P. Erdős. Graph theory and probability. Canad. J. Math., 11:34–38, 1959.
53. P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
54. ESET. Virus radar, November 2014.
55. M. R. Faghani and U. T. Nguyen. Modeling the propagation of trojan malware in online social networks. arXiv preprint arXiv:1708.00969, 2017.
56. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet
topology. In Proceedings of the conference on Applications, technologies, architectures, and
protocols for computer communication, SIGCOMM ’99, pages 251–262. ACM, 1999.
57. X. Fan and Y. Xiang. Modeling the propagation of peer-to-peer worms. Future Generation
Computer Systems, 26(8):1433–1443, 2010.
58. V. Fioriti, M. Chinnici, and J. Palomo. Predicting the sources of an outbreak with a spectral
technique. Applied Mathematical Sciences, 8(135):6775–6782, 2014.
59. S. Fortunato. Community detection in graphs. Physics reports, 486(3):75–174, 2010.
60. S. Fortunato, A. Flammini, and F. Menczer. Scale-free network growth by ranking. Physical
review letters, 96(21):218701, 2006.
61. M. Fossi and J. Blackbird. Symantec internet security threat report 2010. Technical report,
Symantec Corporation, March, 2011.
62. C. Fraser, C. A. Donnelly, S. Cauchemez, W. P. Hanage, M. D. Van Kerkhove, T. D. Hollingsworth, J. Griffin, R. F. Baggaley, H. E. Jenkins, E. J. Lyons, et al. Pandemic potential of a strain of influenza A (H1N1): early findings. Science, 324(5934):1557–1561, 2009.
63. M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network
optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987.
64. L. C. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40:35–
41, 1977.
65. L. C. Freeman. Centrality in social networks conceptual clarification. Social networks,
1(3):215–239, 1978.
66. L. C. Freeman. Centrality in social networks: conceptual clarification. Social Networks,
1:215–239, 1979.
92. L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43,
1953.
93. M. J. Keeling and K. T. Eames. Networks and epidemic models. Journal of the Royal Society
Interface, 2(4):295–307, 2005.
94. M. J. Keeling and P. Rohani. Modeling infectious diseases in humans and animals. Princeton
University Press, 2008.
95. D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social
network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’03, pages 137–146, 2003.
96. M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley, and H. Makse.
Identification of influential spreaders in complex networks. Nature Physics, 6(11):888–893,
Aug 2010.
97. J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins. The web as a
graph: measurements, models, and methods. In International Computing and Combinatorics
Conference, pages 1–17. Springer, 1999.
98. D. Koschützki, K. A. Lehmann, L. Peeters, S. Richter, D. Tenfelde-Podehl, and O. Zlotowski.
Centrality indices. In Network analysis, pages 16–61. Springer, 2005.
99. P. L. Krapivsky and S. Redner. Organization of growing random networks. Physical Review
E, 63(6):066123, 2001.
100. M. J. Krasnow. Hacking, malware, and social engineering—definitions of and statistics about
cyber threats contributing to breaches. Expert Commentary: Cyber and Privacy Risk and
Insurance, January 2012.
101. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic
models for the web graph. In Foundations of Computer Science, 2000. Proceedings. 41st
Annual Symposium on, pages 57–65. IEEE, 2000.
102. H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media?
In WWW ’10: Proceedings of the 19th international conference on World wide web, pages
591–600. ACM, 2010.
103. K. Labs. Facebook malware poses as Flash update, infects 110K users, February 2015.
104. L. I.-K. Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1):255–268, 1989.
105. B. Li. An in-depth look into malicious browser extensions, October 2014.
106. F. Li, Y. Yang, and J. Wu. Cpmc: An efficient proximity malware coping scheme in
smartphone-based mobile networks. In INFOCOM, 2010 Proceedings IEEE, pages 1–9,
2010.
107. Y. Li, W. Chen, Y. Wang, and Z.-L. Zhang. Influence diffusion dynamics and influence
maximization in social networks with friend and foe relationships. In Proceedings of the sixth
ACM international conference on Web search and data mining, WSDM ’13, pages 657–666.
ACM, 2013.
108. Y. Li, P. Hui, D. Jin, L. Su, and L. Zeng. Optimal distributed malware defense in mobile
networks with heterogeneous devices. Mobile Computing, IEEE Transactions on, 2013.
Accepted.
109. Y. Li, B. Zhao, and J.-S. Lui. On modeling product advertisement in large-scale online social
networks. Networking, IEEE/ACM Transactions on, 20(5):1412–1425, 2012.
110. Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási. Controllability of complex networks. Nature, 473:167–173, 2011.
111. A. Y. Lokhov, M. Mézard, H. Ohta, and L. Zdeborová. Inferring the origin of an epidemic with dynamic message-passing algorithm. arXiv preprint arXiv:1303.5315, 2013.
112. A. Louni and K. Subbalakshmi. A two-stage algorithm to estimate the source of information
diffusion in social media networks. In Computer Communications Workshops (INFOCOM
WKSHPS), 2014 IEEE Conference on, pages 329–333. IEEE, 2014.
113. R. D. Luce and A. D. Perry. A method of matrix analysis of group structure. Psychometrika,
14(2):95–116, 1949.
114. W. Luo and W. P. Tay. Finding an infection source under the SIS model. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2930–2934, 2013.
115. W. Luo, W. P. Tay, and M. Leng. Identifying infection sources and regions in large networks.
Signal Processing, IEEE Transactions on, 61(11):2850–2865, 2013.
116. W. Luo, W. P. Tay, and M. Leng. How to identify an infection source with limited
observations. IEEE Journal of Selected Topics in Signal Processing, 8(4):586–597, 2014.
117. W. Luo, W. P. Tay, and M. Leng. Rumor spreading and source identification: A hide and seek
game. arXiv preprint arXiv:1504.04796, 2015.
118. Y. Ma, X. Jiang, M. Li, X. Shen, Q. Guo, Y. Lei, and Z. Zheng. Identify the diversity
of mesoscopic structures in networks: A mixed random walk approach. EPL (Europhysics
Letters), 104(1):18006, 2013.
119. D. MacRae. 5 viruses to be on the alert for in 2014.
120. H. E. Marano. Our brain’s negative bias. Technical report, Psychology Today, June 20, 2003.
121. M. Mathioudakis, F. Bonchi, C. Castillo, A. Gionis, and A. Ukkonen. Sparsification of
influence networks. In Proceedings of the 17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’11, pages 529–537. ACM, 2011.
122. R. M. May and A. L. Lloyd. Infection dynamics on scale-free networks. Phys. Rev. E,
64:066112, Nov 2001.
123. A. R. McLean, R. M. May, J. Pattison, R. A. Weiss, et al. SARS: A case study in emerging
infections. Oxford University Press, 2005.
124. S. Meloni, A. Arenas, S. Gómez, J. Borge-Holthoefer, and Y. Moreno. Modeling epidemic
spreading in complex networks: concurrency and traffic. In Handbook of Optimization in
Complex Networks, pages 435–462. Springer, 2012.
125. S. Milgram. The small world problem. Psychology today, 2(1):60–67, 1967.
126. D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver. Inside the Slammer worm. IEEE Security and Privacy, 1(4):33–39, July 2003.
127. D. Moore, C. Shannon, et al. Code-red: a case study on the spread and victims of an internet
worm. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment, pages
273–284. ACM, 2002.
128. Y. Moreno, M. Nekovee, and A. F. Pacheco. Dynamics of rumor spreading in complex
networks. Physical Review E, 69(6):066130, 2004.
129. T. Nepusz and T. Vicsek. Controlling edge dynamics in complex networks. Nature, 8:568–
573, 2012.
130. NetMiner4. Premier software for network analysis, 2013.
131. M. E. Newman. The structure and function of complex networks. SIAM review, 45(2):167–
256, 2003.
132. M. E. Newman. A measure of betweenness centrality based on random walks. Social
networks, 27(1):39–54, 2005.
133. M. E. Newman. The mathematics of networks. The new palgrave encyclopedia of economics,
2:1–12, 2008.
134. M. E. Newman and J. Park. Why social networks are different from other types of networks.
Physical Review E, 68(3):036122, 2003.
135. M. E. Newman, D. J. Watts, and S. H. Strogatz. Random graph models of social networks.
Proceedings of the National Academy of Sciences, 99(suppl 1):2566–2572, 2002.
136. M. E. J. Newman. Networks: An Introduction, chapter 17 Epidemics on networks, pages 700–
750. Oxford University Press, 2010.
137. M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks.
Phys. Rev. E, 69:026113, Feb 2004.
138. N. P. Nguyen, T. N. Dinh, S. Tokala, and M. T. Thai. Overlapping communities in dynamic
networks: their detection and mobile applications. In Proceedings of the 17th annual
international conference on Mobile computing and networking, MobiCom ’11, pages 85–96.
ACM, 2011.
139. G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure
of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
140. G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure
of complex networks in nature and society. Nature, 435:814–818, 2005.
141. R. A. Pande. Using plant epidemiological methods to track computer network worms. PhD
thesis, Virginia Tech, 2004.
142. C. Pash. The lure of naked Hollywood star photos sent the internet into meltdown in New Zealand. Business Insider Australia, September 7, 2014, 4:21 PM.
143. F. Peter. 'Bogus' AP tweet about explosion at the White House wipes billions off US markets, April 23, 2013. Washington.
144. S. Pettie and V. Ramachandran. A shortest path algorithm for real-weighted undirected
graphs. SIAM Journal on Computing, 34(6):1398–1431, 2005.
145. A.-K. Pietilainen. CRAWDAD data set thlab/sigcomm2009 (v. 2012-07-15). Downloaded
from http://crawdad.org/thlab/sigcomm2009/, July 2012.
146. P. C. Pinto, P. Thiran, and M. Vetterli. Locating the source of diffusion in large-scale networks.
Phys. Rev. Lett., 109, Aug 2012.
147. B. A. Prakash, J. Vreeken, and C. Faloutsos. Spotting culprits in epidemics: How many and
which ones? In Proceedings of the 2012 IEEE 12th International Conference on Data Mining,
ICDM ’12, pages 11–20, Washington, DC, USA, 2012. IEEE Computer Society.
148. B. A. Prakash, J. Vreeken, and C. Faloutsos. Efficiently spotting the starting points of an
epidemic in a large graph. Knowledge and Information Systems, 38(1):35–59, 2014.
149. A. Rapoport. Spread of information through a population with socio-structural bias: I.
assumption of transitivity. The bulletin of mathematical biophysics, 15(4):523–533, 1953.
150. J. G. Restrepo, E. Ott, and B. R. Hunt. Characterizing the dynamical importance of network
nodes and links. Phys. Rev. Lett., 97:094102, Sep 2006.
151. B. Ribeiro, N. Perra, and A. Baronchelli. Quantifying the effect of temporal resolution on
time-varying networks. Scientific reports, 3, 2013.
152. M. Rosvall and C. T. Bergstrom. An information-theoretic framework for resolving com-
munity structure in complex networks. Proceedings of the National Academy of Sciences,
104(18):7327–7331, 2007.
153. M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal
community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123,
2008.
154. G. Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, 1966.
155. M. Sales-Pardo, R. Guimera, A. A. Moreira, and L. A. N. Amaral. Extracting the hierar-
chical organization of complex systems. Proceedings of the National Academy of Sciences,
104(39):15224–15229, 2007.
156. S. Savage, D. Wetherall, A. Karlin, and T. Anderson. Practical network support for IP traceback. ACM SIGCOMM Computer Communication Review, 30(4):295–306, 2000.
157. V. Sekar, Y. Xie, D. A. Maltz, M. K. Reiter, and H. Zhang. Toward a framework for internet
forensic analysis. In ACM HotNets-III, 2004.
158. E. Seo, P. Mohapatra, and T. Abdelzaher. Identifying rumors and their sources in social
networks. In SPIE Defense, Security, and Sensing, volume 8389, 2012.
159. M. A. Serrano and M. Boguñá. Clustering in complex networks. II. Percolation properties. Phys. Rev. E, 74:056115, Nov 2006.
160. D. Shah and T. Zaman. Detecting sources of computer viruses in networks: Theory and exper-
iment. In Proceedings of the ACM SIGMETRICS International Conference on Measurement
and Modeling of Computer Systems, SIGMETRICS ’10, pages 203–214. ACM, 2010.
161. D. Shah and T. Zaman. Rumors in a network: Who’s the culprit? IEEE Transactions on
information theory, 57(8):5163–5181, 2011.
162. D. Shah and T. Zaman. Rumor centrality: A universal source detector. SIGMETRICS Perform.
Eval. Rev., 40(1):199–210, June 2012.
163. Z. Shen, S. Cao, W.-X. Wang, Z. Di, and H. E. Stanley. Locating the source of diffusion in
complex networks by time-reversal backward spreading. Physical Review E, 93(3):032301,
2016.
164. J. Shetty and J. Adibi. The Enron email dataset database schema and brief statistical report. Information Sciences Institute Technical Report, University of Southern California, 4, 2004.
165. S. Shirazipourazad, B. Bogard, H. Vachhani, A. Sen, and P. Horn. Influence propagation in
adversarial setting: how to defeat competition with least amount of investment. In Proceedings
of the 21st ACM international conference on Information and knowledge management, CIKM
’12, pages 585–594. ACM, 2012.
166. L.-P. Song, Z. Jin, and G.-Q. Sun. Modeling and analyzing of botnet interactions. Physica A:
Statistical Mechanics and its Applications, 390(2):347–358, 2011.
167. Symantec. The 2012 Norton cybercrime report. Mountain View, CA: Symantec, 2012.
168. WHO Ebola Response Team. Ebola virus disease in West Africa – the first 9 months of the epidemic and forward projections. N Engl J Med, 371(16):1481–1495, 2014.
169. K. Thomas and D. M. Nicol. The koobface botnet and the rise of social malware. In Malicious
and Unwanted Software (MALWARE), 2010 5th International Conference on, pages 63–70.
IEEE, 2010.
170. M. P. Viana, D. R. Amancio, and L. d. F. Costa. On time-varying collaboration networks.
Journal of Informetrics, 7(2):371–378, 2013.
171. B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi. On the evolution of user interaction
in facebook. In Proceedings of the 2nd ACM workshop on Online social networks, WOSN
’09, pages 37–42, 2009.
172. V. Batagelj and A. Mrvar. Pajek: Analysis and visualization of large networks. In Graph Drawing Software, pages 77–103. Springer, 2003.
173. M. Vojnovic, V. Gupta, T. Karagiannis, and C. Gkantsidis. Sampling strategies for epidemic-
style information dissemination. Networking, IEEE/ACM Transactions on, 18(4):1013–1025,
2010.
174. K. Wakita and T. Tsurumi. Finding community structure in mega-scale social networks:
[extended abstract]. In Proceedings of the 16th international conference on World Wide Web,
WWW ’07, pages 1275–1276, 2007.
175. Y. Wang, G. Cong, G. Song, and K. Xie. Community-based greedy algorithm for mining
top-k influential nodes in mobile social networks. In Proceedings of the 16th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 1039–
1048. ACM, 2010.
176. Y. Wang, S. Wen, Y. Xiang, and W. Zhou. Modeling the propagation of worms in networks:
A survey. Communications Surveys Tutorials, IEEE, PP(99):1–19, 2013.
177. Y. Wang, S. Wen, Y. Xiang, and W. Zhou. Modeling the propagation of worms in networks:
A survey. Communications Surveys Tutorials, IEEE, 16(2):942–960, Second 2014.
178. Z. Wang, W. Dong, W. Zhang, and C. W. Tan. Rumor source detection with multiple
observations: Fundamental limits and algorithms. In The 2014 ACM International Conference
on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, pages 1–13. ACM,
2014.
179. D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
180. D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
181. N. Weaver, V. Paxson, S. Staniford, and R. Cunningham. A taxonomy of computer worms. In
Proceedings of the 2003 ACM Workshop on Rapid Malcode, WORM ’03, pages 11–18, 2003.
182. S. Wen, J. Jiang, B. Liu, Y. Xiang, and W. Zhou. Using epidemic betweenness to measure
the influence of users in complex networks. Journal of Network and Computer Applications,
78:288–299, 2017.
183. S. Wen, J. Jiang, Y. Xiang, S. Yu, and W. Zhou. Are the popular users always important for
the information dissemination in online social networks? Network, IEEE, pages 1–3, October
2014.
184. S. Wen, J. Jiang, Y. Xiang, S. Yu, W. Zhou, and W. Jia. To shut them up or to clarify:
restraining the spread of rumors in online social networks. Parallel and Distributed Systems,
IEEE Transactions on, 25(12):3306–3316, 2014.
185. S. Wen, W. Zhou, Y. Wang, W. Zhou, and Y. Xiang. Locating defense positions for thwarting
the propagation of topological worms. Communications Letters, IEEE, 16(4):560–563, 2012.
186. S. Wen, W. Zhou, J. Zhang, Y. Xiang, W. Zhou, and W. Jia. Modeling propagation dynamics of
social network worms. Parallel and Distributed Systems, IEEE Transactions on, 24(8):1633–
1643, 2013.
187. S. Wen, W. Zhou, J. Zhang, Y. Xiang, W. Zhou, W. Jia, and C. Zou. Modeling and analysis
on the propagation dynamics of modern email malware. Dependable and Secure Computing,
IEEE Transactions on, 11(4):361–374, July 2014.
188. L. Weng, F. Menczer, and Y.-Y. Ahn. Virality prediction and community structure in social
networks. Scientific reports, 3, 2013.
189. P. Wood and G. Egan. Symantec internet security threat report 2011. Technical report,
Symantec Corporation, April, 2012.
190. Y. Xiang, X. Fan, and W. T. Zhu. Propagation of active worms: a survey. International journal
of computer systems science & engineering, 24(3):157–172, 2009.
191. Y. Xie, V. Sekar, D. A. Maltz, M. K. Reiter, and H. Zhang. Worm origin identification using
random moonwalks. In Security and Privacy, 2005 IEEE Symposium on, pages 242–256.
IEEE, 2005.
192. G. Yan, G. Chen, S. Eidenbenz, and N. Li. Malware propagation in online social networks:
nature, dynamics, and defense implications. In Proceedings of the 6th ACM Symposium on
Information, Computer and Communications Security, ASIACCS’11, pages 196–206, 2011.
193. G. Yan and S. Eidenbenz. Modeling propagation dynamics of bluetooth worms (extended
version). Mobile Computing, IEEE Transactions on, 8(3):353–368, 2009.
194. Y. Yan, Y. Qian, H. Sharif, and D. Tipper. A survey on smart grid communication infrastruc-
tures: Motivations, requirements and challenges. Communications Surveys Tutorials, IEEE,
15(1):5–20, First 2013.
195. K. Yang, A. H. Shekhar, D. Oliver, and S. Shekhar. Capacity-constrained network-voronoi
diagram: a summary of results. In International Symposium on Spatial and Temporal
Databases, pages 56–73. Springer, 2013.
196. Y. Yao, X. Luo, F. Gao, and S. Ai. Research of a potential worm propagation model based on pure P2P principle. In Communication Technology, 2006. ICCT'06. International Conference on, pages 1–4. IEEE, 2006.
197. W. Zang, P. Zhang, C. Zhou, and L. Guo. Discovering multiple diffusion source nodes in
social networks. Procedia Computer Science, 29:443–452, 2014.
198. Y. Zhou and X. Jiang. Dissecting android malware: Characterization and evolution. In
Security and Privacy (SP), 2012 IEEE Symposium on, pages 95–109. IEEE, 2012.
199. G.-M. Zhu, H. Yang, R. Yang, J. Ren, B. Li, and Y.-C. Lai. Uncovering evolutionary ages of
nodes in complex networks. The European Physical Journal B, 85(3):1–6, 2012.
200. K. Zhu and L. Ying. Information source detection in the SIR model: A sample path based approach. arXiv preprint arXiv:1206.5421, 2012.
201. K. Zhu and L. Ying. Information source detection in the SIR model: A sample path based approach. In Information Theory and Applications Workshop (ITA), pages 1–9, 2013.
202. K. Zhu and L. Ying. A robust information source estimator with sparse observations.
Computational Social Networks, 1(1):1, 2014.
203. K. Zhu and L. Ying. Information source detection in the SIR model: a sample-path-based approach. IEEE/ACM Transactions on Networking, 24(1):408–421, 2016.
204. Y. Zhu, B. Xu, X. Shi, and Y. Wang. A survey of social-based routing in delay tolerant
networks: Positive and negative social effects. Communications Surveys Tutorials, IEEE,
15(1):387–401, Jan 2013.
205. Z. Zhu, G. Lu, Y. Chen, Z. Fu, P. Roberts, and K. Han. Botnet research survey. In Computer
Software and Applications, 2008. COMPSAC ’08. 32nd Annual IEEE International, pages
967–972, July 2008.
206. C. C. Zou, W. Gong, and D. Towsley. Code red worm propagation modeling and analysis. In
Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS
’02, pages 138–147, 2002.
207. C. C. Zou, D. Towsley, and W. Gong. Modeling and simulation study of the propagation and
defense of internet e-mail worms. IEEE Transactions on dependable and secure computing,
4(2):105–118, 2007.
208. C. C. Zou, D. Towsley, and W. Gong. Modeling and simulation study of the propagation and
defense of internet e-mail worms. IEEE Transactions on Dependable and Secure Computing,
4(2):105–118, 2007.