
Advances in Information Security  73

Jiaojiao Jiang
Sheng Wen
Bo Liu
Shui Yu
Yang Xiang
Wanlei Zhou

Malicious Attack
Propagation
and Source
Identification
Advances in Information Security

Volume 73

Series editor
Sushil Jajodia, George Mason University, Fairfax, VA, USA
More information about this series at http://www.springer.com/series/5576
Jiaojiao Jiang • Sheng Wen • Bo Liu • Shui Yu
Yang Xiang • Wanlei Zhou

Malicious Attack
Propagation and Source
Identification

Jiaojiao Jiang
Swinburne University of Technology
Hawthorne, Melbourne, VIC, Australia

Sheng Wen
Swinburne University of Technology
Hawthorne, Melbourne, VIC, Australia

Bo Liu
La Trobe University
Bundoora, VIC, Australia

Shui Yu
University of Technology Sydney
Ultimo, NSW, Australia

Yang Xiang
Digital Research & Innovation Capability
Swinburne University of Technology
Hawthorn, Melbourne, VIC, Australia

Wanlei Zhou
University of Technology Sydney
Ultimo, NSW, Australia

ISSN 1568-2633
Advances in Information Security
ISBN 978-3-030-02178-8
ISBN 978-3-030-02179-5 (eBook)
https://doi.org/10.1007/978-3-030-02179-5

Library of Congress Control Number: 2018959747

© Springer Nature Switzerland AG 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

In the modern world, the ubiquity of networks has made us vulnerable to various
malicious attacks. For instance, computer viruses propagate throughout the Internet
and infect millions of computers. Misinformation spreads incredibly fast in online
social networks, such as Facebook and Twitter. Experts say that “fake news” on
social media platforms influenced US election voters. Researchers and manufacturers
are developing new methods for building systems that detect suspicious attacks.
However, how can we identify the propagation source of such attacks so as to protect
network assets from fast-acting attacks? Moreover, how can we build effective
and efficient prevention systems that stop malicious attacks before they do damage
and have a chance to infect a system?
So far, extensive work has been done to develop new approaches that effectively
identify the propagation sources of malicious attacks and efficiently restrain
their spread. The goal of this book is to summarize and analyze the state-of-
the-art research and investigations in the field of identifying propagation sources
and preventing malicious propagation, so as to provide an approachable strategy
for researchers and engineers to implement this framework in real-world
applications. The striking features of the book can be summarized in three
aspects:
• Detailed coverage of analyzing and preventing the propagation of malicious
attacks in complex networks. On the one hand, a practical problem in malicious
attack propagation is the spreading influence of initial spreaders. This book
presents and analyzes different methods for detecting influential spreaders. On
the other hand, various strategies have been proposed for preventing malicious
attack propagation. This book numerically analyzes these strategies, derives
their equivalence, and presents a hybrid strategy that combines them.
• A rich collection of contemporary research results on identifying the propagation
sources of malicious attacks. According to the categories of observations on
malicious attacks, current research can be divided into three types. For each
type, we present one representative method and the theory behind it.
A comprehensive theoretical analysis of current methods is further
presented. Beyond the theoretical analysis, the book numerically analyzes
their pros and cons based on real-world datasets.
• A comprehensive study of critical research issues in identifying the propagation
source of malicious attacks. For each issue, the book presents a brief introduction
to the problem and its challenges, and a detailed state-of-the-art method to solve
the problem.
This book is intended to enable readers, especially postgraduate and senior under-
graduate students, to study up-to-date concepts, methods, algorithms, and analytic
skills for building modern detection and prevention systems by analyzing the
propagation of malicious attacks. It enables students not only to master the concepts
and theories of malicious attack propagation and source identification but
also to readily apply the material in implementation practice.
The book is divided into three parts: malicious attack propagation, propagation
source identification, and critical research issues in source identification. In the
first part, after an introduction of the preliminaries of malicious attack propagation,
the book presents detailed descriptions on areas of detecting influential spreaders
and restraining the propagation of malicious attacks. In the second part, after a
summary on the techniques involved in propagation source identification under
different categories of observations about malicious attack propagation, the book
then presents a comprehensive study of these techniques and uses real-world
datasets to numerically analyze their pros and cons. In the third part, the book
explores three critical research issues in the research area of propagation source
identification. The most difficult one is the complex spatiotemporal diffusion
process of malicious attacks in time-varying networks, which is the bottleneck
of current approaches. The second issue is the high computational
complexity of identifying multiple propagation sources. The third important issue
is the huge scale of the underlying networks, which makes it difficult to develop
efficient strategies to quickly and accurately identify propagation sources. These
weaknesses prevent propagation source identification from being applied in a
broader range of real-world applications. This book systematically analyzes the
state of the art in addressing these issues and aims at making propagation source
identification more effective and applicable.

Hawthorne, Melbourne, VIC, Australia Jiaojiao Jiang


Hawthorne, Melbourne, VIC, Australia Sheng Wen
Bundoora, VIC, Australia Bo Liu
Ultimo, NSW, Australia Shui Yu
Hawthorne, Melbourne, VIC, Australia Yang Xiang
Ultimo, NSW, Australia Wanlei Zhou
September 2018
Acknowledgments

We are grateful to the many research students and colleagues at Swinburne University
of Technology in Melbourne and the University of Technology Sydney,
who have offered many comments on our presentations; their comments inspired
us to write this book. We would like to acknowledge the support of the research
grants we have received, in particular Australian Research Council Grants
LP120200266, DP140103649, and DP180102828. Some of the research results
presented in this book are taken from our research papers, which were (partially)
supported by these grants. We would also like to express our appreciation to
the editors at Springer, especially Susan Lagerstrom-Fife and Caroline Flanagan, for
their excellent professional support. Finally, we are grateful to our families
for their consistent support. Without it, this book might have remained
unpublished discussions.

Hawthorne, Melbourne, VIC, Australia Jiaojiao Jiang


Hawthorne, Melbourne, VIC, Australia Sheng Wen
Bundoora, VIC, Australia Bo Liu
Ultimo, NSW, Australia Shui Yu
Hawthorne, Melbourne, VIC, Australia Yang Xiang
Ultimo, NSW, Australia Wanlei Zhou
September 2018

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Malicious Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Examples of Malicious Attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Propagation Mechanism of Malicious Attacks . . . . . . . . . . . . . . . . . . . . . 4
1.4 Source Identification of Malicious Attack Propagation . . . . . . . . . . . . 6
1.5 Outline and Book Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Part I Malicious Attack Propagation


2 Preliminary of Modeling Malicious Attack Propagation. . . . . . . . . . . . . . . 11
2.1 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Community Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Information Diffusion Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 User Influence in the Propagation of Malicious Attacks . . . . . . . . . . . . . . . 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Epidemic Betweenness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Information Propagation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Epidemic Influence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Computation of Epidemic Betweenness . . . . . . . . . . . . . . . . . . . 28
3.3.4 Simple Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.5 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Accuracy in Measuring Influence . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 Comparison with Other Measures of Influence . . . . . . . . . . . 34
3.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.1 Correlation with Traditional Betweenness . . . . . . . . . . . . . . . . 35
3.5.2 Correlation with Classic Centrality Measures . . . . . . . . . . . . . 37
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4 Restrain Malicious Attack Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Methods of Restraining Rumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Controlling Influential Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2 Controlling Community Bridges . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Clarification Through Spreading Truths . . . . . . . . . . . . . . . . . . . 45
4.3 Propagation Modeling Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Modeling Nodes, Topology and Social Factors . . . . . . . . . . . 46
4.3.2 Modeling Propagation Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.3 Modeling People Making Choices . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.4 The Accuracy of the Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Block Rumors at Important Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.1 Empirical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.2 Theoretical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Clarify Rumors Using Truths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.1 Impact of the Truth Injection Time . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.2 Impact of the Truth Propagation Probability . . . . . . . . . . . . . . 57
4.6 A Hybrid Measure of Restraining Rumors . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6.1 Measures Working Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6.2 Equivalence of Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7.1 The Robustness of the Contagious Ability . . . . . . . . . . . . . . . . 61
4.7.2 The Fairness to the Community Bridges . . . . . . . . . . . . . . . . . . 62

Part II Source Identification of Malicious Attack Propagation


5 Preliminary of Identifying Propagation Sources . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1 Observations on Malicious Attack Propagation . . . . . . . . . . . . . . . . . . . . 65
5.2 Maximum-Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Efficiency Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Source Identification Under Complete Observations:
A Maximum Likelihood (ML) Source Estimator . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 The SI Model for Information Propagation . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3 Rumor Source Estimator: Maximum Likelihood (ML) . . . . . . . . . . . . 70
6.4 Rumor Source Estimator: ML for Regular Trees . . . . . . . . . . . . . . . . . . . 71
6.5 Rumor Source Estimator: ML for General Trees . . . . . . . . . . . . . . . . . . . 72
6.6 Rumor Source Estimator: ML for General Graphs . . . . . . . . . . . . . . . . . 73
6.7 Rumor Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.7.1 Rumor Centrality: Succinct Representation . . . . . . . . . . . . . . . 75
6.7.2 Rumor Centrality Versus Distance Centrality . . . . . . . . . . . . . 76

7 Source Identification Under Snapshots: A Sample Path Based


Source Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 The SIR Model for Information Propagation . . . . . . . . . . . . . . . . . . . . . . . 80
7.3 Maximum Likelihood Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.4 Sample Path Based Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.5 The Sample Path Based Estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.6 Reverse Infection Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8 Source Identification Under Sensor Observations: A Gaussian
Source Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2 Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.3 Source Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.4 Source Estimator on a Tree Graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.5 Source Estimator on a General Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9 Comparative Study and Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.1 Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.1.1 Methods Based on Complete Observations . . . . . . . . . . . . . . . 95
9.1.2 Methods Based on Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.1.3 Methods Based on Sensor Observations. . . . . . . . . . . . . . . . . . . 102
9.2 Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2.1 Comparison on Synthetic Networks . . . . . . . . . . . . . . . . . . . . . . . 106
9.2.2 Comparison on Real-World Networks . . . . . . . . . . . . . . . . . . . . . 112
9.3 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

Part III Critical Research Issues in Source Identification


10 Identifying Propagation Source in Time-Varying Networks . . . . . . . . . . . 117
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.2 Time-Varying Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.2.1 Time-Varying Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.2.2 Security States of Individuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.2.3 Observations on Time-Varying Social Networks . . . . . . . . . 120
10.3 Narrowing Down the Suspects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10.3.1 Reverse Dissemination Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10.3.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.4 Determining the Real Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
10.4.1 A Maximum-Likelihood (ML) Based Method . . . . . . . . . . . . 127
10.4.2 Propagation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
10.5.1 Accuracy of Rumor Source Identification . . . . . . . . . . . . . . . . . 130
10.5.2 Effectiveness Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
10.6 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

11 Identifying Multiple Propagation Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.2 Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.2.1 The Epidemic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.2.2 The Effective Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
11.3 Problem Formulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.4 The K-Center Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.4.1 Network Partitioning with Multiple Sources . . . . . . . . . . . . . . 145
11.4.2 Identifying Diffusion Sources and Regions . . . . . . . . . . . . . . . 145
11.4.3 Predicting Spreading Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.4.4 Unknown Number of Diffusion Sources . . . . . . . . . . . . . . . . . . 149
11.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.5.1 Accuracy of Identifying Rumor Sources . . . . . . . . . . . . . . . . . . 151
11.5.2 Estimation of Source Number and Spreading Time . . . . . . 153
11.5.3 Effectiveness Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
11.6 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
12 Identifying Propagation Source in Large-Scale Networks . . . . . . . . . . . . . 159
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.2 Community Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.3 Community-Based Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.3.1 Assigning Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.3.2 Community Structure Based Approach. . . . . . . . . . . . . . . . . . . . 163
12.3.3 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
12.4.1 Identifying Diffusion Sources in Large Networks . . . . . . . . 168
12.4.2 Influence of the Average Community Size . . . . . . . . . . . . . . . . 169
12.4.3 Effectiveness Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
12.4.4 Comparison with Current Methods . . . . . . . . . . . . . . . . . . . . . . . . 173
12.5 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
13 Future Directions and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
13.1 Continuous Time-Varying Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
13.2 Multiple Attacks on One Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.3 Interconnected Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Chapter 1
Introduction

1.1 Malicious Attacks

With the remarkable advances in computer technologies, our social, financial, and
professional lives have become increasingly digitized, and governments, healthcare,
and military infrastructures rely ever more on computer technologies. Meanwhile,
these present larger and more lucrative targets for malicious attacks [81]. A malicious
attack is an attempt to forcefully abuse or take advantage of a computer system or
a network asset, whether with the intent of stealing personal information
(logins, financial data, even electronic money) or of degrading the functionality of a
target computer. According to statistics, worldwide financial losses due to malicious
attacks averaged $12.18 billion per year from 1997 to 2006 [50] and rose to
$110 billion between July 2011 and the end of July 2012 [167].
Typical malicious attacks include viruses, worms, Trojan horses, spyware,
adware, phishing, spamming, rumors, and other types of social engineering. Since
the first computer virus surfaced in the early 1980s, malicious attacks have developed
into thousands of variants that differ in infection mechanism, propagation mecha-
nism, destructive payload, and other features [51].
Based on the types of cyber threats that contribute to breaches, mali-
cious attacks are divided into two main categories [100]: malware and social
engineering.
• Malware, short for “malicious software”, is a broad term that refers to a variety
of malicious programs designed to compromise computers and devices in several
ways, steal data, bypass access controls, or cause harm to the host computers.
Malware comes in a number of forms, including viruses, worms, Trojans,
spyware, adware, rootkits and more.
• Social Engineering is the art of getting users to compromise information systems,
where the attackers use human interaction (i.e., social skills) to obtain or
compromise information about an organization or its computer systems. Instead


of technical attacks on systems, social engineers target humans with access


to information, manipulating them into divulging confidential information or
even into carrying out their malicious attacks through influence and persuasion.
Technical protection measures are usually ineffective against this kind of attack.
Social engineering attacks are multifaceted and include physical, social and
technical aspects, which are used in different stages of the actual attack.
Pretexting, Phishing and Baiting are the most common social engineering tactics.

1.2 Examples of Malicious Attacks

Motivated by extraordinary financial or political rewards, malware owners expend
great effort to compromise as many networked computers as they can in
order to achieve their malicious goals. A compromised computer is called a bot, and
all bots compromised by a malware form a botnet. Botnets have become the attack
engine of malicious attacks, and they pose critical challenges to cyber defenders. As
many security reports indicate, attack numbers are increasing quickly worldwide,
reaching new peaks. Here are some of the major malicious cyber
attacks of recent years and what we can learn from them.
• ILOVEYOU is considered one of the most virulent computer viruses ever created.
The virus managed to wreak havoc on computer systems all over the world,
causing damage estimated at $10 billion. Ten percent of the
world’s Internet-connected computers were believed to have been infected. It was
so bad that governments and large corporations took their mailing systems offline
to prevent infection. The virus used social engineering to get people to click on
the attachment, in this case a love confession. The attachment was actually a
script posing as a TXT file. Once clicked, it sent itself to everyone in the
user’s mailing list and proceeded to overwrite files with itself, making the computer
unbootable.
• CryptoLocker [119] is a form of Trojan-horse ransomware targeting computers
running Windows. It uses several methods to spread itself, such as email; once
a computer is infected, the malware proceeds to encrypt certain files on the hard
drive, and on any mounted storage connected to it, with RSA public-key cryptography.
Figure 1.1 illustrates the infection chain of CryptoLocker. While it is easy enough
to remove the malware from the computer, the files remain encrypted.
The only way to unlock the files is to pay a ransom by a deadline; if the deadline
is not met, the ransom increases significantly or the decryption keys are deleted.
The ransom usually amounted to $400 in prepaid cash or Bitcoin. From data
collected from the raid, the number of infections is estimated to be 500,000,
with about 1.3% of victims paying the ransom, amounting to $3
million [176].
• Code Red first surfaced in 2001. The worm targeted computers with the Microsoft
IIS web server installed, exploiting a buffer overflow vulnerability in the system. It
Fig. 1.1 CryptoLocker infection chain [6]: users receive spam with a malicious attachment
(TROJ_UPATRE.VNA); once executed, it connects to certain websites to download
TSPY_ZBOT.VNA, which exhibits several malicious behaviors, including downloading
TROJ_CRILOCK.NS; TROJ_CRILOCK.NS then locks certain files and asks users to purchase
a decrypting tool

leaves very little trace on the hard disk, as it is able to run entirely in memory,
with a size of 3569 bytes. Once a machine is infected, the worm proceeds to make
a hundred copies of itself, but due to a bug in the programming it duplicates even
more, consuming a large share of system resources. It then launches a denial-of-service
attack on several IP addresses, most famously the website of the White House.
It also allows backdoor access to the server, enabling remote access to the
machine. It was estimated to have caused $2 billion in lost productivity. A total of
1–2 million servers were affected.
• Conficker is a worm for Windows that made its first appearance in 2008. It
infects computers using flaws in the OS to create a botnet. The malware was
able to infect more than 9 million computers around the world, affecting
governments, businesses, and individuals. It was one of the largest known
worm infections ever to surface, causing an estimated $9 billion in damage. The
worm works by exploiting a network-service vulnerability that was present and
unpatched in Windows. Once a computer is infected, the worm resets account lockout
policies, blocks access to Windows Update and antivirus sites, turns off certain
services, and locks out user accounts, among other things. It then installs
software that turns the computer into a botnet slave, along with scareware to scam
money from the user.
• Mydoom, surfacing in 2004, was a worm for Windows that became one of the
fastest-spreading email worms since ILOVEYOU. The worm spreads itself by
appearing as an email transmission error and carries an attachment of itself.
Once executed, it sends itself to the email addresses in the user’s address
book and copies itself to any P2P program’s folder to propagate through
that network. The payload itself is twofold: first, it opens a backdoor to
allow remote access; second, it launches a denial-of-service attack on the
controversial SCO Group. It was believed that the worm was created to disrupt
SCO due to a conflict over ownership of some Linux code. It caused an estimated
$38.5 billion in damages, and the worm is still active in some form today.
• Flashback is a piece of Mac malware; the Trojan was first discovered in 2011,
and a user simply needs to have Java enabled to be vulnerable. It propagates
through compromised websites containing JavaScript code that downloads the payload.
Once installed, the Mac becomes part of a botnet of other infected Macs. More
than 600,000 Macs were infected, including 274 in the Cupertino area, home to
Apple's headquarters.

1.3 Propagation Mechanism of Malicious Attacks

As we can see from the above examples of malicious attacks, a malicious attack
propagation starts from one or few hosts and quickly infects many other hosts.
For example, the ILOVEYOU worm attached in Outlook Mails mailed itself to all
addresses on the host’s mailing list. The Code Red performed network scanning and
propagated to IP addresses connected to the host. In order to fight against malicious
attacks, it is important for cyber defenders to understand malicious attack behavior,
such as membership recruitment patterns, the size of botnets, distribution of bots,
especially the propagation mechanism of malicious attacks.
The diagram shown in Fig. 1.2 illustrates the process of real-world Trojan
malware [54, 103, 105, 169]. Such a process consists of three stages:
• In the first stage, the malware developer creates one or more fake profiles and
infiltrates them into the social network. The purpose of these fake profiles is
to make friends with as many real OSN users as possible. Infiltration has been
shown to be an effective technique for disseminating malicious content in OSNs
such as Facebook [22].
• In the second stage, the malware developer uses social engineering techniques
to create eye-catching web links that trick users into clicking on them. The web

Fig. 1.2 Example of Trojan malware propagation in online social networks [55]

links, which are posted on the fake users’ walls, lead unsuspecting users to a web
page that contains malicious content. A user simply needs to visit or “drive by”
that web page, and the malicious code can be downloaded in the background and
executed on the user’s computer without his/her knowledge. When security flaws
are absent [198], malware creators resort to social engineering techniques to get
assistance from users to activate the malicious code.
• In the third stage, after a user is infected, the malware also posts the eye-catching
web link(s) on the user’s wall to “recruit” his/her friends. If a friend clicks on the
link(s) and, as a result, unknowingly executes the malware, the friend’s computer
and profile will become infected and the propagation cycle continues with his/her
own friends.
Note that malicious attacks are similar to biological viruses in their self-
replicating and propagation behaviors. Thus, the mathematical techniques developed
for the study of biological infectious diseases have been adapted to the study of
malicious attack propagation. The basic epidemic model, the Susceptible-Infected (SI)
model, separates the population into two groups of nodes changing over time:
• A susceptible node is a node that is vulnerable to malicious attack but otherwise
“healthy”. We use S(t) to denote the number of susceptible nodes at time t.
• An infected node is a node that has become infected and may potentially infect other
nodes. We use I (t) to denote the number of infected nodes at time t.
In the SI model, the population is assumed to be large and constant, with n
nodes. Once a node is infected, it never becomes uninfected. Figure 1.3 presents

Fig. 1.3 Observed Code Red propagation, July 19–20 (UTC): number of infected hosts over time (from Caida.org)



the propagation progress of the Code Red worm [141]. The dataset on the Code Red
worm was collected by Moore et al. during the whole day of July 19th [127]. The
SI model matches the propagation of the Code Red worm well [206].
Many other models are derivations of this basic SI form. For example, the
Susceptible-Infected-Recovered (SIR) model describes propagation in which a node
can recover (or become immune) from the infectious state; once a node recovers, it
never becomes susceptible again. The Susceptible-Infected-Susceptible (SIS) model
describes propagation in which a recovered node can become susceptible again.

1.4 Source Identification of Malicious Attack Propagation

From both practical and technical aspects, it is of great significance to identify
propagation sources of malicious attacks. Practically, it is important to accurately
identify the ‘culprit’ of the malicious attack for forensic purposes. Moreover,
locating the source of the malicious attack as quickly as possible helps uncover the
cause of the attack and, therefore, mitigate the damages. Technically, the work in this field
aims at identifying the sources of malicious attacks based on limited knowledge of
network structures and the states of a small portion of nodes. In academia, traditional
identification techniques, such as IP traceback [156] and stepping-stone detection
[157], are not sufficient to seek the sources of malicious attacks, as they only
determine the true source of packets received by a destination. In the propagation of
malicious attacks, the source of packets is almost never the source of the malicious
attack propagation but just one of the many propagation participants [191]. Methods
are needed to find propagation sources higher up in the application level and logic
structures of networks, rather than in the IP level and packets.
Identifying the source of a malicious attack is an extremely desirable but
challenging task, because of the large-scale of networks and the limited knowledge
of hosts’ infection statuses. In general, only a small fraction of hosts can be
observed. Thus, the main difficulty is to develop tractable estimators that can be
efficiently implemented, and that perform well on multiple topologies.
In the past few years, researchers have proposed a series of methods to identify
the diffusion sources of malicious attacks. The first widely discussed research on
this subject was done by Shah and Zaman [161]. In social networks, Shah
and Zaman introduced the rumor centrality of a node as the number of distinct ways a
rumor can spread in the network starting from that node. They showed that the node
with maximum rumor centrality is the maximum likelihood estimator of the rumor
source if the underlying graph is a regular tree. They also studied the detection
performance for irregular geometric trees, small-world networks and scale-free
networks. This method assumes that we know all the connections between nodes
and, additionally, the infection states of all nodes. Researchers [84, 85, 146, 200] later
relaxed some of these constraints: their algorithms require the states of only a
fraction of nodes, called observers, rather than of every node.

Based on the types of observations of the attacked underlying network, the
approaches to identifying malicious attack sources can be divided into three main
categories: the complete observation based source detection, the snapshot based
source detection, and the detector/sensor based source detection. The first category
requires a complete observation of the attacked network after a certain time of
the malicious propagation. The second category requires the snapshot (partial
observation) of the attacked network at a certain time instance. The third category
monitors only a small set of nodes in the attacked network, but does so at all times.
Regardless of the above division, researchers have also considered different epidemic
models [24, 202], spreading on weighted or time-varying graphs [11, 23, 84, 163],
and multi-source detection problems [58, 68, 86].

1.5 Outline and Book Overview

In many ways, current approaches to analyzing the propagation of malicious attacks
and identifying the sources of malicious attacks face the following critical
challenges.
• Networks portray a multitude of interactions among network assets. Researchers
have found that unsolicited malicious attacks spread extremely fast through
influential spreaders [42]. Hence, analyzing the influence of network hosts and
identifying the most efficient ‘spreaders’ in a network becomes an important step
towards identifying the propagation source of malicious attacks and restraining
the propagation. Indeed, the propagation of malicious attacks is in an epidemic
way (one host can infect multiple hosts simultaneously). However, previous
methods in measuring the influence of network hosts focus on the dissemination
of information from one host to only one other host. In this book, we explore
different influence measures and their application of capturing influential nodes
in the epidemics.
Restraining the propagation of malicious attacks in an underlying network has
long been an important but difficult problem. Currently, there are mainly two
types of methods: (1) blocking malicious attacks at the most influential users
or community bridges, and (2) spreading the corresponding countermeasures
to restrain malicious attacks. Given a fixed budget in the real world, how can
we make different methods work together to optimally restrain the
propagation of malicious attacks? In this book, we explore strategies in which
different methods work together, and their equivalence, so as to find a better
strategy to restrain the propagation of malicious attacks.
• The underlying networks often have time-varying topologies. For example,
in human contact networks, the neighborhood of individuals moving over a
geographic space evolves over time, and the interaction between the individuals
appears/disappears in online social network websites (such as Facebook and
Twitter) [151]. Indeed, the spreading of malicious attacks is affected by duration,

sequence, and concurrency of contacts among nodes [27, 170]. Then, can we
model the way that malicious attack spreads in time-varying networks? Can we
estimate the probability of an arbitrary node being infected by a malicious attack?
How do we detect the propagation source of a malicious attack in time-varying
networks? Can we estimate the infection scale and infection time of the malicious
attack?
• Malicious attacks often emerge from multiple sources. However, current methods
mainly focus on the identification of a single attack source in networks. A few
approaches are proposed for identifying multiple attack sources but they all
suffer from extremely high computational complexity, which is not practical to
be adopted in real-world networks. In this book, we will answer the following
questions corresponding to multi-source identification. How many sources are
there? Where did the diffusion emerge? When did the diffusion start?
• Another critical challenge in this research area is the scalability issue. Current
methods generally require scanning the whole underlying network of malicious
attack spreading to locate attack sources. However, real-world networks of
malicious attack diffusion are often of a huge scale and extremely complex
structure. Thus, it is impractical to scan the whole network to locate the attack
sources. We develop efficient approaches to identify attack sources by taking the
structural features of networks and the diffusion patterns of malicious attacks into
account, and therefore address the scalability issue.
To address the above challenges, this book aims to summarize the new tech-
nologies and achieve a breakthrough in source identification of malicious attacks to
enable its effective applicability in real world applications. Based on the challenges,
we divide the book into three main parts.
• Part I: Malicious Attack Propagation
1. Primary knowledge of modeling malicious attack propagation.
2. Spreading influence analysis of network hosts in the propagation of malicious
attacks.
3. Restraining the propagation of malicious attacks.
• Part II: Source Identification of Malicious Attack Propagation
1. Source identification under complete observations: a maximum likelihood
(ML) source estimator.
2. Source identification under snapshots: a sample path based source estimator.
3. Source identification under sensor observations: a Gaussian source estimator.
• Part III: Critical Research Issues in Source Identification
1. Identifying propagation source in time-varying networks.
2. Identifying multiple propagation sources.
3. Identifying propagation source in large-scale networks.
The approaches involved in this book include complex network theory, information
diffusion theory, probability theory, and applied statistics.
Part I
Malicious Attack Propagation
Chapter 2
Preliminary of Modeling Malicious
Attack Propagation

2.1 Graph Theory

Graphs are usually used to represent networks in different fields such as computer
science, biology, and sociology. A graph G = (V , E) consists of a set of nodes V to
represent objects, and a set of edges E = {eij |i, j ∈ V } to represent relationships.
For example, in a computer network, a node represents a computer or a server, and
an edge stands for the connection between two computers or servers. In a social
network, a node represents a person and an edge represents the friendship between
two people. Mathematically, a graph can also be represented as an adjacency matrix
A, in which each entry aij labels the weight on the edge eij .
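To make the representation concrete, the following sketch (the five-node graph is a made-up toy example, not one from the text) builds an adjacency matrix from an edge list:

```python
# Build an adjacency matrix A for a small undirected, unweighted graph.
# The node labels and edges are an illustrative toy example.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

index = {v: i for i, v in enumerate(nodes)}
n = len(nodes)
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[index[u]][index[v]] = 1  # a_ij = 1 when the edge e_ij exists
    A[index[v]][index[u]] = 1  # undirected graph: A is symmetric

print(A[index["A"]][index["B"]])  # 1: A and B are adjacent
print(A[index["A"]][index["D"]])  # 0: A and D are not
```

For a weighted graph, the entries 1 would simply be replaced by the edge weights.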
To capture the importance of nodes in a network, many different centrality
measures have been proposed over the years [98]. According to Freeman in 1979
[66]: “There is certainly no unanimity on exactly what centrality is or on its
conceptual foundations, and there is little agreement on the proper procedure for
its measurement.” In this chapter, we introduce some popular centrality measures as
follows.
Degree Given a node i, the degree [66] of node i is the number of edges connected
to node i. In Fig. 2.1a, the black nodes present higher degree values than the
white nodes. A high degree centrality gives an indication of high influence
of the node in the network. For example, the high-degree nodes in computer
networks often serve as hubs or as major channels of data transmission in the
network. Meanwhile, degree measures the local influence of nodes as the value is
computed by considering the number of links of the node to other nodes directly
adjacent to it. The degree D of a node i can be computed as follows:


D(i) = \sum_{j=1}^{n} e_{ij}, \qquad (2.1)

© Springer Nature Switzerland AG 2019 11


J. Jiang et al., Malicious Attack Propagation and Source
Identification, Advances in Information Security 73,
https://doi.org/10.1007/978-3-030-02179-5_2

where n is the total number of nodes in the network, eij = 1 if and only if i and
j are connected by an edge; otherwise eij = 0.
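Eq. (2.1) amounts to summing a row of the adjacency matrix; a minimal sketch on a made-up graph:

```python
# Degree centrality (Eq. 2.1): D(i) = sum_j e_ij, the i-th row sum of A.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],  # node 2 acts as a hub
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
]

def degree(A, i):
    """Number of edges incident on node i."""
    return sum(A[i])

degrees = [degree(A, i) for i in range(len(A))]
print(degrees)  # [2, 2, 4, 2, 2] -- node 2 has the highest degree
```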
Betweenness The betweenness of a node quantifies the number of times the node
acts as a bridge along the shortest path between two other nodes [65]. Nodes with
high betweenness facilitate the flow of information as they form critical bridges
between other nodes or groups of nodes (see Fig. 2.1b). To be precise, suppose
that g_i^{(st)} is the number of shortest paths from node s to node t that pass through
node i, and suppose that n_{st} is the total number of shortest paths from s to t.
Then the betweenness of node i is defined as follows:

B(i) = \frac{\sum_{s<t} g_i^{(st)} / n_{st}}{\frac{1}{2}\, n(n-1)}. \qquad (2.2)

Researchers have found that some nodes without large degrees also play a
vital role in information diffusion [72, 110]. As shown in Fig. 2.1b, the degree
of node E is smaller than that of nodes A, B, C and D. However, node E is noticeably
more important to information spread as it is the connector of two large groups.
By using betweenness centrality, we can successfully locate these nodes.
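As a sketch of how such bridge nodes can be found in practice, the following code implements Brandes' algorithm for unnormalized betweenness on an unweighted graph; dividing by n(n−1)/2 would give the normalized B(i) of Eq. (2.2). The barbell-shaped graph is our own toy example, with node 3 playing the role of the bridge node E:

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness via Brandes' algorithm (unweighted graph)."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        sigma = [0] * n          # number of shortest s->v paths
        sigma[s] = 1
        dist = [-1] * n
        dist[s] = 0
        preds = [[] for _ in range(n)]
        order = []
        queue = deque([s])
        while queue:             # BFS from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):  # back-propagate path dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return [b / 2 for b in bc]     # each pair (s, t) was counted twice

# Two triangles joined through node 3, mirroring bridge node E in Fig. 2.1b
# (the graph itself is our own toy example).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
bc = betweenness(adj)
print(bc)  # node 3, the bridge, scores highest
```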
Closeness The closeness [65, 136] of a node is defined as the average length of
the shortest path between the node and all other reachable nodes. Mathematically,
the closeness centrality C of a node i can be computed as follows [66]:

C(i) = \frac{n-1}{\sum_{j=1}^{n} d(i,j)}, \qquad (2.3)

where d(i, j ) denotes the distance of the shortest path from node i to node j .
The closeness of a node can be regarded as a measure of how long a piece of
information will take to spread from the node to all the other nodes sequentially
[136]. The more central a node is, the lower its total distance to all other nodes,
and hence the larger its closeness. As shown in Fig. 2.1c, nodes A and B are
the closest to all other nodes.
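A BFS-based sketch of Eq. (2.3) on a toy path graph (our own example) shows the middle node obtaining the largest closeness:

```python
from collections import deque

def closeness(adj, i):
    """C(i) = (n - 1) / sum_j d(i, j), with distances from a BFS (Eq. 2.3)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return (len(dist) - 1) / sum(dist.values())

# Path graph 0-1-2-3-4 (a made-up example): the middle node is the most central.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print([round(closeness(adj, i), 3) for i in adj])  # [0.4, 0.571, 0.667, 0.571, 0.4]
```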

Fig. 2.1 Illustration of some centrality measures. (a) Degree. (b) Betweenness. (c) Closeness

Eigenvector centrality The eigenvector centrality [21, 133] of a node is defined
based on the leading eigenvector of the adjacency matrix A of an undirected
graph G. Let x be the eigenvector of the largest eigenvalue λ of the adjacency
matrix A. The eigenvector centrality of node i is defined as the i-th element, xi ,
in the leading eigenvector x. Equivalently, the eigenvector centrality of a node is
proportional to the sum of the eigenvector centrality of all its neighboring nodes.
Mathematically, given a node i and its neighboring nodes Ni , the eigenvector
centrality of i is the sum of the eigenvector centrality of all its neighbors:
 
x_i = \sum_{j \in N_i} x_j = \sum_{j \in V} A_{ij} x_j. \qquad (2.4)

Therefore, the eigenvector centrality of i depends on both the number of
neighbors |N_i| and the quality of its connections x_j, j ∈ N_i. In the real world,
an influential node is characterized by its connectivity to other influential nodes.
Thus, a node with a high eigenvector centrality is a well-connected node and has
a dominant influence on the surrounding network.
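The leading eigenvector can be approximated by power iteration, as in this sketch (the 4-node graph is illustrative, and the normalization by the largest entry is one common convention):

```python
def eigenvector_centrality(A, iters=200):
    """Approximate the leading eigenvector of A by power iteration,
    renormalizing by the largest entry at each step."""
    n = len(A)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        m = max(y)
        x = [v / m for v in y]
    return x

# Toy graph: node 0 is linked to every other node, so it dominates.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
x = eigenvector_centrality(A)
print([round(v, 3) for v in x])  # node 0 has the largest entry
```

Power iteration converges whenever the graph is connected and non-bipartite, which holds for this example.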
Clustering coefficient In graph theory, a clustering coefficient is a measure of the
degree to which nodes in a graph tend to cluster together [79, 179]. Two versions
of this measure exist: the global and the local. The global version was designed
to give an overall indication of the clustering in the network, whereas the local
gives an indication of the embeddedness of single nodes. Both the global and
local clustering coefficients are defined in undirected graphs. A triplet consists
of three nodes with either two (open triplet) or three (closed triplet) undirected
edges in-between. The global clustering coefficient is defined for a graph as the
proportion of existing closed triplets among all triplets [113].
 
C_G = \frac{\big|\{(i,j,k) \mid e_{ij}, e_{jk}, e_{ik} \in E\}\big|}{\big|\{(i,j,k) \mid e_{ij}, e_{jk} \in E\}\big|}. \qquad (2.5)

A triangle refers to a set of three nodes with three undirected edges among
them. The local clustering coefficient is defined for a node i as the fraction
of triangles among all the triples of nodes in i’s neighborhood, while both the
selected triangles and node triples should contain i.
 
C(i) = \frac{2\,\big|\{e_{jk} \mid j, k \in N_i,\ e_{jk} \in E\}\big|}{k_i (k_i - 1)}, \qquad (2.6)

where N_i = {j | e_{ij} ∈ E or e_{ji} ∈ E} is the set of i’s neighbors and k_i = |N_i|
its degree. The local clustering coefficient measures how close a node’s neighbors
are to being a clique [179].
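A direct sketch of the local coefficient on a small made-up graph:

```python
def local_clustering(adj, i):
    """C(i): fraction of possible edges among i's neighbours that actually exist."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # coefficient is conventionally 0 for degree < 2
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * links / (k * (k - 1))

# Toy graph: node 0's neighbours {1, 2, 3} share one edge (1-2) of three possible.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(local_clustering(adj, 0))  # 1/3
print(local_clustering(adj, 1))  # 1.0
```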
In the real world, malicious attacks often start from influential nodes in the
underlying network so as to infect a great number of hosts quickly. Later in
this book, we will analyze the features of nodes with different centralities in propagating

malicious attacks, and analyze how to restrain malicious attacks through blocking
influential nodes. For other centrality measures, readers could refer to [21] for
details.

2.2 Network Topologies

The underlying network where malicious attacks propagate is a complex system.
Understanding the topological structure of the system is crucial for restraining the
propagation of malicious attacks. With the advance of mobile devices and Internet
of Things (IoT), the network system varies through the dynamic interactions among
network users or assets. A popular methodology of studying such systems is to
use tools of complex network theory to analyze the dynamics of the networks,
and the topological properties that emerge through the process of the dynamics.
Various models devoted to reproducing the dynamics and evolution of network
topology have been developed to capture those properties, such as those based
on preferential attachment, heterogeneity of nodes, and triadic closure. In the
following, we introduce some popular network generating models.

Erdos-Renyi Random Networks The first network generating model was
proposed by Erdos and Renyi in 1959 [53]. The model describes the process of
growing a random network: n nodes connected by m edges randomly selected
from all n(n − 1)/2 possible edges with equal probability p. The degree of nodes
in the Erdos-Renyi (ER) network follows a Poisson distribution. The other key
feature is a sudden change of the network connectivity with the increase of p:
when p is small, many clusters are small and isolated, but once p increases to
be larger than a critical value, the network suddenly becomes very dense where
almost all the nodes are linked to each other in a giant connected component. An
illustration of this feature is presented in Fig. 2.2.

Fig. 2.2 The plot of the mean component size excluding the giant component if there is one (black solid line), and the giant component size (red dashed line), for the ER random network [53, 131]
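A minimal G(n, p) generator, assuming nothing beyond the definition above (parameter values are illustrative):

```python
import random

def erdos_renyi(n, p, seed=None):
    """G(n, p): keep each of the n(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

n, p = 1000, 0.01
edges = erdos_renyi(n, p, seed=42)
mean_degree = 2 * len(edges) / n
print(round(mean_degree, 2))  # close to the expected value p * (n - 1) = 9.99
```

With p above the critical threshold 1/n, as here, almost all nodes fall into one giant connected component.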

Fig. 2.3 The Watts-Strogatz model reproduces the small-world phenomenon by rewiring edges in
a regular network according to the randomness parameter p [179]

Small-World Networks The small-world network originated from the experiment
of Milgram [125], in which selected persons were asked to deliver a letter
to a target receiver by only passing the letter to their acquaintances. Among all
the successful instances, the average length of these communication chains was
short, around 6 steps. The phenomenon is well known as “small-world effect” or
“six degrees of separation”. A small-world network has acquaintanceship-based
edges and the distance between a random pair of people is smaller than expected.
In the real-world setting, the small-world effect implies that most of an individual's
friends are people living nearby, but he or she may also have a few friends far
away. People move around, but geographic distance limits the strength
of social relationships. The Watts-Strogatz model was designed to reproduce
the small-world phenomenon by rewiring each link in a regular network with
a probability p [179]. As shown in Fig. 2.3, when p = 0, the network is fully
ordered; when p = 1, every edge is rewired so as to create a random network;
when 0 < p < 1, we obtain a small-world network with small average shortest
path and high clustering coefficient [179].
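A sketch of the rewiring procedure (our own simplified variant of the Watts-Strogatz construction; edge-bookkeeping details differ across implementations):

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Ring lattice with k neighbours per side, then rewire each edge with
    probability p (the randomness parameter of Fig. 2.3)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for off in range(1, k + 1):
            j = (i + off) % n
            adj[i].add(j)
            adj[j].add(i)
    for i in range(n):
        for off in range(1, k + 1):
            j = (i + off) % n
            if j in adj[i] and rng.random() < p:
                w = rng.randrange(n)
                while w == i or w in adj[i]:
                    w = rng.randrange(n)  # avoid self-loops and duplicate edges
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(w)
                adj[w].add(i)
    return adj

adj = watts_strogatz(20, 2, 0.2, seed=1)
n_edges = sum(len(s) for s in adj.values()) // 2
print(n_edges)  # rewiring preserves the edge count: n * k = 40
```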
Scale-Free Networks A scale-free network has a power-law degree distribution,
commonly seen in many real-world networks, such as the Internet, the film
actor network, the scientific collaboration network, the citation network, and
many others (see Fig. 2.4) [5, 15, 131]. Highly unbalanced degree distribution
in a social network indicates that, in a large group of people, only a few are
extremely popular and most others do not have too many contacts. It has been
suggested to be the most critical feature of social networks [135]. Among
many models that can capture the heterogeneous distribution in connectivity
[45, 60, 97, 99, 101, 135], the Barabasi-Albert model was the first to generate a scale-
free network with two simple mechanisms: continuously adding new nodes into
the system (“growth”) and connecting with other nodes with preference to the
high-degree ones (“preferential attachment”) [15]. Motivated by the structure of
the Web graph, the copying model added a new node into the network and linked

Fig. 2.4 The connectivities of various large real-world networks have scale-free distributions, (a)
actor collaboration graph, (b) the World Wide Web, and (c) the power grid network [15]

it to a random existing node or its neighbors [97, 101]. Another model proposed
by Newman et al. [135] aimed to build up a random graph with an arbitrary
degree distribution. The ranking model grew the network according to a rank of the
nodes by any given prestige measure; the probability of linking a target node
could be any power law function of its rank, resulting in a power-law degree
distribution [60].
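A compact sketch of the Barabasi-Albert growth / preferential-attachment mechanism (the seed core and parameters are our own assumptions):

```python
import random

def barabasi_albert(n, m, seed=None):
    """Growth + preferential attachment: each new node adds m edges to existing
    nodes picked with probability proportional to their current degree."""
    rng = random.Random(seed)
    # seed the process with a small fully connected core of m + 1 nodes
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    targets = [v for e in edges for v in e]  # each node appears once per degree
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))  # degree-proportional sampling
        for t in chosen:
            edges.append((new, t))
            targets += [new, t]
    return edges

edges = barabasi_albert(200, 2, seed=7)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))  # 397 edges; a few hubs dominate
```

The `targets` list trick makes picking a node proportional to its degree a single uniform draw, since every node appears in the list once per incident edge.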

2.3 Community Structure

A community is a group of densely connected nodes in a graph. The community
structure is claimed to be one key property of various complex networks, suggesting
that a network can be partitioned into several clusters so that nodes in one cluster are
densely connected internally but not externally; such clustering might derive from
common interests of people, geographical divisions of power grids, or functional
similarity of proteins [72, 134]. Many methods have been proposed for community
detection, including finding separated/non-overlapping communities (see Fig. 2.5a)
and detecting overlapping communities (see Fig. 2.5b). In the following, we intro-
duce two popular methods: Infomap [153] and Link Clustering [3]. For other
community detection methods, readers could refer to [59].
Infomap The Infomap algorithm aims at finding non-overlapping communities.
The method is built on the assumption that a random walker is more likely to
be trapped in communities than to travel between communities. The path of
a random walker can be encoded, and then compressed given a hierarchical
network partition so that the encoded description is minimum. The duality
between finding community structure in a network and the coding problem is:
to find an efficient code, it looks for a module partition M of n nodes into m

Fig. 2.5 Illustration of network communities. (a) Non-overlapping communities. (b) Overlapping
communities

modules so as to minimize the expected description length of a random walk. By
using the module partition M, the average description length of a single step is
given by

L(M) = q\,H(\mathcal{L}) + \sum_{i=1}^{m} p^{i} H(\mathcal{P}^{i}), \qquad (2.7)

where H(\mathcal{L}) is the entropy of module names in M; H(\mathcal{P}^{i}) is the entropy
of intra-module movements; q gives the probability that the random walk
switches modules on a given step; p^{i} is the sum of the probability of intra-
module movements inside module i and the probability of exiting i. The first
part of the formula describes the entropy of the movement between communities,
and the second part sums up the entropy within each community. Eventually
Infomap applies computational search algorithm to find the best partition as the
outcome [153].
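Given walker statistics for a candidate partition, Eq. (2.7) can be evaluated directly. The sketch below assumes we already know each module's exit probability and the visit probabilities of its nodes; all numbers are illustrative, not from a real network:

```python
from math import log2

def entropy(probs):
    """Shannon entropy (bits) of a normalized distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def map_equation(module_exit, module_visits):
    """Evaluate L(M) = q H(exit codebook) + sum_i p_i H(module-i codebook).
    module_exit[i]   -- probability that the walker exits module i on a step
    module_visits[i] -- visit probabilities of the nodes inside module i"""
    q = sum(module_exit)
    index_len = q * entropy([e / q for e in module_exit]) if q > 0 else 0.0
    module_len = 0.0
    for exit_p, visits in zip(module_exit, module_visits):
        p_i = exit_p + sum(visits)
        module_len += p_i * entropy([x / p_i for x in [exit_p, *visits]])
    return index_len + module_len

# Two modules with made-up walker statistics; rare module switching (exit
# probability 0.05 each) yields a short per-step description length.
L = map_equation([0.05, 0.05], [[0.2, 0.15, 0.1], [0.2, 0.15, 0.1]])
print(round(L, 3))  # 1.946 bits per step
```

Raising the exit probabilities (more frequent switching) increases L(M), which is why partitions that trap the walker score better.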
Link Clustering Different from Infomap, the Link Clustering algorithm aims at
discovering overlapping communities in which a node is allowed to belong to
multiple communities. This algorithm reinvents communities as groups of links
rather than nodes. The set of neighbors of a node i is denoted as Ni . Given a pair
of links with one shared node, eij and ej k , the similarity between these two links
is the Jaccard similarity between neighbor sets of distinct nodes:
S(e_{ij}, e_{jk}) = \frac{|N_i \cap N_k|}{|N_i \cup N_k|}. \qquad (2.8)
Then a dendrogram is built up according to these similarities using single-linkage
hierarchical clustering and cutting the dendrogram at some level produces the
overlapped community structure. Given a partition P = {P1 , P2 , . . . , PC }, a
partition density D can be computed by the average partition density weighted
by the fraction of present links in each partition:

D = \sum_{c} \frac{m_c}{M} D_c = \frac{2}{M} \sum_{c} m_c\, \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)}, \qquad (2.9)

where mc and nc are the numbers of edges and nodes in the partition Pc ,
respectively. The cutting threshold in the dendrogram can be determined by
achieving a maximum partition density.
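A sketch of the two quantities in Eqs. (2.8) and (2.9), on a toy triangle graph:

```python
def link_similarity(adj, i, j, k):
    """Jaccard similarity (Eq. 2.8) of links e_ij and e_jk sharing node j."""
    ni, nk = set(adj[i]), set(adj[k])
    return len(ni & nk) / len(ni | nk)

def partition_density(partitions, M):
    """Eq. 2.9: each partition is a pair (m_c, n_c) of link and node counts;
    M is the total number of links in the network."""
    total = 0.0
    for m_c, n_c in partitions:
        if n_c > 2:  # partitions with n_c <= 2 contribute zero density
            total += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 / M * total

# Triangle graph: links e_01 and e_12 share node 1.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(link_similarity(adj, 0, 1, 2))   # 1/3
print(partition_density([(3, 3)], 3))  # 1.0 for a fully dense triangle
```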

2.4 Information Diffusion Models

Early models concerning communication dynamics were inspired by studies of
epidemic spreading [10, 12, 39, 73, 149]. Similar to how an infectious disease
is transmitted among the population, a piece of information can pass from one
individual to another through social connections and “infected” individuals can, in
turn, propagate the information to others, possibly generating a full-scale contagion.
The Susceptible-Infected (SI) [93, 94], Susceptible-Infected-Recovered (SIR) [10],
and Susceptible-Infected-Susceptible (SIS) [12] models are three classical models
in epidemiology, in which the infected population grows exponentially until the rate
of infection is balanced by the rate of recovery, or the contagion finally dies off when
the recovery rate prevails. As another foundation for this field, different models refer
to different scenarios in seeking propagation origins. Currently, researchers mainly
employ these three epidemic models for modelling the propagation of malicious
attacks:

Susceptible-Infected (SI) Model In this model, nodes are initially susceptible
and can be infected along with the information propagation (Fig. 2.6a). Once a
node is infected, it remains infected. This model focuses on the infection process
S → I, regardless of the recovery process. Suppose β is the rate at which
susceptible nodes become infected through contact; then the average number
of new infections from time t to t + Δt can be calculated as

\Delta I = \beta\, I(t)\, S(t)\, \Delta t. \qquad (2.10)

According to this assumption, we have S(t) = n − I(t). So we can rewrite the SI
model as follows:

\frac{dI(t)}{dt} = \beta I(t)\,(n - I(t)). \qquad (2.11)
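A simple Euler integration of the SI dynamics reproduces the characteristic S-shaped (logistic) growth seen in the Code Red trace of Fig. 1.3; the parameters below are illustrative:

```python
def si_model(n, beta, i0, dt=0.01, steps=2000):
    """Euler integration of dI/dt = beta * I(t) * (n - I(t))  (Eq. 2.11)."""
    I = float(i0)
    history = [I]
    for _ in range(steps):
        I += beta * I * (n - I) * dt
        history.append(I)
    return history

# Illustrative parameters: one infected host out of n = 1000.
hist = si_model(n=1000, beta=0.001, i0=1)
print(round(hist[-1]))  # ~1000: eventually every node is infected
```

Because there is no recovery, I(t) grows monotonically and saturates at n.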

Fig. 2.6 Illustration of epidemic spreading models. (a) SI model; (b) SIR model; (c) SIS model

Susceptible-Infected-Recovered (SIR) Model Recovery processes are considered
in this model (Fig. 2.6b). Similarly, nodes are initially susceptible and can
be infected along with the propagation. Infected nodes can then be recovered,
and never become susceptible again. This model deals with the infection and
curing process S → I → R. Suppose γ is the transition rate of infected status to
recovered status, then the SIR model can be expressed by the following formula:

\frac{dS(t)}{dt} = -\frac{\beta I(t) S(t)}{n}, \qquad (2.12)

\frac{dI(t)}{dt} = \frac{\beta I(t) S(t)}{n} - \gamma I(t), \qquad (2.13)

\frac{dR(t)}{dt} = \gamma I(t). \qquad (2.14)
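The SIR equations can likewise be integrated numerically; in this illustrative sketch the basic reproduction number is β/γ = 3, so a large outbreak occurs but some susceptible nodes escape infection entirely:

```python
def sir_model(n, beta, gamma, i0, dt=0.01, steps=5000):
    """Euler integration of Eqs. (2.12)-(2.14); returns the final (S, I, R)."""
    S, I, R = float(n - i0), float(i0), 0.0
    for _ in range(steps):
        new_inf = beta * I * S / n * dt   # flow S -> I
        new_rec = gamma * I * dt          # flow I -> R
        S -= new_inf
        I += new_inf - new_rec
        R += new_rec
    return S, I, R

# Illustrative parameters with beta / gamma = 3.
S, I, R = sir_model(n=1000, beta=0.3, gamma=0.1, i0=1)
print(round(S), round(I), round(R))  # most nodes end up recovered; some never infected
```

Note that S + I + R stays equal to n throughout, since the three flows only move population between compartments.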

Susceptible-Infected-Susceptible (SIS) Model In this model, infected nodes
can become susceptible again after they are cured (Fig. 2.6c). This model stands
for the infection and recovery process S → I → S. Suppose γ is the transition
rate from infected status to susceptible status; then the SIS model can be expressed
by the following formula:

\frac{dS(t)}{dt} = -\frac{\beta I(t) S(t)}{n} + \gamma I(t), \qquad (2.15)

\frac{dI(t)}{dt} = \frac{\beta I(t) S(t)}{n} - \gamma I(t), \qquad (2.16)
The number of infected nodes and susceptible nodes as a function of t based on
the SI, SIR, and SIS models are shown in Fig. 2.7a, b and c, respectively. There
are also many other epidemic models, such as SIRS [166], SEIR [196], MSIR
[78], SEIRS [36]. Readers could refer to the work of [190] and [176] for more
epidemic models.

Fig. 2.7 The propagation process based on different models. (a) SI model. (b) SIR model. (c) SIS
model
Chapter 3
User Influence in the Propagation
of Malicious Attacks

Networks portray a multitude of interactions through which people meet, ideas are
spread, and infectious diseases and malicious rumors propagate within a society.
Recently, researchers have found that unsolicited malicious attacks spread extremely
fast through influential spreaders [42]. For example, on April 23, 2013, the Twitter
account of the Associated Press was hacked to spread the rumor that explosions at
the White House had injured President Obama. This led to both the Dow Jones Industrial
Average and the Standard & Poor’s 500 Index plunging about 1% before regaining their losses
[143]. Hence, identifying the most efficient ‘spreaders’ in a network becomes an
important step towards restraining spread of malicious attacks. In this chapter, we
investigate the methods of measuring influence of network nodes.

3.1 Introduction

The propagation of malicious attacks has long been a critical problem in various
forms of networks. For example, rumors spread incredibly fast in online social
networks [42]. Computer viruses spread throughout the Internet and compromise
millions of computers [177]. In Smart Grids, isolated failures lead to rolling blackouts in cities [194]. Influential users can initiate and conduct the dissemination of information more efficiently than normal users. Therefore, influential users in networks are normally responsible for large cascades of malicious attacks.
Researchers have developed many methods to expose influential users in networks. The simplest measure is the degree of a node, which counts the number of edges incident on it [154]. Generally, large-degree nodes correspond to popular users in social networks. The eigenvector centrality [20] is an extension of the degree measure. Unlike the degree, which weights every neighboring node equally, the eigenvector centrality weights the neighboring nodes according to their importance. The Katz centrality [92] is another extension of the degree measure.

© Springer Nature Switzerland AG 2019 21


J. Jiang et al., Malicious Attack Propagation and Source
Identification, Advances in Information Security 73,
https://doi.org/10.1007/978-3-030-02179-5_3

The node degree counts only the direct neighbors, while the Katz centrality counts all reachable nodes, penalizing the contributions of distant nodes. A more sophisticated centrality measure is closeness [66], which
is the mean geodesic (i.e., shortest-path) distance from the node of interest to all
other reachable nodes. The closeness measures the efficiency of a node distributing
information to any node in networks.
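These classic centralities are straightforward to compute directly. The sketch below implements degree, eigenvector, Katz, and closeness centrality from scratch on a small made-up graph (a hub with a short chain attached); the graph, the Katz damping factor, and the iteration counts are illustrative choices, not details from the text.

```python
from collections import deque

# Adjacency list of a small made-up graph: hub 0 with leaves 1-3,
# plus a short chain 0-4-5.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
n = len(adj)

# Degree centrality: number of incident edges, normalised by n - 1.
degree = {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Eigenvector centrality by power iteration on A + I (adding x[v]
# damps oscillations on bipartite graphs): a node is important if
# its neighbours are important.
x = {v: 1.0 for v in adj}
for _ in range(200):
    x = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
    norm = max(x.values())
    x = {v: s / norm for v, s in x.items()}
eigenvector = x

# Katz centrality: every node gets a base score of 1, and walks are
# damped by a factor alpha at each step.
alpha = 0.1
katz = {v: 1.0 for v in adj}
for _ in range(200):
    katz = {v: 1.0 + alpha * sum(katz[u] for u in adj[v]) for v in adj}

# Closeness centrality: (n - 1) / (sum of BFS distances to all others).
def closeness(v):
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return (n - 1) / sum(dist.values())

# On this toy graph the hub ranks first under all four measures.
assert max(degree, key=degree.get) == 0
```

On graphs where popularity and reachability diverge, the four measures rank nodes differently, which is exactly why the more refined measures below are needed.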
Another important class of centrality measures are betweenness measures. In
1977, Freeman [64] proposed the shortest-path betweenness which is defined as the
fraction of shortest paths between node pairs in a network that pass through the node
of interest. The shortest-path betweenness is the simplest and most widely used
betweenness measure, which is usually regarded as a measure of influence a user
possesses over information spreading between any pair of users. However, in most
networks, information does not spread only along the geodesic paths. To address
this problem, in 1991, Freeman et al. [67] proposed a more complex betweenness
measure, usually known as the flow betweenness. The flow betweenness is based on
the idea of maximum flow, which is defined as the number of flow units through
the node of interest when the maximum flow is transmitted between node pairs.
For these two betweenness measures, the information needs to “know” the ideal
route (shortest or maximum-flow path) from one node to another. However, the
ideal routes between node pairs are normally unknown during the transmission,
and the information wanders around randomly in the network until it reaches the
destination. Accordingly, in 2005, Newman [132] proposed a new betweenness
measure based on random walks. The random-walk betweenness counts how often
a node is traversed by a random walk between node pairs.
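Of these betweenness measures, the shortest-path variant is the simplest to compute directly. The sketch below evaluates it by brute force with BFS path counts on a small made-up path graph; flow and random-walk betweenness additionally require maximum-flow and matrix computations and are omitted here.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Distances from s and numbers of geodesic paths, by plain BFS."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                q.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def shortest_path_betweenness(adj):
    """Fraction of geodesic paths between each pair (s, t) that pass
    through the node of interest, averaged over the n(n-1)/2 pairs."""
    n = len(adj)
    b = {v: 0.0 for v in adj}
    info = {s: bfs_counts(adj, s) for s in adj}
    for s, t in combinations(adj, 2):
        d_s, sig_s = info[s]
        d_t, sig_t = info[t]
        if t not in d_s:
            continue  # unreachable pair
        for v in adj:
            if v in (s, t) or v not in d_s or v not in d_t:
                continue
            if d_s[v] + d_t[v] == d_s[t]:  # v lies on some geodesic
                b[v] += sig_s[v] * sig_t[v] / sig_s[t]
    norm = 0.5 * n * (n - 1)
    return {v: x / norm for v, x in b.items()}

# On the path graph 0-1-2-3-4 the middle node carries most geodesics.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
bsp = shortest_path_betweenness(path)
assert max(bsp, key=bsp.get) == 2
```

The middle node scores 4/10 here (four of the ten node pairs route through it), while its neighbors score 3/10 each.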
In this chapter, we study the betweenness measures and their application to capturing influential nodes in epidemics, in which malicious information is disseminated from one node to multiple neighboring nodes and destinations. The
traditional betweenness measures have proved of great value to the analysis of node
influence in complex networks [34, 136, 187]. However, these measures focus on the dissemination of information from one node to another, rather than from one node to many. This is conceptually unsuitable for real-world malicious attacks, in which malicious information propagates along multiple paths to all reachable nodes.
Wen et al. [182] proposed a betweenness measurement, epidemic betweenness, to
measure influence of nodes in epidemics.
As epidemic incidents may start from any node in a network, the node of interest may be the epidemic source or an intermediary forwarding the received information to its neighbors. As for the influence of a node, we normally consider a node influential to an epidemic if it can influence a large number of subsequent nodes after it becomes influenced itself. Formally, the
epidemic betweenness of an arbitrary node i, bEP (i), is the expected number of
nodes that are influenced directly or indirectly by node i after i becomes influenced
by epidemics. The value bEP(i) is averaged over the epidemic incidents that start from all possible sources in the network. Hence, the epidemic betweenness reflects
the potential influence of a node to any epidemic in a complex network.

3.2 Problem Statement

The traditional betweenness measures (i.e., shortest-path [64], flow [67] and
random-walk [132] betweenness) have long been employed to locate the influential
nodes in complex networks. However, these measures are conceptually not suitable
for the epidemics in which information spreads from one node to multiple receivers
rather than the transmission from one to another. In this section, we discuss the
difference between epidemic betweenness and traditional betweenness measures in
estimating the influence of network nodes.
Epidemic Betweenness vs. Shortest-Path Betweenness The shortest-path
betweenness centrality is defined as the fraction of the geodesic (i.e., shortest)
paths between node pairs that pass through the node of interest in a network. To be
precise, suppose that $g_i^{(st)}$ is the number of geodesic paths from node s to t that pass through node i, and $g^{(st)}$ is the total number of geodesic paths from s to t. Then the shortest-path betweenness centrality of node i is

$$b_{SP}(i) = \frac{\sum_{s<t} g_i^{(st)}/g^{(st)}}{(1/2)\,n(n-1)}, \qquad (3.1)$$

where n is the total number of nodes in the network. The shortest-path betweenness stands for the ability of a node to relay information between an arbitrary pair of nodes in the network, under the assumption that the shortest paths are always chosen.
In epidemics, geodesic paths of the same length between a pair of nodes may carry different propagation probabilities. The nodes on these paths will have the same shortest-path betweenness. However, because information traverses paths with different propagation probabilities at different likelihoods, these nodes may have different epidemic influences. Therefore, the shortest-path betweenness is not suitable for exposing the influence of nodes in epidemics.
We introduce a simple example to explain the problem. As shown in Fig. 3.1(I),
two large groups are bridged by connections among just a few nodes. The weight
on the path “A − C1 − B” is “0.1 + 0.9”, while on the path “A − C2 − B” it is
“0.5 + 0.5”. All shortest paths between the two groups must pass through C1 or
C2 . As the weights on “A − C1 − B” and “A − C2 − B” are equal to 1, node C1
and C2 will get the same shortest-path betweenness values in this case. However,
the influential scale of node C1 and C2 will be different. When epidemics start from
group 1 to group 2 or in the reverse case, the epidemic distribution probability of
choosing path “A−C1 −B” is 0.09, while the probability of choosing the alternative
path “A − C2 − B” is 0.25 which is much higher than the former. Therefore, node
C2 has larger influence in epidemics than C1 . This example explains the reason why
the shortest-path betweenness cannot reflect the influence of node C1 and C2 in the
network.
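The two probabilities quoted above are simply the products of the per-edge propagation probabilities along each path, as a quick check confirms:

```python
from math import prod

# Per-edge propagation probabilities along the two equal-length paths.
path_via_c1 = [0.1, 0.9]   # A - C1 - B
path_via_c2 = [0.5, 0.5]   # A - C2 - B

p1 = prod(path_via_c1)     # probability the epidemic crosses via C1
p2 = prod(path_via_c2)     # probability the epidemic crosses via C2

# Equal path lengths give C1 and C2 identical shortest-path betweenness,
# yet C2 is almost three times as likely to relay the epidemic.
assert p2 > p1
```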

Fig. 3.1 Simple examples to illustrate the shortcomings of the traditional betweenness measures in epidemics. (I) Node C1 has the same shortest-path betweenness as node C2, but they perform differently in epidemics. (II) Node C has low flow betweenness but high influence in epidemics. (III) Node C has high random-walk betweenness but low influence in epidemics

Epidemic Betweenness vs. Flow Betweenness The flow betweenness centrality


of an arbitrary node i is defined as the amount of flow through node i when the
maximum flow is transmitted between node pair (s, t), averaged over all s and t,
as in
$$b_{FL}(i) = \frac{\sum_{s<t} m_i^{(st)}}{\sum_{s<t} m^{(st)}}, \qquad (3.2)$$

where $m_i^{(st)}$ is the amount of the maximum flow from node s to t that passes through i, and $m^{(st)}$ is the value of the maximum flow from s to t. In practical terms, one can think of flow betweenness
as measuring the betweenness of nodes in a network in which a maximal amount of
information is continuously pumped between each pair of nodes.
The flow betweenness measure only concerns the paths which contribute to the maximum flow between node pairs. The nodes on the paths which do not contribute to the maximum flow are neglected in the computation of flow betweenness. However, these nodes may also receive and forward information in epidemics, and thus still have influence on the spreading. This causes the flow betweenness measure to fail to reflect the influence of nodes in epidemics.
Consider, for example, the network shown in Fig. 3.1(II), which again has two
large groups joined by a few connections. In this case, the maximum flow from
one group to the other is limited to two units, one unit flowing through node A
and one unit flowing through node B. Therefore, node A and B will get high flow
betweenness value, and node C will be neglected. However, node C can contribute
to epidemics. Suppose an epidemic started from group 1 and arrived at node A, then

node C can influence a large number of nodes in group 2 after node C is infected
by A. The same thing occurs when the epidemic starts from group 2. This example
explains the reason why the flow betweenness measure cannot reflect the influence
of node C in the network.
Epidemic Betweenness vs. Random-Walk Betweenness The random-walk
betweenness of an arbitrary node i counts how often node i is traversed by a random walk between node pair (s, t), averaged over all s and t, as in

$$b_{RW}(i) = \frac{\sum_{s<t} I_i^{(st)}}{(1/2)\,n(n-1)}, \qquad (3.3)$$

where $I_i^{(st)}$ is the net flow of a random walk from s to t that passes through node i. This
measure is appropriate to a network in which information wanders about essentially
at random until it finds its target.
For the node of interest in a network, it will have a nonzero random-walk betweenness value if it lies on possible paths between node pairs. However, if the neighbors of this node always receive the information earlier, this node will not make any contribution to the epidemics, as its neighbors have already been influenced. Therefore, the random-walk betweenness fails to reflect the influence of some nodes in epidemics.
Consider the network sketched in Fig. 3.1(III) which again has two large groups
joined by a few connections. In this case, since node C is one of the nodes
connecting the two groups, it will get relatively high random-walk betweenness
value. However, the influence of node C will be very low in epidemics. Suppose an epidemic starts from group 1 and arrives at node A; then nodes B and C will be influenced by A simultaneously. Node B will then influence the nodes in group 2, while node C cannot continue the epidemic since its neighbors (nodes A and B) have already been influenced. As a result, node C has no influence in epidemics. The
same thing occurs when the epidemic starts from group 2. This example explains the
reason why the random-walk betweenness measure cannot reflect the influence of
node C in the network.

3.3 Epidemic Betweenness

In this section, we start by introducing a widely used propagation model which


mathematically presents the spreading dynamics of information in a network
[186, 187]. We further present the computation of the epidemic betweenness
based on this model, and use a simple example to explain the advantages of
epidemic betweenness. The computational complexity of epidemic betweenness is
also analyzed.

3.3.1 Information Propagation Model

In the real world, an arbitrary user can receive information and forward it to the
topological neighbors. Let random variable Xi (t) represent the state of user i at
discrete time t. According to the concept from pathology, the values of Xi (t) can be
represented as follows:

$$X_i(t) = \begin{cases} \text{Sus.}, & \text{susceptible} \\ \text{Con.}, & \text{contagious} \\ \text{Dor.}, & \text{dormant} \end{cases} \qquad (3.4)$$

A user in the Con. or Dor. state is considered infected.

Every user is presumed to be susceptible (Xi (0)=Sus.) at the beginning. Users


become infected if they are influenced by the epidemics. Because users seldom
forward the same information multiple times to their neighbors, it is reasonable
to assume that users will distribute the information only once when they become
contagious (Xi (t) = Con.). After that, they will stay dormant (Xi (t) = Dor.) and
never contribute to the epidemics. Figure 3.2 presents the state transition graph for
an arbitrary user i. Note that the Dor. state is an absorbing state and the contagious
users will directly transfer to the Dor. state.
The network topology is the basic element for information propagation in a network. Here, an n × n square matrix with elements ηij (ηij ∈ [0, 1]) is employed to describe the topology of a network with n nodes. The value of ηij denotes the probability of information spreading from user i to j. If user i has contact with j, ηij > 0, and ηij = 0 otherwise.
Given a network topology with n nodes, the number of susceptible users at time
t, S(t), is


$$S(t) = \sum_{i=1}^{n} P\big(X_i(t) = \text{Sus.}\big), \qquad (3.5)$$

where P (·) denotes the probability of a variable. Then, the number of infected nodes
at time t, I (t), can be derived as in

I (t) = n − S(t). (3.6)

Fig. 3.2 The state transition graph of a node in the topology. The grey states mean that the users have been infected

As shown in Fig. 3.2, v(i, t) denotes the probability of user i becoming contagious.
Then, the value of P (Xi (t) = Sus.) can be iterated using a discrete difference
equation as in


$$P\big(X_i(t) = \text{Sus.}\big) = \big(1 - v(i,t)\big)\,P\big(X_i(t-1) = \text{Sus.}\big). \qquad (3.7)$$

R(i, t) denotes the probability of user i not receiving or accepting the information.
Since the information comes from topological neighbors, the value of R(i, t) can be
derived by assuming all the neighbors cannot successfully forward the information
to user i. Then, according to the multiplication principle, we have


$$R(i,t) = \prod_{j \in N_i} \Big(1 - \eta_{ji}\,P\big(X_j(t-1) = \text{Con.}\big)\Big), \qquad (3.8)$$

where Ni denotes the set of user i’s neighbors. Following the definition of R(i, t),
Wen et al. [182] derived the value of v(i, t) as in

v(i, t) = 1 − R(i, t). (3.9)

According to the state transition graph in Fig. 3.2, the value of P (Xi (t) = Con.)
can be derived as in

P (Xi (t) = Con.) = P (Xi (t − 1) = Sus.) · v(i, t). (3.10)

Note that the length of each time tick relies on the real environment. It can be 1 min,
1 h or 1 day.
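A minimal sketch of this discrete-time model is below. Seeding the epidemic by making one chosen source node contagious at t = 0 is an assumption of the sketch (the text leaves the initial condition implicit), and the 3-node chain with ηij = 0.6 is a made-up illustration.

```python
def propagate(eta, source, T):
    """Iterate Eqs. (3.7)-(3.10): per-node probabilities of being
    susceptible / contagious at each discrete time tick 0..T.

    eta[i][j] is the probability that i forwards the information to j
    (0 when there is no edge); `source` starts contagious at t = 0.
    """
    n = len(eta)
    p_sus = [[0.0] * n for _ in range(T + 1)]
    p_con = [[0.0] * n for _ in range(T + 1)]
    for i in range(n):
        p_sus[0][i] = 0.0 if i == source else 1.0
    p_con[0][source] = 1.0
    for t in range(1, T + 1):
        for i in range(n):
            # R(i,t): no neighbour succeeds in infecting i   (Eq. 3.8)
            r = 1.0
            for j in range(n):
                r *= 1.0 - eta[j][i] * p_con[t - 1][j]
            v = 1.0 - r                                    # Eq. (3.9)
            p_sus[t][i] = (1.0 - v) * p_sus[t - 1][i]      # Eq. (3.7)
            p_con[t][i] = p_sus[t - 1][i] * v              # Eq. (3.10)
    return p_sus, p_con

# A 3-node chain 0-1-2 with a homogeneous propagation probability 0.6.
eta = [[0.0, 0.6, 0.0], [0.6, 0.0, 0.6], [0.0, 0.6, 0.0]]
p_sus, p_con = propagate(eta, source=0, T=3)
# I(t) = n - S(t) (Eqs. 3.5-3.6): expected infections never decrease.
infected = [3 - sum(row) for row in p_sus]
assert infected == sorted(infected)
```

Note that p_con[t][i] is the probability of *becoming* contagious exactly at tick t, so a node contributes to the spreading only once, matching the Con. → Dor. transition.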

3.3.2 Epidemic Influence

Given a network and an epidemic incident starting at node s, Wen et al. [182] introduced the influence of an arbitrary node i on this epidemic incident, $A_{i|s}$, as the expected number of subsequent nodes which can be infected by node i after i gets infected. Node i may get infected at any time in the epidemic propagation dynamics. Therefore, $A_{i|s}^t$ denotes the influence of node i if i gets infected at time t. Based on the mathematical model, Wen et al. [182] estimated the overall influence of node i in the epidemic incident, $A_{i|s}$, as




$$E\big(A_{i|s}\big) = \sum_{t=0}^{\infty} P\big(X_i(t) = \text{Con.}\big) \cdot E\big(A_{i|s}^t\big). \qquad (3.11)$$

Fig. 3.3 Example of the calculation of Eqs. (3.12) and (3.13). In this example, we have already obtained the influence of node j, $E(A_{j|s}^{t+1})$, in this epidemic incident originating from node s. As both node h and node i can infect node j, we need to calculate the contribution of each to the infection of node j. The contribution from node h or i to node j is determined by the ratio of their infection probabilities, $\delta_{ij}^t$ and $\delta_{hj}^t$. The influence of node i will be the proportional part of node j's influence and contagious probability

In epidemics, nodes receive and send information to their neighbors. Therefore, the influence of node i at time t, $A_{i|s}^t$, can be derived from the influence of its neighbors at time t + 1. Then $A_{i|s}^t$ can be computed as in


$$E\big(A_{i|s}^t\big) = \sum_{j \in N_i} \delta_{ij}^t \Big( E\big(A_{j|s}^{t+1}\big) + P\big(X_j(t+1) = \text{Con.}\big) \Big), \qquad (3.12)$$

where $\delta_{ij}^t$ denotes the ratio of node i's contribution to the infection of node j at time t among all the neighboring nodes of node j, and

$$\delta_{ij}^t = \frac{P\big(X_i(t) = \text{Con.}\big)\,\eta_{ij}}{\sum_{k \in N_j} P\big(X_k(t) = \text{Con.}\big)\,\eta_{kj}}. \qquad (3.13)$$

Figure 3.3 introduces an example to explain the calculation details. As shown in Fig. 3.2, the Dor. state is an absorbing state. Given a network with a finite number of nodes, the spread of information eventually becomes steady, and the values of $E(A_{i|s}^t)$ and $P(X_i(t) = \text{Con.})$ converge to 0 when t is sufficiently large. Thus, the influence of nodes in a network can be calculated recursively backwards by setting a sufficiently large final time for the epidemic incident.

3.3.3 Computation of Epidemic Betweenness

The analysis in Sect. 3.3.2 fixes the position of starting nodes in epidemics. In
fact, epidemics may start from any node in the network. Therefore, the epidemic
influence of an arbitrary node i regardless of the starting node s should be averaged
over all the possible positions of the starting nodes in the network. As the epidemic

betweenness stands for the epidemic influence regardless of the starting nodes, Wen
et al. [182] computed the epidemic betweenness of an arbitrary node i, bEP (i), as in

$$b_{EP}(i) = \frac{1}{n} \sum_{s=1}^{n} E\big(A_{i|s}\big), \qquad (3.14)$$

where E(Ai|s ) is calculated from (3.11). Note from (3.14) that bEP (i) only relies on
the structure of the topology regardless of the starting node s. Because P (Xi (t) =
Con.) can be rapidly derived by iterations, the epidemic betweenness of each node
bEP (i) can be calculated efficiently.
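Putting the forward model of Sect. 3.3.1 together with the backward recursion of Eqs. (3.11)-(3.13) and the average of Eq. (3.14), a compact, unoptimized sketch of the whole computation might look as follows. The seeding convention (source contagious at t = 0) and the truncation horizon T are assumptions of this sketch, not details fixed by the text.

```python
def epidemic_betweenness(eta, T=30):
    """Sketch of epidemic betweenness (Eqs. 3.11-3.14): a forward pass
    yields the per-tick contagion probabilities, a backward recursion
    yields each node's expected influence, averaged over all sources.

    eta[i][j] is the propagation probability from i to j (0 if no edge).
    """
    n = len(eta)
    b = [0.0] * n
    for s in range(n):
        # --- forward pass: Eqs. (3.7)-(3.10) ---
        p_sus = [0.0 if i == s else 1.0 for i in range(n)]
        p_con = [[0.0] * n for _ in range(T + 1)]
        p_con[0][s] = 1.0
        for t in range(1, T + 1):
            new_sus = [0.0] * n
            for i in range(n):
                r = 1.0  # probability that no neighbour infects i
                for j in range(n):
                    r *= 1.0 - eta[j][i] * p_con[t - 1][j]
                v = 1.0 - r
                new_sus[i] = (1.0 - v) * p_sus[i]
                p_con[t][i] = p_sus[i] * v
            p_sus = new_sus
        # --- backward recursion: Eqs. (3.12)-(3.13) ---
        A = [0.0] * n  # E(A^T_{i|s}) is ~0 at the (large) final time
        for t in range(T - 1, -1, -1):
            A_t = [0.0] * n
            for i in range(n):
                for j in range(n):
                    if eta[i][j] == 0.0:
                        continue
                    denom = sum(p_con[t][k] * eta[k][j] for k in range(n))
                    if denom > 0.0:
                        delta = p_con[t][i] * eta[i][j] / denom
                        A_t[i] += delta * (A[j] + p_con[t + 1][j])
            A = A_t
            for i in range(n):       # accumulate Eq. (3.11)
                b[i] += p_con[t][i] * A[i]
    return [x / n for x in b]        # Eq. (3.14)

# On a 3-node chain with homogeneous eta = 0.6, the middle node is
# the most influential, as expected.
eta = [[0.0, 0.6, 0.0], [0.6, 0.0, 0.6], [0.0, 0.6, 0.0]]
b = epidemic_betweenness(eta, T=20)
assert b.index(max(b)) == 1
```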

3.3.4 Simple Examples

In this subsection, a few simple examples are used to illustrate the calculation of
different betweenness measures. The examples are based on the graphs sketched in Fig. 3.1, with each of the two groups consisting of a complete graph of five nodes. The details of the examples are shown in Fig. 3.4. Note that the unweighted links in
Fig. 3.4 have weight 1 by default. The results of different betweenness measures are
listed in Table 3.1.
Firstly, in Fig. 3.4(I), nodes C1 and C2 have the same shortest-path betweenness values (bSP(C1) = bSP(C2) = 0.1524), but their epidemic betweenness results differ (bEP(C1) ≠ bEP(C2)). This confirms the previous analysis that node C2 has higher influence than node C1. Secondly, in Fig. 3.4(II), node C has a very

Fig. 3.4 Example networks of the types sketched in Fig. 3.1, with the groups represented by completely connected graphs of five vertices each. Note that the unweighted links have weight 1 by default

Table 3.1 Comparison of different betweenness measures in Fig. 3.4

Network       Node   Shortest-path   Flow     Random-walk   Epidemic
Network I     A      0.3476          0.2012   0.2340        0.1667
              B      0.3476          0.2348   0.2340        0.1782
              C1     0.1524          0.0295   0.1090        0.0712
              C2     0.1524          0.1305   0.1090        0.1197
              S      0               0.0505   0.0392        0.0592
              T      0               0.0505   0.0392        0.0569
Network II    A, B   0.1944          0.1661   0.1071        0.1188
              C      0.0111          0.0035   0.0532        0.0535
              S, T   0               0.0208   0.0568        0.0495
Network III   A, B   0.5000          0.3089   0.2674        0.2110
              C      0               0.1309   0.0829        0.0624
              S, T   0               0.0314   0.0478        0.0642

low flow betweenness value (bF L (C) = 0.0035), but this node provides a relatively
high epidemic betweenness result (bEP (C) = 0.0535). This confirms the previous
analysis that node C possesses large epidemic influence. Finally, in Fig. 3.4(III), node C has a large random-walk betweenness value (bRW(C) = 0.0829), but this node contributes less to the epidemics (bEP(C) = 0.0624). This result confirms the previous analysis that node C has low influence.

3.3.5 Computational Complexity

In Sect. 3.3.3, note that the computation of the epidemic betweenness consists of
two parts: (1) presenting the propagation dynamics and (2) computing the influence
of each node reversely.
The first part is mainly concerned with Eqs. (3.7), (3.8) and (3.10). At each
time tick t, we need to update the probabilities in Eqs. (3.7) and (3.10) for node
i when node i becomes contagious at time t − 1. Therefore, in the worst case, the
computation of these two equations for all nodes is O(n). In Eq. (3.8), at time t, we
need |Ni | multiplications to calculate the probability R(i, t) for each node i. The
probability R(i, t) will be updated when node i is linked to a contagious neighbor.
Therefore, the average computation of Eq. (3.8) becomes the product of the mean
of degree and the number of contagious nodes at time t.
In the following, we show the details of how to calculate the number of
contagious nodes in the worst case at time t. For convenience, we use c1 to denote
the mean degree of a node, i.e.,

$$c_1 = \langle k \rangle. \qquad (3.15)$$

According to Chap. 13 in [136], the mean number of second neighbors of a node is


 
$$c_2 = \langle k^2 \rangle - \langle k \rangle, \qquad (3.16)$$

and, the mean number of neighbors at distance t is

$$c_t = c_1 (c_2/c_1)^{t-1}. \qquad (3.17)$$

In the worst case, a source node can infect every reachable node at time t, i.e., the
nodes within t distance from the source node can be contagious at time t. Then, the
number of contagious nodes becomes


$$Q_t = \sum_{j=1}^{t} c_j = \frac{c_2^t - c_1^t}{c_2 - c_1} \cdot \frac{1}{c_1^{t-2}}. \qquad (3.18)$$

Then, the first part of the computation at time t is

$$O(n) + Q_t \cdot \langle k \rangle. \qquad (3.19)$$

Suppose the propagation ends at time T; then the first part of the computation of the epidemic betweenness becomes
  
$$\sum_{t=1}^{T}\Big(O(n) + Q_t \cdot \langle k \rangle\Big) = O(nT) + \langle k \rangle \cdot \frac{c_2 Q_T - c_1^2 T}{c_2 - c_1}, \qquad (3.20)$$

where $Q_T \le n$. In most real networks, the average degree of nodes, $\langle k \rangle$, is small [40]. Therefore, the first part of the computation of the epidemic betweenness can be rewritten as

$$O(nT) + O(\langle k \rangle \cdot nT) = O(nT). \qquad (3.21)$$
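The closed forms in Eqs. (3.17)-(3.18) can be sanity-checked numerically; the moments $\langle k \rangle = 3$ and $\langle k^2 \rangle = 15$ (so c1 = 3, c2 = 12) below are arbitrary illustrative values.

```python
# Numerical check of Eq. (3.18) against direct summation of Eq. (3.17).
c1, c2 = 3.0, 12.0  # illustrative <k> and <k^2> - <k>

def c(t):
    """Mean number of neighbours at distance t (Eq. 3.17)."""
    return c1 * (c2 / c1) ** (t - 1)

for t in range(1, 8):
    direct = sum(c(j) for j in range(1, t + 1))          # Q_t by summation
    closed = (c2**t - c1**t) / (c2 - c1) / c1**(t - 2)   # Eq. (3.18)
    assert abs(direct - closed) < 1e-9 * closed
```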

Given a finite network, the propagation ends within a limited number of steps (T). In the real world, because the structure of networks usually has small-world and scale-free features, information propagation throughout real networks is very fast [207]. Therefore, the value of T is usually small (T ≪ n).
The second part of the computation is mainly concerned with Eqs. (3.11), (3.12)
and (3.13). Given a propagation source s, the computation of Eq. (3.11) for all
nodes is O(nT). To calculate Eq. (3.13), we need an average of $\langle k \rangle$ operations. Therefore, the average computation of $E(A_{i|s}^t)$ in Eq. (3.12) is $\langle k \rangle^2$. Besides,
similar to Eq. (3.8), the number of contagious nodes needs to be determined at time
t. Therefore, the computation of Eq. (3.12) at time t for all nodes becomes


$$\langle k \rangle^2 \cdot \sum_{j=1}^{t} c_j. \qquad (3.22)$$

Therefore, the second part of the computation of epidemic betweenness is


  
$$\sum_{t=1}^{T}\Big(O(n) + Q_t \cdot \langle k \rangle^2\Big) = O(nT) + \langle k \rangle^2 \cdot O(nT) = O(nT). \qquad (3.23)$$

By combining the first and second parts, the computation for a single propagation source is O(nT). Since the epidemic betweenness in Eq. (3.14) averages over all n possible sources, the total computation is

$$O(n^2 T). \qquad (3.24)$$

The complexity of shortest-path betweenness is O(n³) using the Floyd-Warshall algorithm [64]. The complexity of random-walk betweenness is O((m + n)n²) [132], and that of flow betweenness is O(mn²) [67]. Compared with these three typical betweenness measures, the computational complexity of the epidemic betweenness is fairly low.

3.4 Evaluations

A series of experiments were carried out to evaluate the accuracy of the epidemic
betweenness. The experiments were conducted on both synthetic networks and real-
world networks. The synthetic networks include the Erdős-Rényi (ER) network [52],
the scale-free network [136] and the small-world network [179]. They are generated by the widely used open-source software Pajek [172], with 1000 nodes and an average degree of 2. The real networks are the Enron Email network [88], the
protein-protein interaction (PPI) network [82] and the U.S. Power-Grid network
[15]. The attributes of these real-world networks are presented in Appendix. In the
experiments, we draw the infection probabilities, ηij, from a Gaussian distribution, with the average infection probability, E(ηij), set to 0.6. The simulation
results are obtained by 1000 runs of experiments. Each run of the simulation stops
when there are no new contagious nodes in the network.

3.4.1 Accuracy in Measuring Influence

To evaluate the accuracy of the epidemic betweenness in measuring node influence,


we simulate the epidemic propagation in complex networks. The influence of an
arbitrary node will be estimated by the number of the following nodes infected by


Fig. 3.5 The scatter plot of the difference between the influence and epidemic betweenness of
nodes. The dots indicate their difference. The dash-line pairs indicate 10% away from the averaged
influence. (a) ER; (b) Scale free; (c) Small world


Fig. 3.6 The scatter plot of the difference between the influence and epidemic betweenness of
nodes. The dots indicate their difference. The dash-line pairs indicate 10% away from the averaged
influence. (a) Enron Email; (b) PPI; (c) Power Grid

this node, averaged over 1000 runs of experiments. The results will be considered
as the benchmark to evaluate the accuracy of the epidemic betweenness. We expect
the epidemic betweenness to be close to the simulation result on each node. To
be precise, we use b(i) to represent the influence of node i from simulations, and
use b̃ to denote the overall average of b(i). We consider the epidemic betweenness
to be very close to the simulation result when the error between them satisfies
|b(i) − bEP (i)|/b̃ < 10%.
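This acceptance criterion can be phrased as a small helper function; the node values in the example below are made up for illustration.

```python
def within_tolerance(b_sim, b_ep, tol=0.10):
    """Fraction of nodes whose epidemic betweenness deviates from the
    simulated influence by less than `tol` of the average influence,
    i.e. |b(i) - b_EP(i)| / b_bar < tol for each node i."""
    avg = sum(b_sim) / len(b_sim)
    hits = sum(1 for bs, be in zip(b_sim, b_ep) if abs(bs - be) / avg < tol)
    return hits / len(b_sim)

# Made-up values: three of the four nodes fall inside the 10% band.
assert within_tolerance([10.0, 8.0, 6.0, 4.0], [10.2, 8.5, 6.1, 2.0]) == 0.75
```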
The experiment results on the synthetic networks are shown in Fig. 3.5. We
introduce a pair of red dashed lines in each subplot to indicate the boundaries of
10% × b̃. It can be seen that the majority of the results fall within the boundaries
in each synthetic network. This indicates that the epidemic betweenness measure
can accurately reflect the influence of nodes in synthetic networks. The results on
the real networks are shown in Fig. 3.6. In this case, we also introduce a pair of
red dashed lines to show the boundaries. Similarly, nodes seldom fall outside the
boundaries in each real network, which indicates that the epidemic betweenness
measure can accurately describe the influence of nodes in real networks.

3.4.2 Comparison with Other Measures of Influence

In this part, we compare epidemic betweenness with the traditional betweenness


measures and some classic centrality measures, including degree, closeness, eigen-
vector and Katz centralities. The degree [154] stands for the number of edges
connected to the node of interest. Newman [136] claimed that the degree is a
measure of the popularity of a user. The eigenvector and Katz centrality are
essentially the extensions of the degree measure [20, 92]. The closeness centrality
[66] is defined as the mean geodesic distance from a node to any other reachable
nodes. This measure discloses the nodes which can rapidly distribute information.
Therefore, we further compare the epidemic betweenness to the traditional between-
ness and other classic centrality measures.
We continue to use the simulation results in Sect. 3.4.1 as the benchmark for
the comparisons. We sort the nodes by the influence exposed by each measure and by the simulations, respectively. Given a sampling ratio λ, we fetch the top λ fraction of the sorted nodes for each measure and for the simulations. The intersection of these sorted results reflects the accuracy of the different measures in presenting the influence of nodes in epidemics. The intersection results in the synthetic and real networks are shown in
Figs. 3.7 and 3.8, respectively. We use the red dotted line to show the intersection
percentage of the influential nodes exposed by the epidemic betweenness measure
and the simulations. It can be seen that the red dotted lines achieve higher
intersection ratios in all the subplots, particularly when the sampling ratio is
smaller than 30%. This indicates that the epidemic betweenness measure can
more accurately identify influential nodes than any other measures. Therefore, the
epidemic betweenness measure is superior to the other measures in presenting the
influence of nodes in epidemics.
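The intersection metric just described can be sketched as follows; the scores are made-up illustrations, and ties are broken by sort order.

```python
def intersection_ratio(measure, benchmark, lam):
    """Overlap between the top-lam fraction of nodes ranked by a
    centrality measure and by the simulated influence (benchmark)."""
    k = max(1, int(lam * len(benchmark)))
    top_m = set(sorted(measure, key=measure.get, reverse=True)[:k])
    top_b = set(sorted(benchmark, key=benchmark.get, reverse=True)[:k])
    return len(top_m & top_b) / k

# Made-up scores for six nodes: the measure agrees with the benchmark
# on two of the three top-ranked nodes at lam = 0.5.
measure = {'a': 0.9, 'b': 0.8, 'c': 0.1, 'd': 0.7, 'e': 0.2, 'f': 0.3}
benchmark = {'a': 0.9, 'b': 0.7, 'c': 0.8, 'd': 0.1, 'e': 0.2, 'f': 0.3}
assert intersection_ratio(measure, benchmark, lam=0.5) == 2 / 3
```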

[Plots omitted; measures compared: Shortest path, Flow, Random walk, Degree, Closeness, Eigenvector, Katz, Epidemic]

Fig. 3.7 The intersection percentage of the influential nodes identified from simulations and the
influential nodes captured by different measures. (a) ER; (b) Scale free; (c) Small world

[Plots omitted; measures compared: Shortest path, Flow, Random walk, Degree, Closeness, Eigenvector, Katz, Epidemic]

Fig. 3.8 The intersection percentage of the influential nodes identified from simulations and the
influential nodes captured by different measures. (a) Enron Email; (b) PPI; (c) Power Grid

Fig. 3.9 Scatter plots of the epidemic betweenness of nodes in the Enron Email network, against
the traditional betweenness measures: (a) shortest-path betweenness; (b) flow betweenness; (c)
random-walk betweenness. The dotted lines indicate the best linear fits in each case

3.5 Correlation Analysis

3.5.1 Correlation with Traditional Betweenness

In this subsection, we analyze the correlation between epidemic betweenness and


some traditional centrality measures. The analysis is carried out in the three real-
world networks: Enron Email network, PPI network and Power Grid network. The
scatter plots of the epidemic betweenness against the traditional betweenness values
on the three networks are shown in Figs. 3.9, 3.10 and 3.11. As we can see, in the
Enron Email network, the epidemic betweenness is moderately correlated with the
flow betweenness but highly correlated with the shortest-path and random-walk
betweenness. Overall, the nodes with higher traditional betweenness tend to have
higher epidemic betweenness. However, some nodes that have high (low) epidemic
betweenness also possess relatively low (high) traditional betweenness, especially
flow betweenness. In the PPI network and the Power Grid network, the epidemic
betweenness is highly correlated with the shortest-path betweenness but moderately

Fig. 3.10 Scatter plots of the epidemic betweenness of nodes in the PPI network, against the
traditional betweenness: (a) shortest-path betweenness; (b) flow betweenness; (c) random-walk
betweenness. The dotted lines indicate the best linear fits in each case

Fig. 3.11 Scatter plots of the epidemic betweenness of nodes in the Power Grid network, against
the traditional betweenness: (a) shortest-path betweenness; (b) flow betweenness; (c) random-walk
betweenness. The dotted lines indicate the best linear fits in each case

correlated with the random-walk and flow betweenness. Similar to the results on the
Enron Email network, some nodes that have high (low) epidemic betweenness also
correspond to relatively low (high) traditional betweenness.
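The agreement shown in the scatter plots can be quantified as a Pearson correlation over per-node scores, which is what the linear fits suggest. A minimal sketch; the two score lists below are invented for illustration, not data from the experiments:

```python
# Sketch: Pearson correlation between two centrality score lists.
# The node scores are hypothetical, not taken from the book's datasets.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-node scores: epidemic vs. shortest-path betweenness.
epidemic = [0.9, 0.7, 0.4, 0.2, 0.1]
shortest = [0.8, 0.75, 0.3, 0.25, 0.05]
r = pearson(epidemic, shortest)   # close to 1: the two rankings largely agree
```

A value of r near 1 corresponds to the tight clouds around the dotted lines in the Enron Email plots, while the weaker flow-betweenness correlation would show up as a smaller r.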
In general, some, but not all, of the high traditional betweenness nodes also pos-
sess high epidemic betweenness. Since information may not always spread through
the shortest paths, some nodes with high shortest-path betweenness have relatively
low epidemic betweenness. Similarly, information may not always spread through
the maximum-flow paths, so some nodes with high flow betweenness show low
epidemic betweenness. Furthermore, although information spreads randomly in
networks, it also complies with the propagation probabilities between nodes; the
random-walk betweenness therefore cannot accurately describe the influence of
nodes either. Epidemic betweenness considers not only the propagation probabilities
between nodes but also the influential scale of a node when the propagation is
initiated from an arbitrary node. Thus, it can describe the influence of a node more
accurately.
We have observed in the experiments that the shortest-path betweenness is
highly correlated with the epidemic betweenness. Although these two measures
are different, we can prove that the shortest-path betweenness is equivalent to the
epidemic betweenness in identifying influential network nodes under the extreme
condition ηij = 1. In fact, when ηij = 1, the information in epidemics spreads
along the shortest paths in networks. The epidemic influence of an arbitrary node i
is then determined by the number of shortest paths between any pair of nodes that
pass through node i, which matches the definition of the shortest-path betweenness.
Therefore, under this extreme condition, the nodes with high epidemic influence
also have large shortest-path betweenness values, and the two measures are
equivalent in identifying influential nodes.

3.5.2 Correlation with Classic Centrality Measures

In this subsection, we analyze the correlation between the epidemic betweenness
and other classic centrality measures, including the degree [154], closeness [66],
eigenvector [20] and Katz [92] centralities, in the three real-world networks. The results in
the Enron Email network are shown in Fig. 3.12. As the figure shows, the epidemic
betweenness is highly correlated with the degree, which means the degree increases
with the epidemic betweenness of nodes. However, the epidemic betweenness is
not highly correlated with the other three measures. The experiment results in
the PPI network are shown in Fig. 3.13. Similarly, the epidemic betweenness is
highly correlated with both the degree and Katz centrality measures. However, the
correlations between the epidemic betweenness and the closeness and Eigenvector
centralities are low. The results in the Power Grid network are shown in Fig. 3.14.
The epidemic betweenness is not strongly correlated with any of the classic centrality measures.
The results presented in Figs. 3.12, 3.13 and 3.14 show that the epidemic
betweenness measure is different from the classic centrality measures. As a result,
previous measures cannot replace the epidemic betweenness measure. As the
evaluation in Sect. 3.4 has shown, the epidemic betweenness can accurately
capture the influence of nodes in complex networks, so it is of great significance to
utilize the epidemic betweenness for evaluating node influence.

Fig. 3.12 Scatter plots of the epidemic betweenness against other centralities in Enron Email. (a)
Degree; (b) closeness; (c) eigenvector; (d) Katz

Fig. 3.13 Scatter plots of the epidemic betweenness against other centralities in PPI. (a) Degree;
(b) closeness; (c) eigenvector; (d) Katz

Fig. 3.14 Scatter plots of the epidemic betweenness against other centralities in Power Grid. (a)
Degree; (b) closeness; (c) eigenvector; (d) Katz

3.6 Related Work

The techniques of exposing the influential nodes can be divided into two distinct
classes: (1) How can we identify a set of k starting nodes, so that once they are
influenced, they will infect the largest number of susceptible nodes in the network?
(2) How can we identify a set of k nodes, so that when they are immunized to the
epidemic, they will have the largest impact in preventing the susceptible nodes from
being infected? We explain these two classes in the following.
For the methods in class (1), the early work came from P. Domingos et al. [43]
by using a Markov random field model to compute the network ‘value’ of each
node. D. Kempe et al. followed the work in [43] and pointed out that maximizing
the influence of nodes is an NP-hard problem [95]. Accordingly, they provided
approximation algorithms for influence maximization with a provable
performance guarantee. As the algorithm proposed in [95] is computationally
intensive, some approaches were proposed to address the scalability, such as the
work in [30, 121, 175]. W. Chen et al. proposed a ‘degree discount’ heuristic method
to improve the original greedy algorithm in [30]. M. Mathioudakis et al. proposed
a pre-processing method to accelerate the processes of exposing influential nodes
in networks without compromising the accuracy [121]. Y. Wang et al. considered
the identification of influential nodes in mobile networks [175]. They proposed a
two-step method: in the first step, social communities are detected, and in the
second step, a subset of communities is selected to identify the influential nodes.
Their method was empirically shown to be faster than the greedy algorithm in [95].
In addition, S. Bhagat et al. introduced a new information propagation model to
identify the influential nodes [18]. They argued that nodes spread information to
their neighbors even if they decide not to accept the information themselves.
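As one concrete instance of the scalable heuristics mentioned above, the 'degree discount' idea from [30] can be sketched as follows. The toy graph and the uniform propagation probability p are illustrative assumptions, not data from the book:

```python
# Sketch of the 'degree discount' heuristic [30]: pick k seeds one at a time,
# discounting a node's degree as its neighbors get selected as seeds.
# Graph and propagation probability p are illustrative assumptions.

def degree_discount(adj, k, p=0.1):
    """adj: {node: set(neighbors)}; greedily return k seed nodes."""
    d = {v: len(adj[v]) for v in adj}   # plain degree
    t = {v: 0 for v in adj}             # neighbors already chosen as seeds
    dd = dict(d)                        # discounted degree
    seeds = []
    for _ in range(k):
        u = max((v for v in adj if v not in seeds), key=lambda v: dd[v])
        seeds.append(u)
        for w in adj[u]:
            if w not in seeds:
                t[w] += 1
                dd[w] = d[w] - 2 * t[w] - (d[w] - t[w]) * t[w] * p
    return seeds

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3, 5}, 5: {4}}
seeds = degree_discount(adj, 2)   # the second pick avoids node 1's neighborhood
```

The discount makes the heuristic avoid piling seeds into one neighborhood, which is why it can beat a plain top-degree selection at the same cost.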
There are also many methods in class (2). A common view is that the preferable
nodes for blocking the propagation of information are the highly-connected users
[8, 57, 207] or those with the most active neighbors [192]. Indeed, large-degree nodes
in a scale-free network, with their intuitively short paths to other nodes in a strongly
clustered small world [48], greatly facilitate the propagation of an infection over
the network, particularly at its early stage. However, this viewpoint was challenged
by other work in [96, 183, 185], as the authors found that large-degree nodes may
not be the best positions for blocking the propagation of information. Recent
investigation on real datasets has also confirmed this phenomenon [13]. M. Kitsak et al.
formally showed that the betweenness measure will expose the most suitable nodes
to ‘control’ the network [96]. The methods in this class also include the traditional
betweenness and classic centrality measures.
The epidemic betweenness measure combines the considerations of these two
classes together. Each node can be a starting node or an intermediary in epidemics.
Since an epidemic incident can start from any node in the real world, the epidemic
betweenness measure will be more practical for exposing the influential nodes in
complex networks. Moreover, the epidemic betweenness can be calculated rapidly
through iterations.

3.7 Summary

In this chapter, we study the influence of users in malicious attack propagation. In
particular, we restrict our attention to epidemic influence, and study betweenness
measures and their application to capturing influential nodes. We have proposed a
novel betweenness measure based on epidemics to compute the influence of nodes
in complex networks. This measure is based on a mathematical model and can be
calculated rapidly even when the scale of the network is large.
We further show that the influences of nodes estimated by the proposed epidemic
betweenness measure are more accurate than those measured by the traditional
betweenness and other classic centrality measures. This new centrality measure is
of great significance to both academia and industries.
Chapter 4
Restrain Malicious Attack Propagation

Restraining the propagation of malicious attacks in complex networks has long been
an important but difficult problem to address. In this chapter, we particularly
use rumor propagation as an example to analyze the methods of restraining
malicious attack propagation. There are mainly two types of methods: (1) blocking
rumors at the most influential users or community bridges, and (2) spreading truths
to clarify the rumors. We first compare all the measures of locating influential
users. The results suggest that the degree and betweenness measures outperform
all the others in real-world networks. Secondly, we analyze the truth-clarification
method, and find that it has a long-term performance advantage, whereas the degree
measure performs well only in the early stage. Thirdly, in order to leverage these
two methods, we further explore the strategy of different methods working together
and their equivalence. Given a fixed budget in the real world, our analysis provides
a potential solution for finding a better strategy by integrating both kinds of
methods together.

4.1 Introduction

The popularity of online social networks (OSNs) such as Facebook [171], Google
Plus [74] and Twitter [102] has greatly increased in recent years. OSNs have
become important platforms for the dissemination of news, ideas, opinions, etc.
Unfortunately, the OSN is a double-edged sword. The openness of OSN platforms also
enables rumors, gossip and other forms of disinformation to spread all around
the Internet. In the real world, rumors have caused great damage to our society.
For example, the rumor “Two explosions in White House and Obama is injured”
posted on April 23, 2013 led to 10 billion USD in losses before the rumor was
clarified [143].


Currently, there are mainly two kinds of strategies used for restraining rumors in
OSNs, including blocking rumors at important users [41, 49, 83, 96, 122, 129, 190,
193, 206] and clarifying rumors by spreading truths [25, 69, 71, 107, 165, 173]. We
can further categorize the first strategy into two groups according to their measures
in identifying the most important users: the most influential users [34, 70, 80, 96,
110, 159, 185] and the community bridges [31, 33, 106, 137–139, 174].
Every kind of strategy has pros and cons, and each method claims the best
performance among all the others according to its own considerations and
environments. However, there must be one that stands out from the rest. Because there
does not exist a universal standard to evaluate them all together, the question of
which method is the best has long been important but difficult to answer.
Accordingly, previous work mainly focused on the ‘vertical’ comparison (methods
inside their own category), such as the work in [96, 110], but not on the ‘horizontal’
comparison (methods from different categories). All these methods are proposed to
restrain the spread of rumors in OSNs.
To numerically evaluate different methods, we introduce a mathematical model
to present the spread of rumors and truths. This is a discrete model, so the most
important nodes can be easily located in the modeling. We can thus implement different
strategies on this mathematical platform in order to evaluate their impacts on the
spread of rumors and truths. Through a series of empirical and theoretical analyses
using real OSNs, we are able to disclose the answer to this unsolved question.
In the real world, blocking rumors at important users may incur criticism since
it risks violating human rights. On the other hand, the probability of people
believing the truths varies according to many social factors. Therefore, it is very
important to find the optimal strategy for restraining rumors, which possibly
should integrate both strategies together. The discussion on which method is the best
is a small but important step towards this goal. Thus, we are further motivated to
explore the numerical relation and equivalence between different methods. Wen
et al. [184] systematically analyzed different strategies for restraining rumors.

4.2 Methods of Restraining Rumors

Scientists have proposed many methods in order to restrain the propagation of
rumors, such as controlling influential users, controlling bridges of social commu-
nities and clarifying the rumors by spreading the truths. The taxonomy of these
methods is shown in Fig. 4.1.

Fig. 4.1 The taxonomy of the methods used to restrain the spread of malicious attacks

4.2.1 Controlling Influential Users

The most common and popular method is to monitor a group of influential users and
block their outward communication when rumors are detected on them. According
to the way they choose the influential users, we categorize current methods into three
types: degree, betweenness and core.
Degree The most direct and intuitive method is to control the popular OSN
users. In social graphs, these users correspond to the nodes with large degrees.
The theoretical basis of these methods is that the scale-free and power-law
properties of the Internet mean that a few highly-connected nodes play a
vital role in maintaining the network’s connectivity [136, 147]. We illustrate this
method in Fig. 4.2a. We can see that when enough popular users are controlled
in OSNs, the spread of rumors will be limited to a small branch of the whole
topology.
Betweenness Researchers have found that some nodes which do not have large
degrees in topologies also play a vital role in the dissemination of social
information. As shown in Fig. 4.2b, the degree of node E is smaller than those of
nodes A, B, C and D. However, node E is noticeably more important to the spread
of rumors as it is the connector of two large groups of users. To locate this
kind of node in OSNs, scientists introduced the measure of betweenness, which
stands for the number of shortest paths passing through a given node [67]. There
are also other variants of betweenness, such as the random-walk (RW) betweenness
[132]. The work in [34, 70, 80, 110, 185] argued that controlling the nodes with
higher betweenness values is more efficient than controlling those with higher
degrees.

Fig. 4.2 Restraining the rumors by controlling the influential nodes. (a): the influential nodes are
those of large degree; (b): the influential nodes are those of large betweenness; (c): the influential
nodes are those in the innermost core

Core In this case, the network topologies are decomposed using k-shell
analysis. Some researchers have found that the most efficient rumor spreaders
are those located within the core of the OSNs as identified by the decomposition
analysis [96, 159]. We illustrate this viewpoint in Fig. 4.2c. We can see that
the nodes in the innermost component of the network may have smaller
degrees, but they constitute the kernel of the network and build the
connectivity between the outer components. Thus, the nodes in the core are
more crucial for restraining the rumors in OSNs.
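Two of the selection rules above can be sketched on a toy undirected graph; the k-shell routine below is the standard peeling procedure, and the graph itself is an illustrative assumption:

```python
# Sketch: top-degree selection and k-shell decomposition on a toy graph
# (the graph is an illustrative assumption, not one of the book's datasets).

def k_shell(adj):
    """Return {node: shell index} by repeatedly peeling minimum-degree nodes."""
    deg = {v: len(adj[v]) for v in adj}
    alive = set(adj)
    shell = {}
    k = 0
    while alive:
        k = max(k, min(deg[v] for v in alive))
        peel = [v for v in alive if deg[v] <= k]
        while peel:
            v = peel.pop()
            shell[v] = k
            alive.discard(v)
            for w in adj[v]:
                if w in alive:
                    deg[w] -= 1
                    if deg[w] <= k and w not in peel:
                        peel.append(w)
    return shell

# Triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
top_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:1]
shells = k_shell(adj)
```

Here the triangle forms the 2-shell (the innermost core) while the pendant node falls into the 1-shell, matching the intuition of Fig. 4.2c that core membership is not the same thing as degree.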

4.2.2 Controlling Community Bridges

Most real OSNs typically contain parts in which the nodes are more highly
connected to each other than to the rest of the network. The sets of such nodes
are usually called communities in OSNs. The existing methods used to identify
communities are mainly of two types: finding overlapped communities [138, 139]
and finding separated communities [31, 33, 106, 137, 174].
Overlapped Every OSN user in the real world has numerous roles. For example,
a user who is a student belongs to a schoolmate community; this user may also
belong to the communities of a family and various hobby groups. Therefore,
most actual OSNs are made of highly overlapping cohesive groups of users
[138, 140]. The nodes located in more than one community are the bridges
between communities. The bridges forward information from one community
to another. If we control the bridges and block the spread of rumors on them,
the scale of the rumor propagation will be limited to the local community. We
illustrate this kind of method [138, 139] in Fig. 4.3a.

Fig. 4.3 Restraining the rumors by controlling the bridges between communities. (a): communi-
ties are overlapped; (b): communities are separated

Separated Some researchers [31, 33, 106, 137, 174] extract social rela-
tionship graphs by partitioning the topologies of OSNs into numerous separated
communities. The premise of these methods is that users are more likely to
receive and forward information from their social friends. Thus, these separated
communities are representative of the most likely propagation paths of the rumors
and the truths. In contrast to the overlapped case, here the bridges are
the nodes which have outward connections to the nodes of other communities.
As shown in Fig. 4.3b, when the bridges between separated communities are
controlled, the spread of rumors will also be limited to a small scale.
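Assuming a node-to-community map is already available from a partitioning tool, the bridges of separated communities can be picked out mechanically. A minimal sketch; the graph and partition below are illustrative:

```python
# Sketch: given separated communities as a node -> community map (assumed to
# come from a partitioning tool), the bridges are the nodes with at least one
# edge into a different community. Graph and partition are illustrative.

def bridges(adj, community):
    return {v for v in adj
            if any(community[w] != community[v] for w in adj[v])}

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
community = {1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B'}
cut = bridges(adj, community)   # nodes 3 and 4 connect the two communities
```

Controlling just these few nodes severs the only inter-community paths, which is exactly the effect illustrated in Fig. 4.3b.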

4.2.3 Clarification Through Spreading Truths

Apart from banning the outward communication of those influential users or the
community bridges, people can adopt the strategy of spreading truths [25, 69, 71,
107, 165, 173] to the public in order to eliminate the critical rumors. As shown in
Fig. 4.4, the scale of the rumors’ propagation will be restrained after the truths start
to spread. In the real world, this strategy respects the freedom of speech, but its
efficiency is highly related to the credibility of the truth origins. If the origins of the
truths have high prestige among the masses, people will readily accept the truths
when both the rumors and the truths are received. Otherwise, people make decisions
using the “minority is subordinate to majority” rule. We will model and elaborate
the processes of people making choices in the following section.

Fig. 4.4 Restraining the rumors by spreading truth in OSNs

4.3 Propagation Modeling Primer

We build up in this section the mathematical model in order to analyze the spread
of rumors and investigate the methods of restraining their propagation.

4.3.1 Modeling Nodes, Topology and Social Factors

In the real world, people may believe rumors, believe truths, or have not heard
any information from the OSN. Let random variable Xi(t) represent the state of
user i at discrete time t. We borrow the concepts from epidemics and derive the
values of Xi(t) as follows

    Xi(t) ∈ {Sus. (susceptible), Def. (defended), Act. (active), Rec. (recovered),
             Imm. (immunized), Con. (contagious), Inf. (infected), Mis. (misled)}    (4.1)

Firstly, every user is presumed to be susceptible (Xi(t) = Sus.) at the beginning.
If a user is proactively controlled and will block the rumors, the node of this user
is in the Def. state. An arbitrary user i believes the rumor if Xi(t) = Inf. or the
truth if Xi(t) = Rec. Secondly, users will seldom forward the same rumor or truth
message multiple times to ‘persuade’ their social friends into accepting
what they have believed. Thus, we assume OSN users distribute the rumor or the

Fig. 4.5 The state transition graph of a node in the topology

truth only once, at the time when they get infected (Xi(t) = Con.) or recovered
(Xi(t) = Act.). After that, they will stop spreading the rumor (Xi(t) = Mis.) or the
truth (Xi(t) = Imm.). Thirdly, the origins of the true news in the real world usually
have high prestige among the masses. Thus, an infected user can be recovered and
will not be infected again; the user stays immunized after he or she trusts
the truth. We provide the state transition graph for an arbitrary user in Fig. 4.5. We
can see that most users will finally believe the truth, as the Imm. state is an absorbing
state.
The nodes and the topology are the basic elements for the propagation of OSN
rumors and truths. Given an OSN, we derive the topology of it. A node in the
topology denotes a user in the OSN. Here, we propose employing an m × m square
matrix with elements ⟨η^R_ij, η^T_ij⟩ (η^R_ij, η^T_ij ∈ [0, 1]) to describe the topology of an
OSN with m nodes, as in

    ⎡ ⟨η^R_11, η^T_11⟩   · · ·   ⟨η^R_1m, η^T_1m⟩ ⎤
    ⎢         ⋮         ⟨η^R_ij, η^T_ij⟩         ⋮        ⎥
    ⎣ ⟨η^R_m1, η^T_m1⟩   · · ·   ⟨η^R_mm, η^T_mm⟩ ⎦

where η^R_ij and η^T_ij denote the probability of rumors and truths spreading from user
i to user j, respectively. If user i has contact with user j, we have η^R_ij > 0 and
η^T_ij > 0; otherwise, η^R_ij = 0 and η^T_ij = 0.

4.3.2 Modeling Propagation Dynamics

We introduce a widely approved discrete model [9, 29, 109, 185, 186, 195] to present
the propagation of rumors and truths in OSNs. The discrete model can locate each
influential node and evaluate its impact on the spread. Given a topology of an OSN
with m nodes, we can estimate the number of susceptible and recovered users at
time t, S(t) and R(t), as in

    S(t) = Σ_{i=1}^{m} P(Xi(t) = Sus.)
                                                        (4.2)
    R(t) = Σ_{i=1}^{m} P(Xi(t) = Rec.)

where P(·) denotes the probability of an event. Similarly, the number of defended
nodes at time t, D(t), is derived by computing Σ_{i=1}^{m} P(Xi(t) = Def.). Then, we
can obtain the number of infected nodes at time t, I(t), as in

    I(t) = m − S(t) − R(t) − D(t).                      (4.3)

As shown in Fig. 4.5, a susceptible user may accept the rumor, and the node
enters the Inf. state. An infected node may also be recovered if this user accepts
the truth. We use v(i, t) and r(i, t) to denote the probability of user i being
infected or recovered, respectively. Then, the values of P(Xi(t) = Sus.), P(Xi(t) = Rec.)
and P(Xi(t) = Def.) can be iterated using the discrete difference equations

    P(Xi(t) = Sus.) = [1 − v(i, t) − r(i, t)] · P(Xi(t − 1) = Sus.)                     (4.4)

    P(Xi(t) = Rec.) = r(i, t) · [1 − P(Xi(t − 1) = Rec.)] + P(Xi(t − 1) = Rec.)         (4.5)

    P(Xi(t) = Def.) = [1 − r(i, t)] · P(Xi(t − 1) = Def.)                               (4.6)

We introduce Neg(i, t) and Pos(i, t) as the probabilities that user i is not convinced
of the rumor and of the truth, respectively. Since the rumor and the truth come from
social neighbors, the values of Neg(i, t) and Pos(i, t) can be derived by assuming all
social neighbors fail to convince user i of the rumor or the truth. Then, according to
the principle of multiplication, we have

    Neg(i, t) = Π_{j∈Ni} [1 − η^R_ji · P(Xj(t − 1) = Con.)]
                                                                (4.7)
    Pos(i, t) = Π_{j∈Ni} [1 − η^T_ji · P(Xj(t − 1) = Act.)]

where Ni denotes the set of user i’s neighbors. We assume the states of nodes in
the topology are independent. Then, according to the state transitions in Fig. 4.5, the
values of P (Xi (t) = Con.) and P (Xi (t) = Act.) can be derived as in

P (Xi (t) = Con.) = P (Xi (t − 1) = Sus.) · v(i, t) (4.8)

P (Xi (t) = Act.) = [1 − P (Xi (t − 1) = Rec.)] · r(i, t) (4.9)

From the above equations, we adopt discrete time to model the propagation
dynamics. Note that the length of each time tick relies on the real environment.
It can be 1 min, 1 h or 1 day.

4.3.3 Modeling People Making Choices

According to the ways people believe rumors and truths, we derive different values
of v(i, t) and r(i, t). In particular, we summarize two major cases on the basis of our
analysis of the real world.
Absolute Belief In this case, we optimistically assume OSN users absolutely
believe the truths unless they only receive rumors. Then, we can derive the values
of v(i, t) and r(i, t) as in

    v(i, t) = [1 − Neg(i, t)] · Pos(i, t)
                                                 (4.10)
    r(i, t) = 1 − Pos(i, t)

In the real world, this case generally happens when the origins of true news have
high prestige among the masses. For example, when the rumor “two explosions
in White House and Barack Obama is injured” spread fast on Twitter [143], the White
House, as an origin which has absolute credibility among most people, swiftly
stopped the rumor by clarifying and spreading the truth “Obama is fine and no
explosion happened”.
Minority is Subordinate to Majority In this case, people do not absolutely trust
the origins of the truths. They believe either the rumor or the truth according to
the ratio of believers among their OSN friends. We can estimate the number of
received rumor and truth copies, CR(i, t) and CT(i, t), for each user i as in

    CR(i, t) = Σ_{j∈Ni} η^R_ji · P(Xj(t − 1) = Con.)
                                                        (4.11)
    CT(i, t) = Σ_{j∈Ni} η^T_ji · P(Xj(t − 1) = Act.)

Then, we derive the values of v(i, t) and r(i, t) as in

    v(i, t) = [1 − Neg(i, t) · Pos(i, t)] · CR(i, t) / [CR(i, t) + CT(i, t)]
                                                        (4.12)
    r(i, t) = [1 − Neg(i, t) · Pos(i, t)] · CT(i, t) / [CR(i, t) + CT(i, t)]

where the value of Neg(i, t) · Pos(i, t) is the probability of a user refuting both
kinds of information. In the real world, “minority is subordinate to majority” (M-
S-M) is the more general case: when more friends choose to accept one kind of
information, the probability of the user believing this kind of information is larger
than the probability of choosing the opposite one.
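The dynamics of Eqs. (4.4)–(4.10) can be sketched as a synchronous update loop. The line topology, the uniform probabilities and the truth-injection tick below are illustrative assumptions, and only the absolute-belief rule is implemented:

```python
# Sketch of the discrete dynamics of Eqs. (4.4)-(4.10) under the 'absolute
# belief' rule. Topology, probabilities and injection time are illustrative.

def step(eta_r, eta_t, nbrs, P):
    """One synchronous tick; P maps state name -> per-node probability list."""
    m = len(nbrs)
    new = {s: list(P[s]) for s in P}
    for i in range(m):
        neg = pos = 1.0
        for j in nbrs[i]:                      # Eq. (4.7)
            neg *= 1.0 - eta_r[j][i] * P['Con'][j]
            pos *= 1.0 - eta_t[j][i] * P['Act'][j]
        v = (1.0 - neg) * pos                  # Eq. (4.10): infected only if
        r = 1.0 - pos                          # no truth arrives; truth always wins
        new['Con'][i] = P['Sus'][i] * v                        # Eq. (4.8)
        new['Act'][i] = (1.0 - P['Rec'][i]) * r                # Eq. (4.9)
        new['Sus'][i] = (1.0 - v - r) * P['Sus'][i]            # Eq. (4.4)
        new['Rec'][i] = r * (1.0 - P['Rec'][i]) + P['Rec'][i]  # Eq. (4.5)
    return new

m, p = 3, 0.75
nbrs = [[1], [0, 2], [1]]                      # line topology 0-1-2
eta_r = [[p] * m for _ in range(m)]
eta_t = [[p] * m for _ in range(m)]
# rumor starts at node 0
P = {'Sus': [0.0, 1.0, 1.0], 'Con': [1.0, 0.0, 0.0],
     'Act': [0.0, 0.0, 0.0], 'Rec': [0.0, 0.0, 0.0]}
for t in range(1, 6):
    if t == 3:                                 # inject the truth at node 2
        P['Sus'][2], P['Act'][2], P['Rec'][2] = 0.0, 1.0, 1.0
    P = step(eta_r, eta_t, nbrs, P)
S = sum(P['Sus'])                              # Eq. (4.2)
```

Under the M-S-M rule, Eqs. (4.11)–(4.12) would replace the absolute-belief v and r inside the loop; the update structure stays the same.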

4.3.4 The Accuracy of the Modelling

Before we carry out analysis using the mathematical model, we set up simulations
to validate its correctness. The experiment topologies are two real OSNs: Facebook
[171] and Google Plus [74]. The simulations are implemented on the basis of existing
simulation work [192]. We mainly focus on the critical rumors (η^R_ij > 0.5). Thus,
we set the propagation probabilities as η^R_ij = η^T_ij = 0.75. The spread of rumors
starts at t = 0. Since the truths start to propagate after many users have believed the
rumors, we set the truth injection time, t_infect, as t_infect = 3. The implementation
is in C++ and Matlab 2012b.

Fig. 4.6 The accuracy evaluation of the modelling compared with simulations
We show the validation results in Fig. 4.6. We can see that the modelling results
are quite accurate compared with the simulations. In Eq. (4.7), we assume the states
of nodes in the topology are independent. This independence assumption has been
widely used in this field, such as in the works [29, 109, 185]. However, this assumption
may cause errors in the modeling; readers can find extensive analysis in the
works [186, 195]. In fact, the errors are largely compensated when the modelling results
of conflicting information mutually subtract each other. Here, we simply adopt this
assumption as we mainly focus on the comparison of different defense methods.

4.4 Block Rumors at Important Users

In this section, we analyze the proactive measures in order to find out the most
efficient one for blocking rumors. The degree measure can be directly derived from
the OSN topology. The betweenness measure is worked out using the standard
algorithm [132]. We also implement the k-shell decomposition algorithm [26] to
identify the core of OSNs. To locate community bridges, we use CFinder [28] to
identify the overlapped communities and NetMiner [130] for the separated ones.
We focus on the Facebook network [171] in this section.

4.4.1 Empirical Studies

We first work out all proactive measures and show the sorted results of influential
nodes in Fig. 4.7. For the degree measure (Fig. 4.7a), we can see that the node
degrees follow a power law [147]. This means the nodes with large degrees
are rare in the topology but make a significant contribution to the OSN connectivity.
Similar results can also be observed for the betweenness measure (Fig. 4.7b). For
the core measure (Fig. 4.7c), we can see that the innermost part finally reduces to
a quite small group of nodes in the network.

Fig. 4.7 The sorted results of the influential nodes in the Facebook topology

Fig. 4.8 The sorted results of community bridges
The results of network communities are shown in Fig. 4.8. For the separated
communities (Fig. 4.8a), we find several large communities dominate the majority
of nodes in the network. In Fig. 4.8b, we set k = 5 (refer to CFinder [28]) and obtain
similar results for the overlapped communities.
From the empirical perspective, we examine which proactive measure is
more efficient. We use λ to denote the defense ratio of nodes in OSNs, and λ ranges
from 1% to 30%. We mainly focus on critical rumors in this chapter (E(η^R_ij) >
0.5). To be typical, we set E(η^R_ij) = E(η^T_ij) = 0.6 or 0.9. In the real world, since
critical rumors often originate from the most popular users, we let the rumors in
the modelling spread from a node with a large degree. The results of the rumor
spreading scale are shown in Fig. 4.9.
Observation 1 If we set the defense ratio (λ) close to 30%, the degree and
betweenness measures will almost stop the spread of rumors. This result is in
accordance with the percolation ratio used to stop viruses in the Email network [207].
However, real OSNs generally have large scales; blocking rumors at 30% of the
users in an OSN is too many to be realized in the real world.

Fig. 4.9 The final steady amount of infected nodes when we apply proactive measures with
different defense ratios

Fig. 4.10 The propagation dynamics of rumors when we carry out defense according to different
proactive measures
Observation 2 The betweenness and degree measures outperform all the other
measures, and the betweenness measure performs much better than the degree
measure if λ ≤ 20%. This result is in accordance with the work in [110, 185].
Figure 4.9 has presented the final amount of infected users for a rumor spreading
in the network. We further investigate the propagation dynamics under those measures
(typically setting λ = 10% or 20%). The results are shown in Fig. 4.10.
Observation 3 The degree measure performs better than the betweenness measure
in the early stage. The degree and betweenness measures outperform all the others
over the whole spreading procedure. However, different from Observation 2, the
degree measure has a better short-term efficiency than the betweenness measure.
The degree measure is also suggested by the work in [5].

4.4.2 Theoretical Studies

In this subsection, we carry out mathematical analysis in order to theoretically
justify the empirical results. To numerically evaluate different measures, we first
introduce a new concept, the contagious ability.
Definition 4.1 (Contagious Ability) Given an OSN and an incident of rumor
spreading in this network, the contagious ability of an arbitrary node i, Ai, is defined
as the number of subsequent nodes which can be directly or indirectly infected by
node i after this node is infected.
An arbitrary user i may possibly get infected at any time in the rumor propagation
dynamics. We use A^t_i to denote the contagious ability of node i if the user of this
node gets infected at time t. On the basis of our mathematical model, we can then
estimate the overall contagious ability of an arbitrary node i as in

    E(Ai) = Σ_{t=0}^{∞} P(Xi(t) = Con.) · E(A^t_i)          (4.13)

OSN users receive and send rumors from and to their neighboring users. We use
A^t_ij to denote the potential contagious ability caused by the rumor spread from node
i to node j at time t. We also introduce P^t_ij to denote the potential contagious
probability of node j contributed by node i at time t. The mean value of A^t_i can then
be recursively worked out as in

    E(A^t_i) = Σ_{j∈Ni} [ E(A^{t+1}_ij) + P^{t+1}_ij ]          (4.14)

We can further compute E(A_{ij}^{t+1}) and P_{ij}^{t+1} as in

E(A_{ij}^{t+1}) = δ_{ij}^t · E(A_j^{t+1}),
P_{ij}^{t+1} = δ_{ij}^t · P(X_j(t+1) = Con.)    (4.15)

where δ_{ij}^t denotes the ratio of node i's contribution to the infection of node j at time
t among all the father nodes of node j, and we have

δ_{ij}^t = [ P(X_i(t) = Con.) · η_{ij}^R ] / [ Σ_{k∈N_j} P(X_k(t) = Con.) · η_{kj}^R ]    (4.16)

As shown in Fig. 4.5, the Imm. state is an absorbing state. Given an OSN with a finite
number of users, we can predict that the spread of rumors finally becomes steady
and the values of A_i^t and P(X_i(t) = Con.) converge to zero as t → ∞. As a
result, the contagious ability of each node in OSNs can be recursively and reversely
worked out by setting a large final time of the spread.
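The backward recursion in Eqs. (4.13)–(4.16) can be sketched as follows. This is a minimal illustration on a made-up three-node directed path, not the authors' implementation; the infection probabilities `p_con` and spreading probabilities `eta` are toy values rather than outputs of the actual propagation model.

```python
# Backward recursion for the contagious ability, Eqs. (4.13)-(4.16).
# p_con[t][i] plays the role of P(X_i(t) = Con.); eta[(i, j)] of eta^R_ij.
T_final = 3                       # time by which the spread is assumed steady
n = 3                             # toy directed path: node 0 -> 1 -> 2
eta = {(0, 1): 0.6, (1, 2): 0.5}
p_con = [[1.0, 0.0, 0.0],         # t = 0: only node 0 is contagious
         [0.5, 0.6, 0.0],
         [0.2, 0.4, 0.3],
         [0.0, 0.0, 0.0]]         # converged to zero (Imm. is absorbing)

def fathers(j):
    return [k for (k, jj) in eta if jj == j]

def children(i):
    return [jj for (ii, jj) in eta if ii == i]

def delta(i, j, t):
    """Eq. (4.16): node i's share of the infection of node j at time t."""
    denom = sum(p_con[t][k] * eta[(k, j)] for k in fathers(j))
    return 0.0 if denom == 0 else p_con[t][i] * eta[(i, j)] / denom

# Eqs. (4.14)-(4.15): E(A_i^t), computed backwards from A_i^{T_final} = 0
A = [[0.0] * n for _ in range(T_final + 1)]
for t in range(T_final - 1, -1, -1):
    for i in range(n):
        for j in children(i):
            d = delta(i, j, t)
            A[t][i] += d * A[t + 1][j]        # E(A_ij^{t+1})
            A[t][i] += d * p_con[t + 1][j]    # P_ij^{t+1}

# Eq. (4.13): overall contagious ability of each node
E_A = [sum(p_con[t][i] * A[t][i] for t in range(T_final + 1)) for i in range(n)]
```

On this toy input the node closest to the origin accumulates the largest contagious ability, matching the intuition behind Observation 2.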
We further calculate the contagious time in order to numerically evaluate the
temporal efficiency of those measures against the spread of rumors.
Definition 4.2 Given an OSN and an incident of rumor spreading in this network,
the contagious time of an arbitrary node i, T_i, is defined as the mean time at which
node i gets infected in the whole propagation.
Conceptually, the contagious time of node i, T_i, can be easily computed as in

T_i = [ Σ_{t=0}^{∞} P(X_i(t) = Con.) · t ] / [ Σ_{t=0}^{∞} P(X_i(t) = Con.) ]    (4.17)
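Equation (4.17) is simply a probability-weighted mean of the infection times. A minimal sketch with a made-up probability trace for one node:

```python
# Eq. (4.17): contagious time of node i as the mean infection time,
# weighted by P(X_i(t) = Con.). The trace below is a toy example.
p_con_i = [0.0, 0.5, 0.3, 0.1, 0.0]   # P(X_i(t) = Con.) for t = 0..4

T_i = sum(p * t for t, p in enumerate(p_con_i)) / sum(p_con_i)
```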

Among the three observations, we mainly focus on Observations 2 and 3,
since Observation 1 is practically infeasible in real OSNs. Moreover, previous
work [207] has proved that the connection ratio and the link remaining ratio almost
reach zero if we remove the top 30% of the most connected nodes from the OSN
topologies. In this situation, the rumors certainly cannot spread out.
Justification 1 (Observation 2) The contagious ability, A_i, denotes the potential
number of following nodes infected by node i. Thus, a node with stronger
contagious ability is conceptually more worthwhile for blocking rumors in OSNs.
We sort the nodes according to their contagious abilities and choose the result as
a benchmark. With different values of λ, we work out the intersection between
the benchmark and the sorted nodes of the various proactive measures. The results
are shown in Fig. 4.11. We can see that the betweenness and degree measures
capture more nodes with high contagious abilities. This may be the reason why
the betweenness measure performs best and the degree measure second best.

Fig. 4.11 The intersection ratio between the sorted nodes of contagious ability and various proactive measures

Fig. 4.12 The average contagious time of the degree and betweenness measures when λ < 10%

Justification 2 (Observation 3) Let a rumor spread in the network; we then
calculate the contagious time of each node in order to justify the superior short-
term performance of the degree measure. Given a defense ratio λ, let ω be the
set of nodes chosen for blocking rumors. We can use (1/|ω|) Σ_{i∈ω} T_i to estimate the
average contagious time among the nodes in ω. The results are shown in Fig. 4.12. We
can see that the average contagious time of the nodes chosen by the degree measure is
much smaller than that of the nodes chosen by the betweenness measure. This means the nodes
with large degrees will be infected earlier. Thus, if we use the nodes chosen by the
degree measure to block rumors, the spread will be restrained faster in the short term
compared with using the nodes chosen by the betweenness measure.

4.5 Clarify Rumors Using Truths

In this section, we analyze the remedial measure using the mathematical model.
There are mainly two factors, t_inject and E(η_{ij}^T), which greatly affect the
efficiency of restraining rumors by spreading truths.

4.5.1 Impact of the Truth Injection Time

To exclusively investigate the impact of t_inject, we typically set E(η_{ij}^R) = E(η_{ij}^T) =
0.75. Based on the spreading dynamics shown in Fig. 4.10, we assign t_inject as
• truth starts with rumor,
• truth starts in the early stage of rumor spread,
• truth starts in the late stage of rumor spread.

Fig. 4.13 The number of infected users by varying the truth injection time. Setting: E(η_{ij}^R) = 0.75

Fig. 4.14 The number of the contagious and the active nodes at any time t in the propagation. Setting: E(η_{ij}^R) = 0.75

The experiments are executed on both the Facebook and Google Plus networks, and
with both the cases of people making absolute choices and making M-S-M
choices. The results are shown in Fig. 4.13.
Observation 4 The truth clarification method performs better if the spread of truths
starts earlier; otherwise, it performs weakly in the early stage, since
the rumors spread incredibly fast. We can see that the propagation scale
decreases dramatically after we inject the truth into the network. Both the spread of
rumors and the spread of truths will finally become steady. The results in Fig. 4.13 indicate that
the remedial measure of spreading truths mainly provides long-term effectiveness
in restraining rumors.
We further investigate the number of the contagious nodes (Σ_{i=1}^{m} P(X_i(t) =
Con.)) and the active nodes (Σ_{i=1}^{m} P(X_i(t) = Act.)) at any time t during the spread.
The results are shown in Fig. 4.14. We can see from Figs. 4.14(A1) and 4.14(C1) that
t_inject has some effect on restraining the number of contagious nodes when people
make absolute choices. However, in Figs. 4.14(B1) and 4.14(D1), we find t_inject
has no obvious effect when people make M-S-M choices. Moreover, we can see
from Fig. 4.14(A2–D2) that the number of active nodes varies according
to the value of t_inject. The results of Fig. 4.14, covering both the contagious
and the active nodes in the propagation dynamics, explain well the impact
of t_inject observed in Fig. 4.13.

4.5.2 Impact of the Truth Propagation Probability

To exclusively examine the impact of the truth propagation probability E(η_{ij}^T), we
typically set t_inject = 3 and E(η_{ij}^R) = 0.6. The value of E(η_{ij}^T) will be set as

• 0.3: people are not willing to believe the truth,
• 0.6: people fairly believe the truth,
• 0.9: people most likely believe the truth.
Both the Facebook and Google Plus networks will be used in the experiments.
Similarly, the cases of people making absolute choices or M-S-M choices will also
be considered. The results are shown in Fig. 4.15.
Observation 5 The efficiency of restraining rumors using the remedial measure
largely decreases when people are not willing to spread the truths. In accordance
with the reality, we find E(η_{ij}^T) has an extraordinary impact on restraining rumors by
spreading truths in OSNs. We additionally examine the number of active nodes
(Σ_{i=1}^{m} P(X_i(t) = Act.)) at any time t during the spread dynamics. As shown in
Fig. 4.16, a smaller value of E(η_{ij}^T) leads to a smaller number of active nodes.
This exactly corresponds to the limited efficiency of the remedial measure shown in
Fig. 4.15.
Given a critical rumor spreading in the network (E(η_{ij}^R) > 0.5), we can summarize
two real cases according to the value of E(η_{ij}^T) as follows:

Fig. 4.15 The number of infected nodes and recovered nodes with different values of E(η_{ij}^T). Setting: t_inject = 3, E(η_{ij}^R) = 0.75

Fig. 4.16 The number of active nodes with different values of E(η_{ij}^T). Setting: t_inject = 3, E(η_{ij}^R) = 0.6

E(η_{ij}^T) > 0.5 In the real world, through propaganda or other measures,
people may become willing to believe and spread truths. According to the previous
analysis, the truth holder can achieve acceptable or even better results by
spreading truths to restrain rumors when E(η_{ij}^T) > 0.5.
E(η_{ij}^T) < 0.5 According to the results of Fig. 4.15, the remedial measure may not
be able to counter the spread of rumors in this case. Actually, this is a common
phenomenon in the real world.

4.6 A Hybrid Measure of Restraining Rumors

In this section, we investigate the pros and cons when different measures work
together. We also explore the equivalence of these measures.
To numerically evaluate the effectiveness of these measures, we use the maximal
number of infected users (I_max) and the final number of infected users (I_final) to
represent the damage caused by rumors. In the real world, when either I_max or I_final
becomes larger, more damage is caused to society.

4.6.1 Measures Working Together

Firstly, we examine the values of I_max and I_final on the basis of the mathematical
model. We typically set t_inject = 3 and let E(η_{ij}^T) range from 0.1 to 0.9. The results
are shown in Fig. 4.17. We can see that the values of I_max always stay large, while
the values of I_final gradually decrease with increasing E(η_{ij}^T). This indicates that the
remedial measure cannot alleviate the damage denoted by I_max. On the contrary, the
proactive measures are able to reduce I_max.
Secondly, the spread of rumors and truths actually presents a common issue in
the psychology field when E(η_{ij}^T) < 0.5 < E(η_{ij}^R): "rumor has wings
while truth always stays indoors", since people naturally have a 'negativity bias' towards the

Fig. 4.17 The maximum number of infected users (I_max), the final number of infected users (I_final) and the final number of recovered users (R_final). Settings: t_inject = 3, E(η_{ij}^R) = 0.75, E(η_{ij}^T) ∈ [0.1, 0.9]

Fig. 4.18 A case study of measures working together. Settings: t_inject = 3, E(η_{ij}^R) = 0.75

received information [120]. According to Observation 5, the remedial measure
cannot largely reduce the value of I_final when E(η_{ij}^T) < 0.5 < E(η_{ij}^R). We notice in
Observation 4 that the remedial measure only has a long-term performance, while
Observation 3 shows that the degree measure has the best short-term performance.
To address the specific case of "rumor has wings while truth always stays indoors",
we propose to put the eggs in different baskets: both the degree measure and
the truth clarification method are used for restraining rumors in OSNs. As an
example, we set E(η_{ij}^R) = 0.9 and t_inject = 3 for a case study. Besides, E(η_{ij}^T) and
λ are assigned as: (1) λ = 10%, E(η_{ij}^T) = 0: proactive measure only; (2) λ = 0,
E(η_{ij}^T) = 0.6: remedial measure only; (3) λ = 5%, E(η_{ij}^T) = 0.3: the two methods together.
The results are shown in Fig. 4.18. We find that if we set λ = 5%, E(η_{ij}^T) = 0.3,
both I_max and I_final decrease compared with the other two extreme settings,
each of which can only reduce either I_max or I_final.

4.6.2 Equivalence of Measures

In the real world, surveillance of influential users requires substantial financial support.
The propaganda used to promote the spread of truths also costs much money. Given
a limited budget, we explore the equivalence between the proactive and remedial
measures in order to leverage these two different strategies.
Firstly, we investigate I_final when we apply different defense ratios (λ) and
values of E(η_{ij}^T) to the propagation of rumors and truths. On the basis of our
mathematical model, this part of the analysis discloses the congruent relationship
between the values of λ and E(η_{ij}^T) in networks. Typically, we set t_inject = 3,
E(η_{ij}^R) = 0.75 and use the Facebook and Google Plus topologies. The results are
shown in Fig. 4.19. Given a pair of λ and E(η_{ij}^T), we can find several equivalent
solutions with different values of λ and E(η_{ij}^T). These different solutions have the
same performance as the original pair. This means we can leverage
the proactive and remedial measures according to a fixed budget.


Secondly, we further examine the numeric equivalence in the Facebook and
Google Plus networks, again considering people making absolute and M-S-M
choices. Following the settings of Fig. 4.19, we provide the results in Fig. 4.20.
We find that the numeric equivalence exists in most cases. On the basis of the results in
Fig. 4.20, we are able to identify the exact scheme that replaces the original pair of λ
and E(η_{ij}^T). This part of the analysis and its results are of great significance from the
practical point of view.

Fig. 4.19 The final number of infected nodes (I_final) when we set a series of different defense ratios (λ) and truth spreading probabilities E(η_{ij}^T). Setting: t_inject = 3, E(η_{ij}^R) = 0.75

Fig. 4.20 The numeric equivalence between the degree measure and the remedial measure when we set a series of different defense ratios (λ) and truth spreading probabilities E(η_{ij}^T). Setting: t_inject = 3, E(η_{ij}^R) = 0.75

4.7 Summary

In this section, we first discuss the robustness of the contagious ability. Then, we
discuss the fairness to the community bridges when we evaluate the efficiency of
restraining rumors. We finally summarize the work in this chapter.

4.7.1 The Robustness of the Contagious Ability

In this subsection, we first discuss the robustness of the contagious ability. According
to the definition of contagious ability, its usage relies on the rumor spreading origins.
However, it can be directly used for the numeric evaluation of other measures when the
spread of rumors originates from highly connected nodes. To confirm the robustness
of this usage, we examine the average degree of contagious nodes, D_t, at each time
tick t, as in

D_t = Σ_{i=0}^{m} [ P(X_i(t) = Con.) / Σ_{j=0}^{m} P(X_j(t) = Con.) ] · d_i    (4.18)

wherein d_i is the degree of node i. In the experiments, we randomly choose the
rumor origins and average the values of D_t at each time tick t over 100 runs.
The results are shown in Fig. 4.21. It is clear that D_t stays high at the beginning and
then sharply decreases till the end of the spread. This means the nodes with higher
degrees are more likely to be infected in the early stage. Actually, this feature may
be caused by the power-law and scale-free properties of OSNs [56]. As a result,
the contagious ability based on randomly chosen origins will not largely deviate
from that based on identical highly connected origins. This explains the
robustness of the usage of the contagious ability.
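The weighting in Eq. (4.18) can be sketched as follows; the degrees and the per-node contagious probabilities are toy values chosen only for illustration.

```python
# Eq. (4.18): average degree of contagious nodes at one time tick, where each
# node's degree d_i is weighted by its share of the total contagious probability.
deg     = [5, 2, 1]          # d_i (toy values)
p_con_t = [0.4, 0.4, 0.2]    # P(X_i(t) = Con.) at time t (toy values)

total = sum(p_con_t)
D_t = sum(p / total * d for p, d in zip(p_con_t, deg))
```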

Fig. 4.21 The average degree of nodes that are being infected at each time tick

4.7.2 The Fairness to the Community Bridges

In the real world, people form various communities according to their interests,
occupations and social relationships. They are more likely to contact others within
the same communities. Thus, it would be more precise to consider this premise in
our analysis. However, the algorithms (CFinder [28] and NetMiner [130]) do not
consider the communication bias between community members. This may cause
some unfairness to the community bridges when we evaluate the rumor restraining
efficiency.
In fact, the spread of information in a community environment is a more complex
process. We plan to incorporate the communication bias in communities from the
records of real OSNs. This may help us evaluate the efficiency of different measures
more accurately. Due to the page limit, we leave this part to our future
work.
In summary, we carry out a series of analyses on the methods of restraining
rumors. On the basis of our mathematical model, the analysis results suggest that
the degree and betweenness measures outperform all the other proactive measures.
In addition, we observe that the degree measure has better short-term performance
in the early stage. We also investigate the efficiency of spreading truths in order
to restrain the rumors. We find the truth clarification method mainly has a long-
term performance. In order to address the critical case of "rumor has wings while
truth always stays indoors", we further explore the strategies of different measures
working together and the equivalence that leverages both of them. From both the
academic and practical perspectives, our work is of great significance to the work
in this field.
Part II
Source Identification of Malicious
Attack Propagation
Chapter 5
Preliminary of Identifying Propagation
Sources

This chapter provides some preliminary knowledge about identifying the propagation
sources of malicious attacks. We first introduce different types of observations of
the propagation of malicious attacks. Then, we present the maximum-likelihood
estimation method adopted by many approaches in this research area. We finally
introduce the evaluation metrics for source identification.

5.1 Observations on Malicious Attack Propagation

One of the major premises in propagation source identification is the observation
of node states during the propagation process of malicious attacks. Diverse
observations lead to a great variety of methods. According to the literature, there
are three main categories of observations: complete observations, snapshots, and
sensor observations. An illustration of these three categories of observations is
shown in Fig. 5.1. It is clear that the snapshot and sensor observations provide much
less information for identifying propagation sources compared with the complete
observation.
Complete Observation Given a time t during the propagation, a complete obser-
vation presents the exact state of each node in the network at time t. The state
of a node indicates whether the node has been infected, has recovered, or remains
susceptible. This type of observation provides comprehensive knowledge of
a transient status of the network, so source identification techniques are
supplied with sufficient knowledge. An example of a complete observation is
shown in Fig. 5.1a.
Snapshot Observation A snapshot provides partial knowledge of network status
at a given time t. Partial knowledge is presented in four forms: (1) nodes reveal
their infection status with probability μ; (2) we recognize all infected nodes,

© Springer Nature Switzerland AG 2019
J. Jiang et al., Malicious Attack Propagation and Source Identification, Advances in Information Security 73, https://doi.org/10.1007/978-3-030-02179-5_5


Fig. 5.1 Illustration of three categories of observation in networks. (a) Complete observation; (b)
Snapshot; (c) Sensor observation

but cannot distinguish susceptible or recovered nodes; (3) only a set of nodes
were observed at time t when the snapshot was taken; (4) only the nodes who
were infected exactly at time t were observed. An example of the 4-th type of
snapshots is shown in Fig. 5.1b.
Sensor Observation Sensors are first injected into networks, and then the
propagation dynamics over these sensor nodes are collected, including their
states, state transition times and infection directions. In fact, sensors also
represent users or computers in networks. The difference between sensors and other
nodes in networks is that sensors are usually monitored by network administrators
in practice. Therefore, the sensors can record all details of the malicious
attack propagation over them, and their lifetime can theoretically be assumed to
last throughout the propagation dynamics. This is different from mobile
sensor devices, which may stop working when their batteries run out. As an
example, we show a sensor observation in Fig. 5.1c.
The initial methods for propagation source identification, such as Rumor Center [160]
and Dynamic Age [58], require a complete observation of the network
status. Later, researchers proposed source identification methods, such as Jordan
Center and concentricity-based methods, for partial observations like snapshots.
Researchers have also explored source identification methods that inject sensors
into the underlying network and identify the propagation sources based on the
observations of the sensors. Accordingly, current source identification methods can
be categorized into three classes in accordance with these three different types of
observations, which we will introduce in the following chapters.

5.2 Maximum-Likelihood Estimation

Maximum-likelihood estimation (MLE) [37] is a method of estimating the parameters
θ of a statistical model M, given the independent observed data X =
{x_1, x_2, . . . , x_n}. Let us assume the probability of observing x_i in the model M
given parameters θ is f(x_i|θ). Then the likelihood of having parameters θ equals
the probability of observing X given θ:

L(θ|X) = f(X|θ) = f(x_1|θ) f(x_2|θ) · · · f(x_n|θ) = Π_{i=1}^{n} f(x_i|θ).    (5.1)

Taking the logarithm gives

log L(θ|X) = Σ_{i=1}^{n} log f(x_i|θ).    (5.2)

The ML estimate is then

θ̂ = arg max_θ log L(θ|X) = arg max_θ Σ_{i=1}^{n} log f(x_i|θ).    (5.3)

To find the optimal parameter θ̂, which best describes the observed data given
the model and thus yields the largest log-likelihood value, we can solve
Eq. (5.3) analytically or computationally search for the best solution in the parameter space. This
book mainly adopts MLE to estimate the probability of a node being a candidate
propagation source.
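As a concrete illustration (ours, not from the original text), the following sketch estimates a Bernoulli parameter by maximising the log-likelihood of Eq. (5.2) over a grid of candidate values, mirroring the computational search mentioned above. The observed data are made up.

```python
import math

# MLE sketch for Eqs. (5.1)-(5.3): estimate the Bernoulli parameter theta
# from i.i.d. observations by maximising the log-likelihood (toy data).
X = [1, 0, 1, 1, 0, 1, 1, 1]           # observed coin flips

def log_likelihood(theta, data):
    # log L(theta | X) = sum_i log f(x_i | theta), Eq. (5.2)
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

# computational search over the parameter space, Eq. (5.3)
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=lambda th: log_likelihood(th, X))
```

For Bernoulli data the closed-form MLE is the sample mean (here 6/8 = 0.75), which the grid search recovers.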

5.3 Efficiency Measures

• Accuracy. The accuracy of a single realization is a_i = 1/|V_top| if s* ∈ V_top,
and a_i = 0 otherwise, where s* is the true source and V_top is the group of nodes
with the highest score (top scorers). The total accuracy a is the average of a_i over many
realizations, therefore a ∈ [0, 1]. This measure takes into account the fact that
there might be more than one node with the highest score (ties are possible).
• Rank. The rank is the position of the true source on the node list sorted in
descending order by score. In other words, this measure shows how many
nodes an algorithm considers better candidates for the source than the true
source. If the real source has exactly the same score as some other node (or
nodes), the true source is always placed below that node (those nodes) on the sorted
score list. The rank takes into account the fact that an algorithm
which is very poor at pointing out the source exactly (low accuracy) can be very
good at pointing out a small group of nodes among which the source lies.
• Distance error. The distance error is the number of hops (edges) between the
true source and the node designated as the source by an algorithm. If |V_top| > 1,
which means that an algorithm found more than one candidate for the source,
the distance error is computed as the mean shortest-path distance between the real
source and the top scorers.
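The three measures above can be sketched for a single realization as follows. The scores and the five-node path graph are made-up inputs; the tie-breaking in the rank follows the pessimistic convention stated above.

```python
# Sketch of the three efficiency measures (accuracy, rank, distance error)
# for one realization. Scores and the toy 5-node path graph are made up.
scores = {0: 0.1, 1: 0.7, 2: 0.7, 3: 0.2, 4: 0.1}   # algorithm output
true_source = 2
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]             # path graph

top = max(scores.values())
V_top = [v for v, s in scores.items() if s == top]
accuracy = 1.0 / len(V_top) if true_source in V_top else 0.0

# rank: ties are pessimistic -- the true source goes below equal scorers
rank = 1 + sum(1 for v, s in scores.items()
               if v != true_source and s >= scores[true_source])

def hops(a, b):
    """Shortest-path distance via BFS on the toy graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    frontier, dist = [a], {a: 0}
    while frontier:
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    nxt.append(w)
        frontier = nxt
    return dist[b]

# distance error: mean shortest-path distance from the source to top scorers
distance_error = sum(hops(true_source, v) for v in V_top) / len(V_top)
```

Here two nodes tie for the top score and one of them is the true source, so the accuracy is 1/2, the rank is 2, and the distance error is the mean of 1 and 0 hops.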
Chapter 6
Source Identification Under Complete
Observations: A Maximum Likelihood
(ML) Source Estimator

In this chapter, we introduce a propagation source estimator under complete
observations: a maximum likelihood source estimator (Rumor Center). According
to Chap. 5, a complete observation presents the exact state of each node in
the network at a certain time t. This type of observation provides comprehensive
knowledge of a transient status of the network. Initial research on propagation
source identification focused on complete observations, with methods such as Rumor Center,
Dynamic Age, Minimum Description Length, etc. Among these methods, Rumor
Center is widely used, and many variations have been proposed based on this
method, such as Local Rumor Center, Multiple Rumor Center, etc. Here, we present
the details of the Rumor Center estimator. For the techniques involved in other
methods under complete observations, readers may refer to Chap. 9 for details.

6.1 Introduction

In the modern world, the ubiquity of networks has made us vulnerable to various
types of malicious attacks. These malicious attacks arise in many different contexts,
but share a common structure: an isolated risk is amplified because it is spread by the
network. For example, as we have witnessed, computer viruses utilize the Internet
to infect millions of computers every day. Malicious rumors or misinformation can
rapidly spread through existing social networks and have pernicious effects on
individuals or society. In the recent financial crisis, the strong dependencies, or
'network', between institutions led to a situation where the failure of one
institution caused global instabilities.
In essence, all of these situations can be modeled as a rumor spreading through
a network, where the goal is to find the source of the rumor in order to control
and prevent these network risks based on limited information about the network
structure and the “rumor infected” nodes. The answer to this problem has many

© Springer Nature Switzerland AG 2019
J. Jiang et al., Malicious Attack Propagation and Source Identification, Advances in Information Security 73, https://doi.org/10.1007/978-3-030-02179-5_6

important applications and can help us answer the following questions: Who is the
rumor source in online social networks? Which computer was the first one infected by
a computer virus? And where is the source of an epidemic?
The initial work addressing the problem of identifying the rumor propagation
source primarily focused on complete observations of the networks under
malicious attack. Shah and Zaman [160–162] were the first to provide a systematic
study of the problem. They model rumor spreading in a network with the popular
Susceptible-Infected (SI) model and then construct an estimator for the rumor
source. The estimator is based on a novel topological quantity, called rumor
centrality. They established a maximum likelihood (ML) estimator for several classes of
graphs: regular trees, general trees, and general graphs. They find that
on tree graphs the rumor center and the distance center are equivalent, but on general
graphs they may differ.

6.2 The SI Model for Information Propagation

Suppose a network of nodes is modeled as an undirected graph G(V, E), where V
is the set of nodes and E is the set of edges of the form (i, j) for nodes i and j in V.
Suppose the propagation is initiated by a single node v*, the rumor source. The
susceptible-infected (SI) model is used for modelling rumor propagation. In the SI
model, once a node has the rumor, it keeps it forever, and it can spread the rumor to
its neighboring nodes.
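The SI dynamics described above can be sketched in a few lines. This is our toy illustration, not code from the book; it follows the discrete view used later in Sect. 6.4, where each uninfected neighbour of the infected subgraph is equally likely to be the next infected node.

```python
import random

def si_spread(adj, source, n_infected, rng):
    """Infect nodes one at a time; every uninfected neighbour of the
    currently infected set is equally likely to be infected next."""
    infected = {source}
    while len(infected) < n_infected:
        boundary = sorted({v for u in infected for v in adj[u]
                           if v not in infected})
        infected.add(rng.choice(boundary))
    return infected

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}   # toy tree
G4 = si_spread(adj, source=0, n_infected=3, rng=random.Random(7))
```

By construction the infected set is always a connected subgraph containing the source, which is exactly the property the estimator below relies on.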

6.3 Rumor Source Estimator: Maximum Likelihood (ML)

Suppose that the rumor starting at node v ∗ at time 0 has spread in the network G.
We observe the network at some time and find N infected nodes. Then, these nodes
must form a connected subgraph of G. The subgraph is denoted as GN . Based on the
observation GN and the knowledge of G, the maximum likelihood (ML) estimator
v̂ of v ∗ minimizes the error probability. By definition, the ML estimator is

v̂ ∈ arg max_{v∈G_N} P(G_N|v),    (6.1)

where P (GN |v) is the probability of observing GN under the SI model assuming v
is the source, v ∗ . Thus, we need to calculate P (GN |v) for all v ∈ GN and then treat
the one with the maximal value as the rumor source.

6.4 Rumor Source Estimator: ML for Regular Trees

Note that the calculation of P(G_N|v) is not computationally tractable in general. Shah and
Zaman [160] first evaluate P(G_N|v) on a tree graph. Essentially, they need to find
the probability of all possible events that result in G_N after N nodes are infected,
starting with v as the source under the SI model. For example, in Fig. 6.1, suppose
node 1 was the source, i.e., we need to calculate P(G_4|1). Then there are two
disjoint events, or node orders in which the rumor spreads, that will lead to G_4 with 1
as the source: {1, 2, 3, 4} and {1, 2, 4, 3}. In general, to evaluate P(G_N|v), we need
to find all such permitted permutations and their corresponding probabilities. The
permitted permutations are defined as follows.
Definition 6.1 (Permitted Permutation) Given a connected tree G(V, E) and a
source node v ∈ V, consider any permutation σ: V → {1, 2, · · · , |V|} of its
nodes, where σ(u) denotes the position of node u in the permutation σ. σ is called a
permitted permutation for tree G(V, E) with source node v if
1. σ(v) = 1.
2. For any (u, u′) ∈ E, if d(v, u) < d(v, u′), then σ(u) < σ(u′). Here, d(v, u)
denotes the shortest path distance from v to u.
Let Ω(v, G_N) be the set of all permitted permutations starting with node v and
resulting in the rumor graph G_N. The next step is to determine the probability P(σ|v)
for each σ ∈ Ω(v, G_N). Let σ = {v_1 = v, v_2, · · · , v_N}, and define G_k(σ) as the
subgraph of G_N containing nodes {v_1 = v, v_2, · · · , v_k} for 1 ≤ k ≤ N. Then,

P(σ|v) = Π_{k=2}^{N} P(k-th infected node = v_k | G_{k−1}(σ), v).    (6.2)

Each term in the product on the right-hand side of Eq. (6.2) can be evaluated as
follows. Given G_{k−1}(σ) and source v, the next infected node can be any of the
neighbors of the nodes in G_{k−1}(σ) that are not yet infected. If G_{k−1}(σ) has
n_{k−1}(σ) uninfected neighboring nodes, then each one of them is equally likely to be
the next infected node, with probability 1/n_{k−1}(σ). Therefore, Eq. (6.2) reduces to

P(σ|v) = Π_{k=2}^{N} 1/n_{k−1}(σ).    (6.3)

Fig. 6.1 Example network where the rumor graph has four nodes

Given Eq. (6.3), the problem of computing P(σ|v) now becomes evaluating the size
of the rumor boundary n_{k−1}(σ) for 2 ≤ k ≤ N. Suppose the k-th node added
to G_{k−1}(σ) is v_k(σ) with degree d_k(σ). Then it contributes d_k(σ) − 2 new edges
(and hence nodes in the tree) to the rumor boundary. This is because d_k(σ) − 1 new
edges are added, while the edge along which the recent infection happened is
removed from the boundary. That is, n_k(σ) = n_{k−1}(σ) + d_k(σ) − 2. Subsequently,

n_k(σ) = d_1(σ) + Σ_{i=2}^{k} (d_i(σ) − 2).    (6.4)

Therefore,

P(σ|v) = Π_{k=2}^{N} 1 / [ d_1(σ) + Σ_{i=2}^{k−1} (d_i(σ) − 2) ].    (6.5)

For a d-regular tree, since all nodes have the same degree d, it follows from
Eq. (6.5) that every permitted permutation σ has the same probability, independent
of the source. Specifically, for any source v and permitted permutation σ,

P(σ|v) = Π_{k=1}^{N−1} [dk − 2(k − 1)]^{−1} ≡ p(d, N).    (6.6)

From the above, it follows immediately that for a d-regular tree, for any G_N and
candidate source v, P(G_N|v) is proportional to |Ω(v, G_N)|. Formally, they use
R(v, G_N) to denote the number of distinct permitted permutations |Ω(v, G_N)|.
Definition 6.2 Given a graph G(V, E) and a vertex v of G, R(v, G_N) is defined as
the total number of distinct permitted permutations of the nodes of G that begin with
node v ∈ G and respect the graph structure of G.
In summary, the ML estimator for a regular tree becomes

v̂ ∈ arg max_{v∈G_N} P(G_N|v) = arg max_{v∈G_N} R(v, G_N).    (6.7)
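For trees, R(v, G_N) need not be enumerated permutation by permutation: Shah and Zaman show it equals N! divided by the product of the subtree sizes of the tree rooted at v. The sketch below is ours; the four-node adjacency is our assumption about the tree of the Fig. 6.1 example (edges 1–2, 2–3, 2–4, consistent with the two spreading orders listed earlier).

```python
from math import factorial

def rumor_count(adj, v):
    """R(v, G_N) for a tree: N! over the product of subtree sizes when the
    tree is rooted at v (closed form of the permitted-permutation count)."""
    order, parent = [v], {v: None}
    for u in order:                       # iterative BFS to root the tree at v
        for w in adj[u]:
            if w != parent[u]:
                parent[w] = u
                order.append(w)
    size = {u: 1 for u in order}
    for u in reversed(order[1:]):         # subtree sizes, leaves upwards
        size[parent[u]] += size[u]
    prod = 1
    for u in order:
        prod *= size[u]
    return factorial(len(order)) // prod  # always an exact integer for trees

adj = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}   # assumed tree of Fig. 6.1
```

Here `rumor_count(adj, 1)` recovers exactly the two permitted permutations {1, 2, 3, 4} and {1, 2, 4, 3} counted earlier for source 1.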

6.5 Rumor Source Estimator: ML for General Trees

As Eq. (6.7) suggests, the ML estimator for a regular tree can be obtained by simply
evaluating R(v, G_N) for all v. However, as indicated by Eq. (6.5), this is not the
case for a general tree with heterogeneous degrees. To form an ML estimator for
a general tree, one needs to keep track of the probability of every permitted
permutation. This is computationally expensive due to the exponential number of
terms involved.

Fig. 6.2 Example network where rumor centrality with the BFS heuristic equals the likelihood P(G_N|v). The rumor infected nodes are in gray and labeled with numbers

Note that the likelihood of a node is the sum of the probabilities of every permitted
permutation for which it is the source. In general, these will have different values,
but it may be that a majority of them have a common value. To obtain this common
value, Shah and Zaman [160] assume the nodes receive the rumor in a breadth-first
search (BFS) order. For example, consider the network in Fig. 6.2. If node 2 is the
source, then a BFS sequence of nodes would be {2, 1, 3, 4, 5}, and the probability of
this permitted permutation is given by Eq. (6.5).
Suppose σ_v* is the BFS permitted permutation with node v as the source; the
rumor source estimator then becomes

v̂ ∈ arg max_{v∈G_N} P(σ_v*|v) R(v, G_N).    (6.8)

6.6 Rumor Source Estimator: ML for General Graphs

The ML estimator for a general graph can, in principle, be computed by following a
similar approach to that for general trees. Specifically, it corresponds to computing
the summation of the likelihoods of all possible permitted permutations given the
network structure.
Note that in a general graph the rumor spreads along a spanning tree of the
observed graph corresponding to the first time each node receives the rumor.
Therefore, an approximation for computing the likelihood P(GN|v) is as follows.
First, suppose the spanning tree involved in the rumor spreading is known. Then, the
previously developed tree estimator in Eq. (6.8) can be applied on the spanning tree.
However, the spanning tree involved in the rumor spreading is generally unknown.
Shah and Zaman [160] circumvented the issue of not knowing the underlying
spanning tree as follows. They assume that if node v ∈ GN was the source, then
the rumor spreads along a breadth first search (BFS) tree rooted at v, Tbfs (v). The
intuition is that if v was the source, then the BFS tree would correspond to the
fastest spread of the rumor. Therefore, the rumor source estimator for general graph
GN becomes:

v̂ ∈ arg max_{v∈GN} P(σv∗|v) R(v, Tbfs(v)). (6.9)

Consider a simple example as shown in Fig. 6.3, where the BFS trees for each node are
shown. Using the expression for R(v, Tbfs(v)) from Eq. (6.12), the general graph estimator values
for the nodes are

P(σ1∗|1)R(1, Tbfs(1)) = 1/(4 · 6 · 8 · 10) · 5!/20,   P(σ2∗|2)R(2, Tbfs(2)) = 1/(4 · 6 · 8 · 10) · 5!/30,

P(σ3∗|3)R(3, Tbfs(3)) = 1/(4 · 6 · 8 · 10) · 5!/20,   P(σ4∗|4)R(4, Tbfs(4)) = 1/(4 · 6 · 8 · 10) · 5!/10,

P(σ5∗|5)R(5, Tbfs(5)) = 1/(4 · 6 · 8 · 10) · 5!/40.
Node 4 maximizes this value and would be the estimate of the rumor source.

6.7 Rumor Centrality

Note from Eqs. (6.7), (6.8), and (6.9), R(v, GN ) plays an important role in each of
the rumor source estimators. Recall that R(v, GN ) counts the number of distinct
ways a rumor can spread in the network GN starting from source v. Shah and
Zaman [160] called this number, R(v, GN), the rumor centrality of the node
v with respect to GN. The node with maximum rumor centrality is called the rumor
center of the network. This section introduces the approach proposed by Shah and

Fig. 6.3 Example network


with a BFS tree for each node
shown. The gray nodes are
infected

Zaman [160] to calculate R(v, GN ), and also presents an important property of


rumor centrality.

6.7.1 Rumor Centrality: Succinct Representation

Let GN be a tree graph. Define Tuv as the number of nodes in the subtree rooted at
node u, with node v as the source. Figure 6.4 illustrates this notation in a simple
example. Here, T21 = 3 because there are 3 nodes in the subtree with node 2 as the
root and node 1 as the source. Similarly, T71 = 1 because there is only 1 node in the
subtree with node 7 as the root and node 1 as the source.
To calculate R(v, GN ) is to count the number of permitted permutations of N
nodes of GN . There are N slots in a given permitted permutation, and the first of
which must be the source node v. Note that, a node u must come before all the
nodes in its subtree Tuv . Given a slot assignment for all nodes in Tuv subject to this
constraint, there are R(u, Tuv ) different ways in which these nodes can be ordered.
This suggests a natural recursive relation between the rumor centrality R(v, GN)
and the rumor centralities of its immediate children's subtrees R(u, Tuv) with u ∈
child(v). Here child(v) represents the set of all children of v in tree GN assuming v
as its root. Specifically, there is no constraint between the orderings of the nodes of
different subtrees Tuv with u ∈ child(v). This leads to

R(v, GN) = (N − 1)! ∏_{u∈child(v)} R(u, Tuv)/Tuv! (6.10)

If we expand the recursion in Eq. (6.10) to the next level of depth in GN, we obtain

R(v, GN) = (N − 1)! ∏_{u∈child(v)} R(u, Tuv)/Tuv!

= (N − 1)! ∏_{u∈child(v)} [(Tuv − 1)!/Tuv!] ∏_{w∈child(u)} R(w, Twv)/Twv! (6.11)

= (N − 1)! ∏_{u∈child(v)} (1/Tuv) ∏_{w∈child(u)} R(w, Twv)/Twv!.

Fig. 6.4 Illustration of the subtree variable Tuv

Fig. 6.5 Example network for calculating rumor centrality

A leaf node l will have 1 node and 1 permitted permutation, so R(l, Tlv ) = 1. If we
continue this recursion until we reach the leaves of the tree, then we find that the
number of permitted permutations for a given tree GN rooted at v is

R(v, GN) = (N − 1)! ∏_{u∈GN\v} 1/Tuv = N! ∏_{u∈GN} 1/Tuv. (6.12)

Thus, Eq. (6.12) gives the simple expression for rumor centrality in a tree graph. As
an example of the use of rumor centrality, consider the network in Fig. 6.5. Using the
rumor centrality formula in Eq. (6.12), we find that the rumor centrality of node 1 is

R(1, G) = 5!/(5 · 3) = 8. (6.13)
Indeed, there are 8 permitted permutations of this network with node 1 as the source:

{1, 3, 2, 4, 5}, {1, 2, 3, 4, 5}, {1, 2, 4, 3, 5}, {1, 2, 4, 5, 3},

{1, 3, 2, 5, 4}, {1, 2, 3, 5, 4}, {1, 2, 5, 3, 4}, {1, 2, 5, 4, 3}.
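The closed form in Eq. (6.12) is easy to check numerically. The sketch below is a minimal Python illustration; the adjacency list encodes the 5-node tree of Fig. 6.5 as implied by the listed permutations (node 1 adjacent to 2 and 3, node 2 adjacent to 4 and 5). It computes R(v, GN) from the subtree sizes Tuv and verifies the result against a brute-force enumeration of permitted permutations.

```python
from itertools import permutations
from math import factorial

# Tree implied by the listed permutations: 1-2, 1-3, 2-4, 2-5.
adj = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}

def rumor_centrality(adj, v):
    """Eq. (6.12): R(v, G) = N! * prod over u of 1/T_u^v."""
    size = {}
    def dfs(u, parent):
        size[u] = 1 + sum(dfs(w, u) for w in adj[u] if w != parent)
        return size[u]
    dfs(v, None)
    r = factorial(len(adj))
    for t in size.values():
        r //= t
    return r

def permitted_count(adj, v):
    """Brute force: orderings starting at v where every node is adjacent
    to some already-placed node (i.e., the spread respects the tree)."""
    others = [u for u in adj if u != v]
    count = 0
    for perm in permutations(others):
        seen = {v}
        ok = True
        for u in perm:
            if not any(w in seen for w in adj[u]):
                ok = False
                break
            seen.add(u)
        count += ok
    return count

print(rumor_centrality(adj, 1))   # 8, matching Eq. (6.13)
print(permitted_count(adj, 1))    # 8, matching the enumeration above
```

Both computations agree with the eight permutations listed in the text.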

6.7.2 Rumor Centrality Versus Distance Centrality

Shah and Zaman [160] further compared rumor centrality with distance centrality.
Distance centrality has become popular in the literature as a graph based score
function for various other applications. For a graph G, the distance centrality of
node v ∈ G, D(v, G), is defined as

D(v, G) = Σ_{j∈G} d(v, j), (6.14)

where d(v, j ) is the shortest path distance from node v to node j . The distance
center of a graph is the node with the smallest distance centrality. Intuitively, it is

the node closest to all other nodes. Shah and Zaman [160] proved that the distance
center is equivalent to the rumor center on a tree graph in the following theorem.
Theorem 6.1 On an N-node tree, if vD is the distance center, then, for all v ≠ vD,

TvvD ≤ N/2. (6.15)

Furthermore, if there is a unique rumor center on the tree, then it is equivalent to
the distance center.
Readers could refer to [160] for the details of the proof.
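As a quick numerical check of Theorem 6.1, the sketch below computes D(v, G) of Eq. (6.14) by breadth-first search on the 5-node tree of Fig. 6.5 (adjacency list reconstructed from the listed permitted permutations) and compares the distance center with the rumor center.

```python
from collections import deque
from math import factorial

adj = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}

def distances_from(adj, v):
    """BFS shortest-path distances from v to every node."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def distance_centrality(adj, v):
    return sum(distances_from(adj, v).values())   # Eq. (6.14)

def rumor_centrality(adj, v):
    size = {}
    def dfs(u, parent):
        size[u] = 1 + sum(dfs(w, u) for w in adj[u] if w != parent)
        return size[u]
    dfs(v, None)
    r = factorial(len(adj))
    for t in size.values():
        r //= t
    return r

d_center = min(adj, key=lambda v: distance_centrality(adj, v))
r_center = max(adj, key=lambda v: rumor_centrality(adj, v))
print(d_center, r_center)   # both are node 2, consistent with Theorem 6.1
```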
For a general network, the rumor center is defined as the node with the maximal
value of rumor centrality on its BFS tree. More precisely, the rumor center of a
general graph is the node v̂ with the following property:

v̂ ∈ arg max_{v∈GN} R(v, Tbfs(v)). (6.16)

Shah and Zaman [160] showed that the rumor center is not always equivalent to the
distance center in a general graph.
Extensive simulations have been performed on both synthetic networks (a small-
world network and a scale-free network) and real-world networks (the Internet
autonomous system (AS) network [1] and the U.S. electric power grid network [16]).
The results show that the rumor center estimator either finds the source exactly or
within a few hops of the true source across different network topologies.
Chapter 7
Source Identification Under Snapshots:
A Sample Path Based Source Estimator

In this chapter, we introduce a propagation source estimator under snapshot


observations: a sample path based source estimator (Jordan Center). According
to Chap. 5, a snapshot provides partial knowledge of network status at a given
time t. Many approaches have been proposed to identify propagation sources
under snapshot observations, including Jordan Center, Dynamic Message Passing,
effective distance based method, etc. Within these methods, Jordan center is a
representative one and many variations and improvements have been made based
on this method. Here, we present the details of the Jordan Center estimator. For the
techniques involved in other methods under snapshot observations, readers could
refer to Chap. 9 for details.

7.1 Introduction

Malicious attack propagation in networks refers to the process by which an attack
spreads throughout a network. It has been widely used to model many real-world
phenomena, such as the spreading of rumors over online social networks and the
spreading of computer viruses over the Internet. The problem of identifying the propagation
source has attracted much attention. For example, Shah and Zaman [160–162] proposed
Rumor Center to identify rumor propagation sources. They formulated the problem as
a maximum likelihood estimation (MLE) problem, and developed novel algorithms to
detect the source. However, these methods assume complete observation of the network,
which is not available in many real-world scenarios. Zhu and Ying [200] considered the
following problem: given a snapshot of the diffusion process at time t, can we tell
which node is the source of the propagation?
Zhu and Ying [200] adopted the Susceptible-Infected-Recovered (SIR) model, a
standard model of epidemics [12, 47]. The network is assumed to be an undirected


graph and each node in the network has three possible states: susceptible (S),
infected (I ), and recovered (R). Nodes in state S can be infected and change to
state I , and nodes in state I can recover and change to state R. Recovered nodes
cannot be infected again. Initially, all nodes are assumed to be in the susceptible
state except one infected node. The infected node is the propagation source of the
malicious attack. The source then infects its neighbors, and the attack starts to spread
in the network. Now, given a snapshot of the network in which some nodes are
infected and others are healthy (susceptible or recovered) nodes, the susceptible
and recovered nodes are assumed to be indistinguishable. Zhu and Ying [200]
proposed a low-complexity algorithm, called the reverse infection algorithm, based on
finding the sample path based estimator in the underlying network. In the algorithm,
each infected node broadcasts its identity in the network, and the node that first collects
the identities of all infected nodes declares itself the information source. They proved
that the estimated source node is the node with the minimum infection eccentricity.
Since a node with the minimum eccentricity in a graph is called the Jordan center,
they call the nodes with the minimum infection eccentricity the Jordan infection
centers.

7.2 The SIR Model for Information Propagation

Consider an undirected graph G(V , E), where V is the set of nodes and E is the
set of undirected edges. Each node v ∈ V has three possible states: susceptible (S),
infected (I), and recovered (R). Zhu and Ying [200] assumed a time-slotted system.
Nodes change their states at the beginning of each time slot, and the state of node
v at time t is denoted by Xv(t). Initially, all nodes are in state S except the source
node v∗, which is in state I. At the beginning of each time slot, each infected node
infects each of its susceptible neighbors with probability q, and each infected node recovers
with probability p. Once a node recovers, it cannot be infected again. The
infection process can then be modeled as a discrete-time Markov chain X(t), where
X(t) = {Xv(t), v ∈ V} is the states of all the nodes at time t. The initial state of this
Markov chain is Xv(0) = S for v ≠ v∗ and Xv∗(0) = I.
Suppose at time t, we observe Y = {Yv, v ∈ V} such that

Yv = 1 if v is in state I, and Yv = 0 if v is in state S or R. (7.1)

The propagation source identification problem is to identify v ∗ given the graph


G and the observation Y, where t is an unknown parameter. Figure 7.1 gives an
example of the attack propagation process. The left figure shows the information
propagation over time. The nodes on each dotted line are the nodes which are
infected at that time slot, and the arrows indicate where the infection comes from.
The right figure is the network we observe, where the red nodes are infected ones

Fig. 7.1 An example of information propagation

and the others are susceptible or recovered nodes. The pair of numbers next to each
node is the corresponding infection time and recovery time. For example, node 3
was infected at time slot 2 and recovered at time slot 3. A value of -1 indicates that the infection
or recovery has not yet occurred. Note that these two pieces of information are generally
not available.
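The time-slotted SIR dynamics and the snapshot observation of Eq. (7.1) can be sketched as follows. The path graph, the seed node, and the parameter values are illustrative choices, not values from the text.

```python
import random

def simulate_sir(adj, source, q, p, t, rng):
    """One realization of the time-slotted SIR process started at `source`.
    Returns the snapshot Y of Eq. (7.1): 1 for infected, 0 for S or R."""
    state = {v: 'S' for v in adj}
    state[source] = 'I'
    for _ in range(t):
        infected = [v for v in adj if state[v] == 'I']
        # each currently infected node infects each susceptible neighbor w.p. q ...
        for u in infected:
            for w in adj[u]:
                if state[w] == 'S' and rng.random() < q:
                    state[w] = 'I'
        # ... and then recovers w.p. p (newly infected nodes act from the next slot)
        for u in infected:
            if rng.random() < p:
                state[u] = 'R'
    return {v: 1 if state[v] == 'I' else 0 for v in adj}

# Path graph 1-2-3-4-5 with deterministic infection (q=1) and no recovery (p=0):
# after one slot, exactly the source and its neighbors are infected.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
Y = simulate_sir(path, source=3, q=1.0, p=0.0, t=1, rng=random.Random(0))
print(Y)   # {1: 0, 2: 1, 3: 1, 4: 1, 5: 0}
```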

7.3 Maximum Likelihood Detection

Suppose X[0, t] = {X(τ): 0 ≤ τ ≤ t} is a sample path of the infection process from
0 to t, and define the function F(·) such that

F(Xv(t)) = 1 if Xv(t) = I, and F(Xv(t)) = 0 otherwise. (7.2)

Then, F(X(t)) = Y if F(Xv(t)) = Yv for all v. Identifying the propagation source
can be formulated as a maximum likelihood detection problem as follows:

v̂ ∈ arg max_{v∈V} Σ_{X[0,t]: F(X(t))=Y} Pr(X[0, t]|v∗ = v), (7.3)

where Pr(X[0, t]|v ∗ = v) is the probability to obtain sample path X[0, t] given the
source node v.
Note that the difficulty of solving the problem in (7.3) is the curse of dimen-
sionality. For each v such that Yv = 0, its infection time and recovery time are
required, i.e., O(t 2 ) possible choices; for each v such that Yv = 1, the infection
time needs to be considered, i.e., O(t) possible choices. Therefore, even for a
fixed t, the number of possible sample paths is at least of the order of t N , where
N is the number of nodes in the network. This curse of dimensionality makes
the computation expensive. Zhu and Ying [200] introduced a sample path based
approach to overcome this difficulty.

7.4 Sample Path Based Detection

Instead of computing the marginal probability, Zhu and Ying [200] proposed to
identify the sample path X∗[0, t∗] that most likely leads to Y, i.e.,

X∗[0, t∗] = arg max_{t, X[0,t]∈X(t)} Pr(X[0, t]), (7.4)

where X(t) = {X[0, t] | F(X(t)) = Y}. The source node associated with X∗[0, t∗] is
viewed as the information source.
In graph theory [77], the eccentricity e(v) of a vertex v is
the maximum distance between v and any other vertex in the graph. The Jordan
centers of a graph are the nodes which have the minimum eccentricity. For example,
in Fig. 7.2, the eccentricity of node v1 is 4 and the Jordan center is v2, whose
eccentricity is 3. Following a similar terminology, Zhu and Ying [200] define the
infection eccentricity ẽ(v) given Y as the maximum distance between v and any
infected nodes in the graph. Then, the Jordan infection centers of a graph are the
nodes with the minimum infection eccentricity given Y. In Fig. 7.2, nodes v3 , v10 ,
v13 and v14 are observed to be infected. The infection eccentricities of v1 , v2 , v3 , v4
are 2, 3, 4, 5, respectively, and the Jordan infection center is v1 .
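Infection eccentricity reduces to a BFS from each candidate node, keeping the maximum distance to an observed infected node. A minimal sketch follows; the tree and the infected set are hypothetical, not the topology of Fig. 7.2, which is not reproduced in the text.

```python
from collections import deque

def bfs_distances(adj, v):
    d = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in d:
                d[w] = d[u] + 1
                q.append(w)
    return d

def infection_eccentricity(adj, v, infected):
    """e~(v): maximum distance from v to any observed infected node."""
    d = bfs_distances(adj, v)
    return max(d[i] for i in infected)

def jordan_infection_centers(adj, infected):
    ecc = {v: infection_eccentricity(adj, v, infected) for v in adj}
    m = min(ecc.values())
    return sorted(v for v in adj if ecc[v] == m)

# Hypothetical tree: 1 is the hub; leaves 4..7 hang off 2 and 3.
tree = {1: [2, 3], 2: [1, 4, 5], 3: [1, 6, 7], 4: [2], 5: [2], 6: [3], 7: [3]}
print(jordan_infection_centers(tree, infected={4, 6}))   # [1]
```

With infected nodes 4 and 6, node 1 has infection eccentricity 2 while every other node has at least 3, so node 1 is the unique Jordan infection center.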
Zhu and Ying [200] proved that the propagation source associated with the
optimal sample path is a node with the minimum infection eccentricity. The proof
of this result consists of three steps. First, assuming the information source is vr,
they analyze tvr∗ such that

tvr∗ = arg max_{t, X[0,t]∈X(t)} Pr(X[0, t]|v∗ = vr), (7.5)

i.e., tvr∗ is the time duration of the optimal sample path in which vr is the information
source. They proved that tvr∗ equals the infection eccentricity of node vr. In the
second step, they consider two neighboring nodes, say nodes v1 and v2 . They proved
that if ẽ(v1 ) < ẽ(v2 ), then the optimal sample path rooted at v1 occurs with a higher
probability than the optimal sample path rooted at v2 . At the third step, they proved
that given any two nodes u and v, if v has the minimum infection eccentricity and u

Fig. 7.2 An example illustrating the infection eccentricity

has a larger infection eccentricity, then there exists a path from u to v along which
the infection eccentricity monotonically decreases, which implies that the source of
the optimal sample path must be a Jordan infection center. For example, in Fig. 7.2,
node v4 has a larger infection eccentricity than v1 and v4 → v3 → v2 → v1 is the
path along which the infection eccentricity monotonically decreases from 5 to 2. In
the next subsection, we briefly explain the techniques involved in these three steps.

7.5 The Sample Path Based Estimator

Lemma 7.1 Consider a tree network rooted at vr and with infinitely many levels.
Assume the information source is the root, and the observed infection topology is
Y which contains at least one infected node. If ẽ(vr ) ≤ t1 < t2 , then the following
inequality holds

max_{X[0,t1]∈X(t1)} Pr(X[0, t1]) > max_{X[0,t2]∈X(t2)} Pr(X[0, t2]), (7.6)

where X (t) = {X[0, t]|F(X(t)) = Y}. In addition,

tvr∗ = ẽ(vr) = max_{u∈I} d(vr, u), (7.7)

where d(vr, u) is the length of the shortest path between vr and u (also called the
distance between vr and u), and I is the set of infected nodes.
This lemma states that the optimal time is equal to the infection eccentricity. The
next lemma states that the optimal sample path rooted at a node with a smaller
infection eccentricity is more likely to occur.
Lemma 7.2 Consider a tree network with infinitely many levels. Assume the
information source is the root, and the observed infection topology is Y which
contains at least one infected node. For u, v ∈ V such that (u, v) ∈ E, if tu∗ > tv∗ ,
then

P r(X∗u ([0, tu∗ ])) < P r(X∗v ([0, tv∗ ])), (7.8)

where X∗u ([0, tu∗ ]) is the optimal sample path starting from node u.
Proof Denote by Tv the tree rooted at v and by Tu−v the tree rooted at u but without
the branch from v; see Tv1−v9 and Tv2−v7 in Fig. 7.2. Furthermore, denote by C(v)
the set of children of v. The sample path X[0, t] restricted to Tu−v is denoted by
X([0, t], Tu−v).
The first step is to show tu∗ = tv∗ + 1. Note that Tv−u ∩ I ≠ ∅; otherwise, all
infected nodes are on Tu−v. As T is a tree, v can only reach nodes in Tu−v through
edge (u, v), so tv∗ = tu∗ + 1, which contradicts tu∗ > tv∗.

If Tu−v ∩ I ≠ ∅, then ∀a ∈ Tu−v ∩ I,

d(u, a) = d(v, a) − 1 ≤ tv∗ − 1, (7.9)

and ∀b ∈ Tv−u ∩ I ,

d(u, b) = d(v, b) + 1 ≤ tv∗ + 1. (7.10)

Hence,

tu∗ ≤ tv∗ + 1, (7.11)

which implies that

tv∗ < tu∗ ≤ tv∗ + 1, (7.12)

Hence, we obtain tu∗ = tv∗ + 1.


The second step is to prove that tvI = 1 on the sample path X∗u ([0, tu∗ ]). If tvI > 1
on X∗u ([0, tu∗ ]), then

tu∗ − tvI = tv∗ + 1 − tvI < tv∗ , (7.13)

According to the definitions of tu∗ and tvI, within tu∗ − tvI time slots, node v can infect
all infected nodes on Tv−u. Since tu∗ = tv∗ + 1, the infected node farthest from node u
must be on Tv−u, which implies that there exists a node a ∈ Tv−u such that d(u, a) =
tu∗ = tv∗ + 1 and d(v, a) = tv∗. So node v cannot reach a within tu∗ − tvI time slots,
which contradicts the fact that the infection can spread from node v to a within
tu∗ − tvI time slots along the sample path X∗u [0, tu∗ ]. Therefore, tvI = 1.
Now given sample path X∗u ([0, tu∗ ]), the third step is to construct X∗v ([0, tv∗ ])
which occurs with a higher probability. The sample path X∗u ([0, tu∗ ]) can be divided
into two parts along subtrees Tu−v and Tv−u . Since tvI = 1, then

P r(X∗u ([0, tu∗ ])) = q · P r(X∗u ([0, tu∗ ], Tv−u )|tvI = 1) · P r(X∗u ([0, tu∗ ], Tu−v )).
(7.14)
Suppose in X∗v ([0, tv∗ ]), node u was infected at the first time slot, then

P r(X∗v ([0, tv∗ ])) = q · P r(X∗v ([0, tv∗ ], Tv−u )|tvI = 1) · P r(X∗v ([0, tv∗ ], Tu−v )|tuI = 1).
(7.15)
For the subtree Tv−u , given X∗u ([0, tu∗ ], Tv−u ), in which tvI = 1, the partial sample
path X∗v ([0, tv∗ ], Tv−u ) can be constructed to be identical to X∗u ([0, tu∗ ], Tv−u ) except
that all events occur one time slot earlier, i.e.,

X∗v ([0, tv∗ ], Tv−u ) = X∗u ([0, tu∗ ], Tv−u ). (7.16)



Then,

Pr(X∗u([0, tu∗], Tv−u)|tvI = 1) = Pr(X∗v([0, tv∗], Tv−u)). (7.17)

For the subtree Tu−v, X∗v([0, tv∗], Tu−v) can be constructed such that

X∗v([0, tv∗], Tu−v) ∈ arg max_{X̃([0,tv∗],Tu−v)∈X(tv∗,Tu−v)} Pr(X̃([0, tv∗], Tu−v)|tuI = 1). (7.18)
According to Lemma 7.1, the following inequality is satisfied:

max_{X̃([0,tv∗],Tu−v)∈X(tv∗,Tu−v)} Pr(X̃([0, tv∗], Tu−v)|tuI = 1)

= max_{X̃([0,tu∗−1],Tu−v)∈X(tu∗−1,Tu−v)} Pr(X̃([0, tu∗ − 1], Tu−v)|tuI = 1) (7.19)

> max_{X̃([0,tu∗],Tu−v)∈X(tu∗,Tu−v)} Pr(X̃([0, tu∗], Tu−v))

Therefore, given the optimal sample path rooted at u, a sample path rooted at v can
be constructed, which occurs with a higher probability. The lemma holds.
The following lemma gives a useful property of the Jordan infection centers.
Lemma 7.3 On a tree network with at least one infected node, there exist at most
two Jordan infection centers. When the network has two Jordan infection centers,
the two must be neighbors.
The following theorem states that the sample path based estimator is one of the
Jordan infection centers.
Theorem 7.1 Consider a tree network with infinitely many levels. Assume that the
observed infection topology Y contains at least one infected node. Then the source
node associated with X∗ [0, t ∗ ] (the solution to the optimization problem (7.4)) is a
Jordan infection center, i.e.,

v̂ = arg min_{v∈V} ẽ(v). (7.20)

Proof Assume the network has two Jordan infection centers: w and u, and assume
ẽ(w) = ẽ(u) = λ. Based on Lemma 7.3, w and u must be adjacent. The following
steps show that, for any a ∈ V \{w, u}, there exists a path from a to u (or w) along
which the infection eccentricity strictly decreases.
First, it is easy to see from Fig. 7.3 that d(γ , w) ≤ λ − 1 ∀γ ∈ Tw−u ∩ I . Then,
there exists a node ξ such that the equality holds. Suppose that d(γ , w) ≤ λ − 2 for
any γ ∈ Tw−u ∩ I , which implies

d(γ , u) ≤ λ − 1 ∀ γ ∈ Tw−u ∩ I. (7.21)



Fig. 7.3 A pictorial description of the positions of nodes a, u, w and ξ

Since w and u are both Jordan infection centers, we have, ∀ γ ∈ Tu−w ∩ I,

d(γ, w) ≤ λ and d(γ, u) ≤ λ − 1. (7.22)

In summary, ∀ γ ∈ I,

d(γ , u) ≤ λ − 1. (7.23)

This contradicts the fact that ẽ(w) = ẽ(u) = λ. Therefore, there exists ξ ∈ Tw−u ∩ I
such that

d(ξ, w) = λ − 1. (7.24)

Similarly, ∀ γ ∈ Tu−w ∩ I

d(γ , u) ≤ λ − 1. (7.25)

and there exists a node such that the equality holds.


If a ∈ V \{w, u}, a ∈ Tu−w and d(a, u) = β, then for any γ ∈ Tw−u ∩ I , we have

d(a, γ) = d(a, u) + d(u, w) + d(w, γ) ≤ β + 1 + λ − 1 = λ + β, (7.26)

and there exists ξ ∈ Tw−u ∩ I such that the equality holds. On the other hand,
∀ γ ∈ Tu−w ∩ I

d(a, γ) ≤ d(a, u) + d(u, γ) ≤ β + λ − 1. (7.27)

Therefore,

ẽ(a) = λ + β, (7.28)

so the infection eccentricity decreases along the path from a to u.



Repeatedly applying Lemma 7.2 along the path from node a to u, then the
optimal sample path rooted at node u is more likely to occur than the optimal sample
path rooted at node a. Therefore, the root node associated with the optimal sample
path X∗ [0, t ∗ ] must be a Jordan infection center. The theorem holds.

7.6 Reverse Infection Algorithm

Zhu and Ying [200] further proposed a reverse infection algorithm to identify the
Jordan infection centers. The key idea of the algorithm is to let every infected node
broadcast a message containing its identity (ID) to its neighbors. Each node, after
receiving messages from its neighbors, checks whether the ID in the message has
been received. If not, the node records the ID (say v), the time at which the message
is received (say tv ), and then broadcasts the ID to its neighbors. When a node
receives the IDs of all infected nodes, it claims itself as the propagation source and
the algorithm terminates. If there are multiple nodes receiving all IDs at the same
time, the tie is broken by selecting the node with the smallest total reception time.
The details of the algorithm are presented in Algorithm 7.1.
Simulations on g-regular trees were conducted. The infection probability q was
chosen uniformly from (0, 1) and the recovery probability p was chosen uniformly
from (0, q). The propagation duration t was chosen uniformly from [3, 20].
The experimental results show that the detection rates of both the reverse infection
and closeness centrality algorithms increase as the degree increases, and are higher
than 60% when g > 6.

Algorithm 7.1: Reverse infection algorithm

for i ∈ I do
    i sends its ID wi to its neighbors.
while t ≥ 1 and STOP == 0 do
    for u ∈ V do
        if u receives wi for the first time then
            Set tui = t and then broadcast the message wi to its neighbors.
    If there exists a node that has received |I| distinct messages, then set STOP = 1.
Return: v̂ ∈ arg min_{u∈S} Σ_{i∈I} tui, where S is the set of nodes that have received |I| distinct messages when the algorithm terminates.
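The message passing in Algorithm 7.1 can be sketched as follows: each infected node's ID floods outward one hop per slot, tui records the arrival times, and ties are broken by the smallest total arrival time. The example path graph and node labels are illustrative choices.

```python
def reverse_infection(adj, infected):
    """Sketch of Algorithm 7.1 on a connected graph: each infected node's ID
    floods one hop per time slot; the first node to collect all |I| IDs wins."""
    arrival = {u: {} for u in adj}          # arrival[u][i] = t_u^i
    frontier = {i: {i} for i in infected}   # nodes reached by ID i so far
    for i in infected:
        arrival[i][i] = 0
    t = 0
    while True:
        t += 1
        for i in infected:
            new = set()
            for u in frontier[i]:
                for w in adj[u]:
                    if i not in arrival[w]:
                        arrival[w][i] = t   # first reception of ID i at w
                        new.add(w)
            frontier[i] |= new
        done = [u for u in adj if len(arrival[u]) == len(infected)]
        if done:
            # tie-break: smallest sum of reception times sum_i t_u^i
            return min(done, key=lambda u: sum(arrival[u].values()))

# Path graph with infected nodes at both ends: the midpoint (the Jordan
# infection center) collects both IDs first.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(reverse_infection(path, infected=[1, 5]))   # 3
```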
Chapter 8
Source Identification Under Sensor
Observations: A Gaussian Source
Estimator

In this chapter, we introduce a propagation source estimator under sensor observa-
tions: the Gaussian source estimator. According to Chap. 5, sensors are first injected
into networks, and then the propagation dynamics over these sensor nodes are
collected, including their states, state transition times and infection directions.
Many approaches have been proposed under sensor observations, including the
Bayesian based method, the Gaussian based method, the Moon-Walk based method, etc.
Here, we present the details of the Gaussian based method. For the
techniques involved in other methods, readers could refer to Chap. 9 for details.

8.1 Introduction

Localizing the source of a malicious attack, such as rumors or computer viruses,


is an extremely desirable but challenging task. The ability to estimate the source is
invaluable in helping authorities contain the malicious attacks or infection. In this
context, the inference of the unknown source was analyzed in [160, 161]. These
methods assume that we know the state of all nodes in the network. However,
observing an entire network is hardly possible. Pinto et al. [146]
proposed to locate the source of propagation under the practical constraint that only
a small fraction of nodes can be observed. This is the case, for example, when
locating a spammer who is sending undesired emails over the Internet, where it
is clearly impossible to monitor all the nodes. Thus, the main difficulty is to develop
tractable estimators that can be efficiently implemented (i.e., with subexponential
complexity), and that perform well on multiple topologies.


8.2 Network Model

We first introduce the network model used in [146]. The underlying network where
a malicious attack takes place is modeled by a finite, undirected graph G(V , E),
where the vertex/node set V has N nodes, and the edge set E has L edges (see
Fig. 8.1 for illustration). Assume that the graph G is known a priori. The propagation
source, s∗ ∈ G, is the vertex that initiates the propagation. Suppose s∗ is a random
variable with uniform distribution over the set V, so that any node in the network
is equally likely to be the source.
The propagation process is modeled as follows. At time t, each vertex u ∈ G
presents one of two possible statuses: (a) infected, if it has already been attacked
from any of its neighbors; or (b) susceptible, if it has not been infected/attacked so
far. Let V (u) denote the set of vertices directly connected to u, i.e., the neighborhood
or vicinity of u. Suppose u is in the susceptible state and, at time tu , receives the
attack for the first time from one neighbor, say s, thus becoming infected. Then, u
will re-transmit the malicious attack to all its other neighbors, so that each neighbor
v ∈ V(u)\s receives the attack at time tu + θuv, where θuv denotes the random
propagation delay associated with edge (u, v). The random variables {θuv} for
different edges (u, v) have a known, arbitrary joint distribution. The propagation
process is initiated by the source s∗ at an unknown time t = t∗.
Let O := {ok}Kk=1 ⊆ G denote the set of K observers/sensors, whose locations on
the network G are known. Each observer measures from which neighbor and
at what time it received the attack. Specifically, if tv,o denotes the absolute time at
which observer o receives the attack from its neighbor v, then the observation set is
composed of tuples of direction and time measurements, i.e., O := {(o, v, tv,o)}, for
all o ∈ O and v ∈ V(o).

Fig. 8.1 Source estimation on graph G. At time t = t∗, the propagation source s∗ initiates the propagation of a malicious attack. In this example, there are three observers/sensors, which measure from which neighbors and at what time they were attacked. The goal is to estimate, from these observations, which node in G is the propagation source

8.3 Source Estimator

Pinto et al. [146] recovered the source location from measurements taken at the
observers by adopting a maximum probability of localization criterion, which
corresponds to designing an estimator ŝ(·) such that the localization probability
Ploc := P(ŝ(O) = s∗) is maximized. Since s∗ is assumed to be uniformly
distributed over the network G, the optimal estimator is the maximum likelihood (ML)
estimator,

ŝ(O) = arg max_{s∈G} P(O|s∗ = s)

= arg max_{s∈G} Σ_{Ξ∈Ξs} P(Ξ|s∗ = s) × ∫ · · · ∫ g(θ1, · · · , θL, O, Ξ, s) dθ1 · · · dθL. (8.1)

Here, Ξs denotes the set of all possible path collections {Ps,ok}Kk=1 between the source s and the
observers in the graph G, and {θl}Ll=1 represents the propagation delays for all
edges of the graph G. The deterministic function g depends on the joint distribution
of the propagation delays. Note that the estimator in Eq. (8.1) is computed by
averaging over two different sources of randomness: (a) the uncertainty in the paths
that the attack takes to reach the observers, and (b) the uncertainty in the time that
the attack takes to cross the edges of the graph G. Due to the combinatorial nature of
Eq. (8.1), its complexity increases exponentially with the number of nodes in G, and
the estimator is therefore intractable. To address this problem, Pinto et al. [146] proposed
a method of complexity O(N) that is optimal for general trees, and a strategy of
complexity O(N 3) that is suboptimal for general graphs.

8.4 Source Estimator on a Tree Graph

Consider the case of an underlying tree T. Because a tree does not contain cycles,
only a subset Oa ⊆ O of the observers will receive the attack emitted by the
unknown source. Then, Oa = {ok}Kak=1 is called the set of Ka active observers.
The observations made by the nodes in Oa provide two types of information. (a)
The first is the direction from which the attack arrives at the active observers,
which uniquely determines a subtree Ta ⊆ T of regular nodes. Hence, Ta is called the
active subtree (see the left figure in Fig. 8.2). (b) The second is the
timing at which the attack arrives at the active observers, denoted by {tk}Kak=1, which
is used to localize the source within the set Ta. It is also convenient to label the edges
of Ta as E(Ta) = {1, 2, · · · , Ea}, so that the propagation delay associated with edge
i ∈ E(Ta) is denoted by θi (see the left figure in Fig. 8.2). Assume that the propagation
delays associated with the edges of T are independent, identically distributed random
variables with Gaussian distribution N(μ, σ 2), where the mean μ and variance σ 2
are known. Based on these definitions, the following result can be concluded.

Fig. 8.2 (a) Active tree Ta. The vector next to each candidate source s is the normalized deterministic delay µ̃s := µs/μ. The normalized delay covariance for this tree is Ã := A/σ 2 = [5, 2; 2, 4]. (b) Equiprobability contours of the probability density function P(d|s∗ = s) for all s ∈ Ta, and the corresponding decision regions. For a given observation d, the optimal estimator chooses the source s that maximizes P(d|s∗ = s)

Proposition 8.1 (Optimal Estimation in General Trees) For a general propagation tree T, the optimal estimator is given by

ŝ = arg max_{s∈Ta} µs^T A^{-1} (d − (1/2)µs),   (8.2)

where d is the observed delay, µs is the deterministic delay, and A is the delay
covariance, given by

[d]_k = t_{k+1} − t_1,   (8.3)

[µs]_k = μ(|P(s, o_{k+1})| − |P(s, o_1)|),   (8.4)

[A]_{k,i} = σ² · |P(o_1, o_{k+1})|,                          if k = i,
[A]_{k,i} = σ² · |P(o_1, o_{k+1}) ∩ P(o_1, o_{i+1})|,        if k ≠ i,   (8.5)

for k, i = 1, · · · , Ka − 1, with |P(u, v)| denoting the number of edges (i.e., the
length) of the path connecting vertices u and v. Readers can refer to [146] for the
detailed proof of this proposition.
Note that when node s is chosen as the source, µs and A represent, respectively,
the mean and the covariance of the observed delay d (see Fig. 8.1 for an illustration).
Proposition 8.1 reduces the estimator in Eq. (8.1) to a tractable expression, whose
parameters can be simply obtained from path lengths in the tree T. Furthermore,
the complexity of Eqs. (8.2)–(8.5) scales as O(N) with the number of nodes N in
the tree. The full proof of the complexity is given in the Supplemental Material of
[146]. When the observers are sparsely placed, the distance between observers is
large, and so is the number of independent random variables in the sum

d_k = t_{k+1} − t_1 = Σ_{i∈P(s*, o_{k+1})} θ_i − Σ_{i∈P(s*, o_1)} θ_i.   (8.6)

Based on the central limit theorem, the observed delay vector d can therefore be
closely approximated by a Gaussian random vector.
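To make the computation concrete, the tree estimator of Eqs. (8.2)–(8.5) can be sketched in a few lines of Python. This is an illustrative implementation rather than Pinto et al.'s reference code: the adjacency-dict graph representation, the function names, and the search over all tree nodes instead of only the active subtree Ta are our own simplifications.

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src to every node of the tree."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gaussian_tree_estimator(adj, observers, times, mu, sigma2):
    """Evaluate Eqs. (8.2)-(8.5): observed delays d, deterministic delays
    mu_s, delay covariance A, and the score mu_s^T A^{-1} (d - mu_s / 2)."""
    o1, rest = observers[0], observers[1:]
    d = [times[k + 1] - times[0] for k in range(len(rest))]   # Eq. (8.3)
    d1 = bfs_dist(adj, o1)
    dists = {o: bfs_dist(adj, o) for o in rest}
    K = len(rest)
    A = [[0.0] * K for _ in range(K)]
    for k, ok in enumerate(rest):                             # Eq. (8.5)
        for i, oi in enumerate(rest):
            if k == i:
                A[k][i] = sigma2 * d1[ok]
            else:
                # shared length of P(o1,ok) and P(o1,oi) on a tree
                A[k][i] = sigma2 * (d1[ok] + d1[oi] - dists[ok][oi]) / 2
    best, best_score = None, None
    for s in adj:        # for simplicity, search all nodes, not only Ta
        ds = bfs_dist(adj, s)
        mus = [mu * (ds[o] - ds[o1]) for o in rest]           # Eq. (8.4)
        x = solve(A, [dk - 0.5 * m for dk, m in zip(d, mus)])
        score = sum(m * xi for m, xi in zip(mus, x))          # Eq. (8.2)
        if best_score is None or score > best_score:
            best, best_score = s, score
    return best
```

For instance, on a six-node tree with observers 3, 5, and 0 reporting arrival times 1, 3, and 2 under unit mean delay, the estimator returns node 2, the node whose deterministic delays match the observations exactly.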

8.5 Source Estimator on a General Graph

Now consider the most general case of source estimation on a general graph
G. When the malicious attack is propagated on the network, there is a tree
corresponding to the first time each node gets informed, which spans all nodes
in G. Note that the number of spanning trees can be exponentially large. Pinto
et al. [146] assume that the actual propagation tree is a breadth-first search (BFS)
tree. This corresponds to assuming that the attack travels from the source to each
observer along a minimum-length path, which is intuitively satisfying. Then, the
resulting estimator can be written as

ŝ = arg max_{s∈G} S(s, d, T_bfs,s),   (8.7)

where S(s, d, T_bfs,s) = µs^T A_s^{-1} (d − (1/2)µs), with the parameters µs and As
computed with respect to the BFS tree T_bfs,s rooted at s. It can easily be shown that
the complexity of Eq. (8.7) scales polynomially with N, as O(N³).
Therefore, Eqs. (8.2) and (8.7) give the propagation source estimators for a tree
graph and a general graph, respectively. The computational complexities of these two
estimators are O(N) and O(N³), respectively. We call these estimators Gaussian
source estimators.
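The BFS-tree assumption underlying Eq. (8.7) is simple to sketch: for each candidate source, the general graph is collapsed to a breadth-first-search tree, on which the tree estimator can then be reused. Below is a minimal sketch of the tree-construction step; the adjacency-dict representation and the helper name are our own.

```python
from collections import deque

def bfs_tree(adj, root):
    """Breadth-first-search tree rooted at `root`, as an adjacency dict.
    Under the assumption in [146], the attack reaches every node along
    such a minimum-length tree, so the tree estimator can run on it."""
    tree = {root: []}
    seen = {root}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree[u].append(v)
                tree.setdefault(v, []).append(u)  # keep the tree undirected
                q.append(v)
    return tree
```

Running the tree estimator of Sect. 8.4 once per candidate, each time on that candidate's BFS tree, is what yields the O(N³) overall cost.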
To test the effectiveness of the proposed approach, Pinto et al. [146] used the
well-documented case of cholera outbreak that occurred in the KwaZulu-Natal
province, South Africa, in 2000. Propagation source identification was performed
by monitoring the daily cholera cases reported in K communities (the observers).
The experimental results show that by monitoring only 20% of the communities,
the Gaussian estimator achieves an average error of less than four hops between
the estimated source and the first infected community. This small distance error
may enable a faster emergency response from the authorities in order to contain an
outbreak.
Chapter 9
Comparative Study and Numerical
Analysis

This chapter provides an extensive literature review on identifying the propagation
source of malicious attacks, tracing research trends and hierarchically reviewing
the contributions along each research line. The chapter consists of three parts. We first review the
existing approaches and analyze their pros and cons. Then, numerical studies are
provided under various experiment settings and diffusion scenarios. Finally,
we summarize remarks on the existing approaches. Throughout, we use rumor
propagation as a running example to analyze these approaches.

9.1 Comparative Study

Current source identification methods can be categorized into three categories in


accordance with the three different types of observations: complete observations,
snapshot observations, and sensor observations. The approaches for source identi-
fication developed under each type of observations are introduced in the following
subsections.

9.1.1 Methods Based on Complete Observations

In this subsection, we summarize the methods of source identification developed


under complete observations. There are two main techniques in this category: rumor
center and eigenvector center based methods (see Fig. 9.1).


Fig. 9.1 Taxonomy of current source identification methods

9.1.1.1 Single Rumor Center

Shah and Zaman [160, 161] introduced rumor centrality for source identification.
They assume that information spreads in tree-like networks and the information
propagation follows the SI model. They also assume each node receives information
from only one of its neighbors. Since we consider the complete observations of
networks, the source node must be one of the infected nodes. This method is
proposed for the propagation of rumors originating from a single source. Assuming
an infected node as the source, its rumor centrality is defined as the number of
distinct propagation paths originating from the source. The node with the maximum
rumor centrality is called the rumor center. For regular trees, the rumor center is
considered as the propagation origin. For generic networks, researchers employ BFS
trees to represent the original networks. Each BFS tree corresponds to a probability
ρ of a rumor that chooses this tree as the propagation path. In this case, the source
node is revised as the one that holds the maximum product of rumor centrality and ρ.
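For a tree, rumor centrality has a closed form, R(v) = n!/∏_u T_u^v, where T_u^v is the size of the subtree rooted at u when the whole tree is rooted at v. The sketch below is our own illustrative code, assuming the infected tree is given as an adjacency dict.

```python
import math

def rumor_centrality(adj, v):
    """Number of distinct infection orderings starting at v:
    R(v) = n! / (product of all subtree sizes when rooted at v)."""
    sizes = {}
    def subtree(u, parent):
        s = 1
        for w in adj[u]:
            if w != parent:
                s += subtree(w, u)
        sizes[u] = s
        return s
    n = subtree(v, None)
    prod = 1
    for s in sizes.values():
        prod *= s
    return math.factorial(n) // prod     # always an exact integer

def rumor_center(adj):
    """Node with the maximum rumor centrality."""
    return max(adj, key=lambda v: rumor_centrality(adj, v))
```

On a five-node path, the middle node has rumor centrality 6 while a leaf has 1, so the rumor center is the middle node, consistent with its equivalence to the closeness center on trees.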
In essence, the method is to seek a node from which the propagation matches the
complete observation the best. As proven in [160, 161], the rumor center is equiva-
lent to the closeness center for a tree-like network. However, for a generic network,
the closeness center may not equal the rumor center. The effectiveness of the method
is further examined by the work in [162]. The authors proved that the rumor center
method can still provide guaranteed accuracy when relaxing two assumptions: the
exponential spreading time and the regular trees. This method was further explored
in a snapshot scenario where each node reveals whether it has been infected with
probability μ [89]. When μ is large enough, the authors proved that the accuracy of
the rumor center method can still be guaranteed. Wang et al. [178] extended the
discussion of the single rumor center to a more complex scenario with multiple
snapshots. Although a snapshot only provides partial knowledge of rumor spreading,
the authors proved that multiple independent snapshots can dramatically improve
the detection probability, even compared with temporally sequential snapshots. The
analysis in [178] suggested that the complete observation of rumor propagation can
be approximated by multiple independent snapshots.
However, this method relies on several strong assumptions that are far from reality.
First, it considers a very special class of networks: infinite trees. Generic networks
have to be reconstructed into BFS trees before seeking propagation origins. Second,
rumors are implicitly assumed to spread in a unicast way (i.e., an infectious node can
only infect one of its neighbors at each time step). Third, the infection probability
between neighboring nodes is assumed to be 1. In the real world, however, networks
are far more complex than trees, rumors often spread in multicast or broadcast ways,
and the infection probabilities between neighboring nodes differ from each other.

9.1.1.2 Local Rumor Center

Following the assumptions in the rumor center method, Dong et al. [44] proposed a
local rumor center method to identify rumor sources. This method designates a set of
nodes as suspicious sources. Therefore, it reduces the scale of seeking origins. They
extended the approaches and results in [160] and [161] to identify the source of
propagation in networks. Following the definition of the rumor center, they defined
the local rumor center as the node with the highest rumor centrality compared to
other suspicious infected nodes. The local rumor center is considered as the rumor
source.
For regular trees in which every node has degree d, the authors analyzed the
accuracy γ of the local rumor center method. To construct a regular tree, the degree
d of each node must be at least 2. For regular trees, Dong et al. [44] derived the
following conclusions. (1) When d = 2, the accuracy of the local rumor center
method follows O(1/√n), where n is the number of infected nodes; therefore,
when n is sufficiently large, the accuracy is close to 0. (2) When the suspicious set
degenerates into the entire network, the accuracy γ grows from 0.25 to 0.307 as
d increases from 3 to +∞. This means the minimum accuracy γ is 25% and
the maximum accuracy is 30.7%. (3) When the suspicious nodes form a connected
subgraph of the network, the accuracy γ significantly exceeds 1/k when d = 3,
where k is the number of suspicious nodes. (4) When there are only two suspect
nodes, the accuracy γ is at least 0.75 if d = 3, and γ increases with the distance
between the two suspects. (5) When multiple suspicious nodes form a connected
subgraph, the accuracy γ is lower than when these nodes form several disconnected
subgraphs.
The local rumor center is actually the node with the highest rumor centrality
in the priori set of suspects. The advantage of the local rumor center method is
that it dramatically reduces the source-searching scale. However, it has the same
drawbacks as the single rumor center method.

9.1.1.3 Multiple Rumor Centers

Luo et al. [115] extended the single rumor center method to identify multiple
sources. In addition to the basic assumptions, they further assumed the number of
sources was known for the method of identifying multiple rumor centers. Based on
the definition of rumor centrality for a single node, Luo et al. [115] extended rumor
centrality to a set of nodes, which is defined as the number of distinct propagation
paths originating from the set. They proposed a two-source estimator to compute
the rumor centrality when there were only two sources. For multiple sources, they
proposed a two-step method. In the first step, they assumed a set of infected nodes
as sources. All infected nodes were divided into different partitions by using the
Voronoi partition algorithm [76] on these sources. The single rumor center method
was then employed to identify the source in each partition. In the second step,
estimated sources were calibrated by the two-source estimator between any two
neighboring partitions. These two steps were iterated until the estimated sources
become steady.
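The two-step iteration can be sketched as follows. To keep the sketch short and self-contained, we use distance centres as a stand-in for the per-partition rumor centres, and a fixed iteration cap; both are our simplifications of Luo et al.'s procedure.

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src within the infected graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def estimate_sources(adj, k, max_iter=20):
    """Two-step sketch: Voronoi-partition the infected graph around the
    current source guesses, re-centre each partition, and iterate."""
    nodes = sorted(adj)
    sources = nodes[:k]                         # arbitrary initial guess
    for _ in range(max_iter):
        dist = {s: bfs_dist(adj, s) for s in sources}
        parts = {s: [] for s in sources}
        for v in nodes:                         # Voronoi partition step
            parts[min(sources, key=lambda s: dist[s][v])].append(v)
        new = [min(members,
                   key=lambda v: sum(bfs_dist(adj, v)[m] for m in members))
               for members in parts.values()]   # re-centre each partition
        if sorted(new) == sorted(sources):      # estimates became steady
            break
        sources = new
    return sorted(sources)
```

On a graph made of two star subgraphs joined by an edge, the iteration converges to the two star centres.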
Luo et al. [115] are the first to employ the rumor center method to identify
multiple rumor sources. They further investigate the performance of the two-source
estimator on geometric trees [161]. The accuracy approximates to 1 when the
infection graph becomes large. This method has also been extended to identify
multiple sources with snapshot observations. Because snapshots only provide partial
knowledge about the spreading dynamics of rumors in networks, Zang et al. [197]
introduce a score-based method to assess the states of other nodes in networks,
which indirectly form a complete observation on networks.
According to the definition of the rumor centrality of a set of nodes, we need to
calculate the number of distinct propagation paths originating from the node set,
which is computationally expensive. Even though Luo et al. proposed the two-step
method to reduce the complexity, it still needs O(N^k) computations, where k is the
number of rumor sources. This method can hardly be used in the real world,
especially for large-scale networks.

9.1.1.4 Minimum Description Length

Prakash et al. [147, 148] proposed a minimum description length (MDL) method
for source identification. This method is considered for generic networks, and rumor
propagation is assumed to follow the SI model. Given an arbitrary infected
node as the source node, the minimum description length corresponds to the probability
of obtaining the infection graph. For generic networks, it is too computationally
expensive to obtain the probability. Instead, Prakash et al. [148] introduced an
upper bound of the probability and detected the origin by maximizing the upper
bound. They claimed that to maximize the upper bound is to find the smallest
eigenvalue λmin and the corresponding eigenvector umin of the Laplacian matrix
of the infection graph. The Laplacian matrix is widely used in spectral graph theory
and has many applications in various fields. This matrix is mathematically defined
as L = D −A, where D is the diagonal degree matrix and A is the adjacency matrix.
According to Prakash et al.’s work in [147, 148], the node with the largest score in
the eigenvector umin refers to the propagation source.
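The seed-selection step can be sketched with a simple power iteration. The restriction of the full-graph Laplacian to the infected nodes, and the shift (cI − L) used to reach the eigenvector of the smallest eigenvalue, are implementation choices of this sketch rather than details taken verbatim from [147, 148].

```python
def mdl_seed(adj, infected, iters=300):
    """Sketch of the MDL seed step: take the full-graph Laplacian L = D - A
    restricted to the infected nodes, find the eigenvector of its smallest
    eigenvalue by power iteration on (cI - L) (which flips the spectrum so
    the smallest eigenvalue becomes dominant), and return the node with the
    largest eigenvector entry."""
    nodes = sorted(infected)
    deg = {v: len(adj[v]) for v in nodes}     # degrees in the full graph
    c = 2.0 * max(deg.values()) + 1.0         # c exceeds the largest eigenvalue of L
    x = {v: 1.0 for v in nodes}
    inf = set(nodes)
    for _ in range(iters):
        # y = (cI - L) x restricted to infected nodes
        y = {v: (c - deg[v]) * x[v] + sum(x[w] for w in adj[v] if w in inf)
             for v in nodes}
        norm = max(abs(t) for t in y.values())
        x = {v: t / norm for v, t in y.items()}
    return max(nodes, key=lambda v: x[v])
```

On a path graph whose middle three nodes are infected, the largest-entry node of the resulting eigenvector is the central infected node, which is returned as the seed.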
This method can also be used to seek multiple sources. The authors adopted the
minimum description length (MDL) cost function [75]. This was used to evaluate
the ‘goodness’ of a node being in the source set. To search the next source node, they
first removed the previous source nodes from the infected set. Then, they replayed
the process of searching the single source in the remaining infection graph. These
two steps were iterated until the MDL cost function stopped decreasing.
Due to the high complexity of computing matrix eigenvalues, generally O(N³),
the MDL method is not suitable for identifying sources in large-scale networks.
Moreover, the number of true sources is generally unknown. Further to this, the gap
between the upper bound and the real value of the probability has not been studied,
and therefore, the accuracy of this method is not guaranteed.

9.1.1.5 Dynamic Age

Fioriti et al. [58] introduced the dynamic age method for source identification in
generic networks. The assumption for this method is the same as the MDL method.
Fioriti et al. took advantage of the correlation between the eigenvalues and the
'age' of nodes in a network. The 'oldest' nodes, associated with the largest
eigenvalues, were considered the sources of a propagation [199].
Meanwhile, they utilized the dynamical importance of a node introduced in [150],
which essentially measures the reduction of the largest eigenvalue of the adjacency
matrix after the node has been removed. A large reduction after the removal of a
node implies that the node is relevant to the 'aging' of a propagation. By combining
these two techniques, Fioriti et al. proposed the concept of the dynamical age of an
arbitrary node i as follows,

DA_i = |λ_m − λ_m^i| / λ_m,   (9.1)

where λ_m is the maximum eigenvalue of the adjacency matrix, and λ_m^i is the
maximum eigenvalue of the adjacency matrix after node i is removed. The nodes
with the highest dynamic age are considered the sources.
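Equation (9.1) translates directly into code. The sketch below estimates the largest adjacency eigenvalue by power iteration on A + cI (the shift avoids oscillation on bipartite graphs); the shift constant and the dict-based matrix-free representation are our own choices.

```python
def max_eigenvalue(adj, nodes, iters=200):
    """Largest adjacency eigenvalue of the subgraph induced by `nodes`,
    via power iteration on A + cI followed by a Rayleigh quotient of A."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    c = max(len(adj[v]) for v in nodes) + 1.0  # shift past the spectrum of -A
    keep = set(nodes)
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        y = {v: c * x[v] + sum(x[w] for w in adj[v] if w in keep)
             for v in nodes}
        norm = max(abs(t) for t in y.values())
        x = {v: t / norm for v, t in y.items()}
    num = sum(x[v] * sum(x[w] for w in adj[v] if w in keep) for v in nodes)
    den = sum(t * t for t in x.values())
    return num / den

def dynamic_age(adj):
    """DA_i = (lambda_m - lambda_m^i) / lambda_m for every node i (Eq. 9.1)."""
    lam = max_eigenvalue(adj, list(adj))
    return {i: (lam - max_eigenvalue(adj, [v for v in adj if v != i])) / lam
            for i in adj}
```

On a star graph, removing the centre collapses the largest eigenvalue from 2 to 0, so the centre attains the maximum dynamic age of 1 and is reported as the source.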
This method is essentially different from the previous MDL method. The MDL
method is to find the smallest eigenvalues and the corresponding eigenvectors of
Laplacian matrices, while the dynamic age method is to find the largest eigenvalues
of the adjacency matrix.
Similar to the MDL method, the dynamic age method is not suitable for
identifying sources in large-scale networks due to the complexity of calculating
eigenvectors. Moreover, since there is no threshold to determine the oldest nodes,
the number of source nodes is uncertain.

9.1.2 Methods Based on Snapshots

In the real world, a complete observation of an entire network is hardly possible,
especially for large-scale networks. Snapshots are observations closer to reality,
but a snapshot only provides partial knowledge of a propagation in a network. Three
techniques of source identification have been developed on snapshots: Jordan center,
message passing, and concentricity based methods (see the taxonomy in Fig. 9.1).

9.1.2.1 Jordan Center

Zhu and Ying [201] proposed the Jordan center method for rumor source identification.
They assumed that the rumor propagates in tree-like networks and the propagation
follows the SIR model. All infected nodes are given, but susceptible and
recovered nodes are indistinguishable. This method was proposed for single-source
propagation. Zhu and Ying [201] proposed a sample-path-based approach to
identify the propagation source. An optimal sample path is the one which most
likely leads to the observed snapshot of the network. The source associated with
the optimal sample path was proven to be the Jordan center of the infection graph,
which is then considered the rumor source.
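The Jordan center itself is straightforward to compute on the observed infection graph: it is the node minimizing the maximum hop distance (eccentricity) to the infected nodes. A brute-force sketch, with our own adjacency-dict representation and one BFS per candidate:

```python
from collections import deque

def jordan_center(adj, infected):
    """Node minimizing the maximum hop distance to any infected node."""
    best, best_ecc = None, None
    for v in adj:
        dist = {v: 0}
        q = deque([v])
        while q:                       # BFS from candidate v
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        ecc = max(dist[i] for i in infected)
        if best_ecc is None or ecc < best_ecc:
            best, best_ecc = v, ecc
    return best
```

On a five-node path with both endpoints infected, the middle node (eccentricity 2) is returned.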
Zhu and Ying [202] further extended the sample path based approach to the het-
erogeneous SIR model. Heterogeneous SIR model means the infection probabilities
between any two neighboring nodes are different, and the recovery probabilities of
infected nodes differ from each other. They proved that on infinite trees, the source
node associated with the optimal sample path was also the Jordan center. Moreover,
Luo et al. [114, 116] investigated the sample path based approach in the SI and SIS
models. They obtained the same conclusion as in the SIR model.
Similar to rumor center based methods, the Jordan center method is considered
on infinite tree-like networks, which are far from real-world networks.

9.1.2.2 Dynamic Message Passing

Lokhov et al. [111] proposed the dynamic message-passing (DMP) method by


assuming that propagation follows the SIR model in generic networks. Only
propagation time t and the states of a set of nodes at time t are known. The DMP
method is based on the dynamic equations approach proposed in [90]. Assuming an
arbitrary node as the source node, it first estimates the probabilities of other nodes to
be in different states at time t. Then, it multiplies the probabilities of the observed
set of nodes being in the observed states. The source node which can obtain the
maximum product is considered the propagation origin.
The DMP method takes into account the spreading dynamics of the propagation
process. This is very different from the previous centrality based methods (e.g.,
rumor center and Jordan center based methods). Lokhov et al. [111] claimed that
the DMP source identification method dramatically outperformed the previous
centrality based methods.
An important prerequisite of the DMP method is that we must know the
propagation time t. However, the propagation time t is generally unknown. Besides,
the computational complexity of this method is O(tN 2 d), where N is the number
of nodes in a network and d is the average degree of the network. If the underlying
network is strongly connected, it will be computationally expensive to use the DMP
method to identify the propagation source.

9.1.2.3 Effective Distance Based Method

Assuming propagation follows the SI model in weighted networks, Brockmann and
Helbing [24] proposed an effective distance based method for rumor source
identification. This method is considered for another type of snapshot: the wavefront.
Brockmann and Helbing [24] first proposed a new concept, effective distance,
to represent the propagation process. The effective distance from node n to a
neighboring node m, d_mn, is defined as

d_mn = 1 − log P_mn,   (9.2)

where Pmn is the fraction of a propagation with destination m emanating from n.


From the perspective of a chosen source node v, the set of shortest paths, in terms of
effective distance, to all other nodes constitutes a shortest-path tree rooted at v.
Brockmann and Helbing [24] empirically showed that the propagation process initiated
from node v on the original network can be represented as wavefronts on this
shortest-path tree. To illustrate this process, a simple example is shown in Fig. 9.2
(see [24]). According to the propagation process of the wavefronts, the spreading
concentricity can only be observed from the perspective of the propagation source.
Then, the node which has the minimum standard deviation and mean of effective
distances to the nodes in the observed wavefront is considered as the source node.
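A sketch of the resulting procedure: compute effective distances by Dijkstra's algorithm using Eq. (9.2) as edge weights, then score each candidate by the mean plus standard deviation of its effective distances to the wavefront. The exact way of combining mean and standard deviation into one score is our simplification; [24] treats them as separate criteria.

```python
import heapq
import math

def effective_distances(P, src):
    """Dijkstra over effective distances d_mn = 1 - log(P_mn) (Eq. 9.2).
    P[n][m] is the fraction of flow leaving n that goes to m."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, math.inf):
            continue                     # stale heap entry
        for m, p in P[n].items():
            nd = d + 1.0 - math.log(p)
            if nd < dist.get(m, math.inf):
                dist[m] = nd
                heapq.heappush(heap, (nd, m))
    return dist

def locate_source(P, wavefront):
    """Candidate from whose perspective the wavefront looks most
    concentric (smallest mean + standard deviation of distances)."""
    def score(v):
        d = [effective_distances(P, v)[w] for w in wavefront]
        mean = sum(d) / len(d)
        var = sum((x - mean) ** 2 for x in d) / len(d)
        return mean + math.sqrt(var)
    return min(P, key=score)
```

On a small star-shaped flow network whose three leaves form the observed wavefront, the hub is the most concentric candidate and is returned.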

Fig. 9.2 Illustration of wavefronts in the shortest-path tree rooted at v. Readers can refer to the
work "The Hidden Geometry of Complex, Network-driven Contagion Phenomena" [24] for the
details of the wavefronts

The information propagation process in networks is complex and network-


driven. The combined multiscale nature and intrinsic heterogeneity of real-world
networks make it difficult to develop an intuitive understanding of these processes.
Brockmann and Helbing [24] reduce the complex spatiotemporal patterns to a
simple wavefront propagation process by using effective distance.
To use the effective distance based method for source identification, we need
to compute the shortest distances from any suspicious source to the observed
infected nodes. This leads to high computational complexity, especially for large-
scale networks.

9.1.3 Methods Based on Sensor Observations

In the real world, a further strategy to identify propagation sources is based on


sensors in networks. The sensors report the direction in which information transmits
through them, and the time at which the information arrives at them. There are two
techniques developed in this category: statistics and greedy rules (see the taxonomy
in Fig. 9.1).

9.1.3.1 Bayesian Estimator

Distinguished from the DMP method which adopts the message-passing propaga-
tion model (see Sect. 9.1.2.2), Altarelli et al. [7] proposed using the Bayesian belief
propagation model to compute the probabilities of each node being at any state. This
method can work with different types of observations and in different propagation
scenarios; however, guaranteed accuracy is obtained only in tree-like networks. This
method consists of three steps. First, the propagation of rumors is represented by the SI,
SIR, or other isomorphic models [176]. Second, given an observation of the infection
of a network, either through a group of sensors or a snapshot at an unknown time, the
belief propagation equations are derived for the posterior distribution of past states
on all network nodes. By constructing a factor graph based on the original network,
these equations provide the exact computation of posterior marginal in the models.
Third, belief propagation equations are iterated with time until they converge. Nodes
are then ranked according to the posterior probability of being the source.
This method provides exact identification of the source in tree-like networks.
It is also effective for synthetic and real networks with cycles, both in static and
dynamic contexts, and for more general networks, such as DTNs [204].
It relies on the belief propagation model in order to be usable with different
observations and in various scenarios.

9.1.3.2 Gaussian Estimator

Assuming propagation follows SI model in tree-like networks, Pinto et al. [146]


proposed a Gaussian method for single source identification. They also assume that
the propagation delay on each edge is random, with the delays being independent
and identically distributed with a Gaussian distribution. This method is divided into
two steps. In the first step, they reduce the scale of seeking origins. According to
the direction in which information arrived at the sensors, it uniquely determines a
subtree Ta . The subtree Ta is guaranteed to contain the propagation origin [146]. In
the second step, they use the following Gaussian technique to seek the source in Ta .
On the one hand, given a sensor node o1 , they calculate the ‘observed delay’ between
o1 and the other sensors. On the other hand, assuming an arbitrary node s ∈ Ta as
the source, they calculate the ‘deterministic delay’ for every sensor node relative to
o1 by using the deterministic propagation time of the edges. The node, which can
minimize the distance between the ‘observed delays’ and the ‘deterministic delays’
of sensor nodes, is considered as the propagation source.
This method is considered on tree-like networks. For generic networks, Pinto et
al. [146] assume that information spreads along the BFS tree, and search for the
rumor source in BFS trees. The method has been improved by combining community
recognition techniques in order to reduce the number of sensors deployed in networks:
by choosing nodes that lie between communities and have high betweenness values as
sensors, Louni et al. [112] deploy about 3% fewer sensors than the original method [146].
For generic networks, the Gaussian estimator is of complexity O(N 3 ). Again, it
is too computationally expensive to use this method for large-scale networks.

9.1.3.3 Monte Carlo Method

Agaskar and Lu [2] proposed a fast Monte Carlo method for source identification in
generic networks. They assume propagation follows the heterogeneous SI model in
which the infection probabilities between any two neighboring nodes are different.
In addition, the observation of sensors is obtained in a fixed time window. This
method consists of two steps. In the first step, assuming an arbitrary node as the
source, they introduce an alternate representation for the infection process initiated
from the source. The alternate representation is derived in terms of the infection
time of each edge. Based on the alternate representation, they sample the infection
time for each sensor. In the second step, they compute the gap between the observed
infection time and the sampled infection time of sensors. They further use the Monte
Carlo approach to approximate the gap. The node which can minimize the gap is
considered as the propagation origin.
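The flavour of the approach can be sketched with a simplified Monte Carlo gap estimator: sample random edge delays, propagate them along fastest paths, and score each candidate by the squared gap to the observed sensor times. The exponential delay model and the mean-squared-gap score are our simplifications of Agaskar and Lu's construction.

```python
import heapq
import random

def sampled_arrivals(adj, src, rng):
    """One Monte Carlo sample: draw an exponential delay per edge and run
    Dijkstra, giving each node's earliest infection time from `src`.
    adj[u][v] is the infection rate of edge (u, v)."""
    t = {src: 0.0}
    heap = [(0.0, src)]
    delays = {}
    while heap:
        d, u = heapq.heappop(heap)
        if d > t.get(u, float('inf')):
            continue
        for v, rate in adj[u].items():
            e = tuple(sorted((u, v)))
            if e not in delays:                  # one delay per edge per sample
                delays[e] = rng.expovariate(rate)
            nd = d + delays[e]
            if nd < t.get(v, float('inf')):
                t[v] = nd
                heapq.heappush(heap, (nd, v))
    return t

def mc_source(adj, observed, samples=400, seed=1):
    """Node minimizing the mean squared gap between sampled and observed
    sensor infection times (observed: {sensor: time})."""
    rng = random.Random(seed)
    def gap(s):
        g = 0.0
        for _ in range(samples):
            t = sampled_arrivals(adj, s, rng)
            g += sum((t[o] - ot) ** 2 for o, ot in observed.items())
        return g
    return min(adj, key=gap)
```

On a five-node path with unit-rate edges and both end sensors reporting time 2, the middle node minimizes the expected gap and is returned.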
The computational complexity of this method is O(LN log(N)/ε), where L is the
number of sensor nodes and ε is the assumed error. This complexity is lower than
that of other source identification methods, which are normally O(N²) or even O(N³).
When sampling infection time for each edge, Agaskar and Lu [2] assume that
information always spreads along the shortest paths to other nodes. However, in the
real world, information generally reaches other nodes by a random walk. Therefore,
this method may not be suitable for other propagation schemes, such as random
spreading or multicast spreading.

9.1.3.4 Moon-Walk Method

Xie et al. proposed a post-mortem technique on traffic logs to seek the origin of
a worm (a kind of computer virus) [191]. There are four assumptions for this
technique. First, it focuses on scanning worms [181]. This kind of worm spreads
on the Internet by exploiting OS vulnerabilities; victims then proceed to scan
the whole IP space for vulnerable hosts. Famous examples include Code
Red [206] and Slammer [126]. Second, logs of infection from sensors cover the
majority of the propagation processes. Third, the worm propagation forms a tree-
like structure from its origin. Last, the attack flows of a worm do not use spoofed
source IP addresses. Based on traffic logs, the network communication between
end-hosts is modeled by a directed host contact graph. Propagation paths are then
created by sampling edges from the graph according to the time of corresponding
logs. The creation of each path stops when there is no contiguous edge within t
seconds to continue the path. As the sampling is performed, a count is kept of how
many times each edge from the contact graph is traversed. If the worm propagation
follows a tree-like structure, the edge with maximum count will most likely be the
top of the tree. The start of this directed edge will be considered as the propagation
source.
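The backward-walk sampling can be sketched as follows; the flow-log format and parameter names are our own, and the real Moonwalk system operates on far larger logs with additional filtering.

```python
import random

def moonwalk_origin(flows, delta, walks=2000, seed=7):
    """Moonwalk-style sketch: random backward walks over a timestamped
    host-contact graph. The most traversed edge approximates the top of
    the infection tree; its start host is reported as the origin.
    flows: list of (time, src_host, dst_host) tuples."""
    rng = random.Random(seed)
    by_dst = {}
    for f in flows:
        by_dst.setdefault(f[2], []).append(f)
    counts = {}
    for _ in range(walks):
        edge = rng.choice(flows)            # start from a random flow
        while True:
            counts[edge] = counts.get(edge, 0) + 1
            t, src, _ = edge
            # flows into `src` at most `delta` seconds earlier
            prev = [f for f in by_dst.get(src, []) if t - delta <= f[0] < t]
            if not prev:
                break                       # no contiguous earlier flow
            edge = rng.choice(prev)
    top = max(counts, key=counts.get)
    return top[1]
```

On a toy log where host A starts an infection chain A→B→C→D amid unrelated traffic, every walk that reaches the chain ends at the edge (A, B), so A is reported as the origin.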
Several aspects of this technique need further analysis. First, it is reasonable to
assume that worms do not use IP spoofing: in the real world, the overwhelming
majority of worm traffic involved in a propagation is initiated by victims instead of
the original attacker, and spoofed IP addresses would only decrease the number of
successful attacks without providing further anonymity to the attacker. Second, IP
traceback techniques [157] are related to Moonwalk and the other methods discussed
here. However, traceback on its own is not sufficient to track worms to their origin,
as it only determines the true source of the IP packets received by a destination. In
an epidemic attack, the source of these packets is almost never the origin of the
attack, but just one of the infected victims; methods such as Moonwalk are still
needed to find the hosts higher up in the propagation causal tree. Third, this method
relies only on traffic logs, which gives it the ability to work without any a priori
knowledge about the worm attack.
Nowadays, the number of scanning worms has largely decreased due to advances
in OS development and security techniques [189]. Therefore, the usage of Moonwalk,
which can only seek the propagation origin of scanning worms, is largely
limited. Moreover, a full collection of infection logs is hardly achievable in the real
world. Finally, current computer viruses are normally distributed by botnets [205];
Moonwalk, which can only seek a single origin, may not be helpful in this scenario.

9.1.3.5 Four-Metric Method

Seo et al. [158] proposed a four-metric source estimator to identify single source
node in directed networks. They assume propagation follows the SI model. The
sensor nodes that transition from the susceptible state to the infected state are regarded
as positive sensors; the others are considered negative sensors. Seo et al. [158]
use the intuition that the source node must be close to the positive sensor nodes but
far away from the negative sensor nodes. They propose four metrics to locate the
source. First, they find the set of nodes that can reach all positive sensors.
Second, they filter this set by choosing the nodes with the minimum sum of
distances to all positive sensor nodes. Third, they further choose the nodes that can
reach the minimum number of negative sensor nodes. Finally, the node which
satisfies all of the above three metrics and has the maximum sum of distances to all
negative sensor nodes is considered the source node.
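The four filters chain naturally in code. The sketch below uses BFS hop distances on a directed adjacency dict; treating unreachable negative sensors as contributing distance 0 in the last metric is our own tie-breaking choice, not a detail from [158].

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src along directed edges."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def four_metric_source(adj, positive, negative):
    """Apply the four metrics in order and return the surviving node."""
    dists = {v: bfs_dist(adj, v) for v in adj}
    # Metric 1: must reach every positive sensor
    cand = [v for v in adj if all(p in dists[v] for p in positive)]
    # Metric 2: minimum total distance to the positive sensors
    best = min(sum(dists[v][p] for p in positive) for v in cand)
    cand = [v for v in cand if sum(dists[v][p] for p in positive) == best]
    # Metric 3: reach the fewest negative sensors
    fewest = min(sum(1 for n in negative if n in dists[v]) for v in cand)
    cand = [v for v in cand if sum(1 for n in negative if n in dists[v]) == fewest]
    # Metric 4: maximum total distance to the reachable negative sensors
    return max(cand, key=lambda v: sum(dists[v].get(n, 0) for n in negative))
```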
Seo et al. [158] studied and compared different methods of choosing sensors,
such as randomly choosing (Random), choosing the nodes with high betweenness
centrality values (BC), choosing the nodes with a large number of incoming edges
(NI), and choosing the nodes which are at least d hops away from each other (Dist).
Different sensor selection methods produce different sets of sensor nodes, and have
different accuracies in source identification. They show that the NI and BC sensor
selection methods outperform the others.
The four-metric source estimator needs to compute the shortest paths from
the sensors to every potential source. Generally, the computational complexity is
O(N³), which is too expensive for large-scale networks.

9.2 Numerical Analysis

In order to gain a numerical understanding of the existing methods of source
identification, we examine the methods under different experimental environments.
Furthermore, we analyze potential impact factors on the accuracy of source iden-
tification. We test the methods on both synthetic and real-world networks. All the
experiments were conducted on a desktop computer running Microsoft Windows 7
with 2 CPUs and 4 GB of memory. The implementation was done in Matlab 2012.
For each category of observation, we examined one or two typical source
identification methods; five methods were examined in total. For complete obser-
vation, we tested the rumor center method and the dynamic-age method. For
snapshots of networks, we tested the Jordan center method and the DMP method.
The Gaussian source estimator was examined for sensor observation.
In the experiments, we typically set the infection probability q to 0.75 and the
recovery probability p to 0.5. We randomly choose a node as a source to initiate
a propagation, and then average the error distance δ between the estimated sources
and the true sources over 100 runs.
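The evaluation loop just described can be sketched as follows. This is a hypothetical Python harness, not the Matlab code used in the experiments; `simulate` and `estimate` stand for any propagation simulator and any source identification method.

```python
import random
from collections import deque

def hop_distance(adj, a, b):
    """Shortest hop count between nodes a and b in an undirected adjacency dict."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float('inf')  # b unreachable from a

def average_error_distance(adj, simulate, estimate, runs=100, seed=0):
    """Average the error distance delta over repeated random-source experiments."""
    rng = random.Random(seed)
    nodes = list(adj)
    total = 0
    for _ in range(runs):
        true_src = rng.choice(nodes)                # a random node initiates a spread
        observation = simulate(adj, true_src, rng)  # observe the propagation
        total += hop_distance(adj, true_src, estimate(adj, observation))
    return total / runs
```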

Fig. 9.3 Sample topologies of synthetic networks. (a) 3-regular tree; (b) small-world network

9.2.1 Comparison on Synthetic Networks

In this subsection, we first compare the performance of different source identification
methods on synthetic networks. Then, we study three potential impact factors
(network topology, propagation scheme and infection probability) on the accuracies
of the methods.

9.2.1.1 Crosswise Comparison

We conducted experiments on two synthetic networks: a regular tree [160] and a
small-world network [180]. Figure 9.3a, b show example topologies of a 3-regular
tree and a small-world network. Figure 9.4a shows the frequency of error distances
δ of the different methods on a 4-regular tree. We can see that the sources
estimated by the DMP method and the Jordan center method are the closest to the
true sources, with an average of 1.5–2 hops away. The rumor center method and
the Gaussian method estimate sources an average of 2–3 hops away from
the true sources. The sources estimated using the dynamic age method were the
farthest from the true sources. Figure 9.4b shows the performance of the different
methods on a small-world network. It is clear the Jordan center method outperforms
the others, with estimated sources around 1 hop away from the true sources. The
DMP method also performs well, with estimated sources an average of 1–2 hops
away from the true sources. The dynamic age method and the Gaussian method
show the worst performance.
From the experiment results on the regular tree and small-world network, we can
see that the DMP method and the Jordan center method have better performance
than the other methods.

9.2.1.2 The Impact of Network Topology

From Sect. 9.1, we see that some existing methods of source identification are
considered on tree-like networks. In the previous subsection, we have shown the

Fig. 9.4 Crosswise comparison of existing methods on two synthetic networks. (a) 4-regular tree,
(b) Small-world network

results of the methods implemented on regular trees and small-world networks. In order
to analyze the impact of network topology on the methods, we introduce two further
network topologies: random trees and regular graphs. We then conduct
performance evaluation on these two topologies.
Figure 9.5a shows the experiment results of methods on a random tree. It is clear
the Jordan center method has the best performance, with estimated sources around
2 hops away from the true sources. The rumor center method and the dynamic
age method show similar performance, with estimated sources around 3 hops away
from the true sources. The DMP method and the Gaussian method have the worst
performance. Figure 9.5b shows the experiment results of methods on a regular
graph. It shows that sources estimated by using the Jordan center method and the
DMP method were the closest to the true sources. The sources estimated by the
rumor center method were the farthest from true sources. The dynamic age method
and the Gaussian method also show poor performance in this scenario.
From the experiment results on the four different network topologies, we can see
the source identification methods are sensitive to network topology.

9.2.1.3 The Impact of Propagation Scheme

From Sect. 9.1, we note that some existing methods of source identification are
based on the assumption that information propagates along the BFS trees in
networks. This means propagation follows the broadcast scheme. However, in the
real world, propagation may follow various schemes. We focus on the three
most common propagation schemes: snowball, random walk and contact process
[34]. Their definitions are given below.

Fig. 9.5 The impact of network topologies. (a) Random tree, (b) Regular graph

Fig. 9.6 Illustration of different propagation schemes. The black node stands for the source.
The numbers indicate the hierarchical sequence of nodes getting infected. (a) Random walk, (b)
Contact process, (c) Snowball

• Random Walk: A node can deliver a message randomly to one of its neighbors.
• Contact Process: A node can deliver a message to a group of its neighbors that
have expressed interest in receiving the message.
• Snowball Spreading: A node can deliver a message to all of its neighbors.
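A single synchronous infection step under each of the three schemes can be sketched as follows (a minimal Python sketch; the `fanout` parameter bounding the contact-process group size is our own simplification of "a group of interested neighbors"):

```python
import random

def spread_step(adj, infected, scheme, rng, fanout=2):
    """One synchronous step: every infected node contacts neighbors per `scheme`."""
    new = set()
    for u in infected:
        nbrs = [v for v in adj[u] if v not in infected]
        if not nbrs:
            continue
        if scheme == 'random_walk':   # deliver to one randomly chosen neighbor
            new.add(rng.choice(nbrs))
        elif scheme == 'contact':     # deliver to a bounded group of neighbors
            new.update(rng.sample(nbrs, min(fanout, len(nbrs))))
        elif scheme == 'snowball':    # deliver to all neighbors
            new.update(nbrs)
    return infected | new
```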
An illustration of these three propagation schemes is shown in Fig. 9.6. We examine
different propagation schemes on both regular trees and small-world networks.
Figure 9.7a shows the experiment results of the methods with propagation
following the random-walk propagation scheme on a 4-regular tree. It is clear the
Gaussian source estimator outperforms the others, with estimated sources around 1–
2 hops away from the true sources. The performances of the rumor center method,
the dynamic age method and the Jordan center method are similar to each other, with
estimated sources around 5 hops away from the true sources. The DMP method has
the worst performance. Figure 9.8a shows experiment results of the methods with
propagation following the contact-process propagation scheme on a 4-regular tree.
It is clear the results in Figs. 9.7a and 9.8a are similar to each other. This means the
methods have similar performances on both the random-walk and contact-process

Fig. 9.7 The impact of propagation schemes: random-walk scheme. (a) 4-regular tree, (b) Small-
world network

Fig. 9.8 The impact of propagation schemes: contact-process scheme. (a) 4-regular tree, (b)
Small-world network

propagation schemes. Figure 9.9a shows the experiment results of the methods
with propagation following the snowball propagation scheme on a 4-regular tree.
The results show a big difference from the results of the previous two propagation
schemes. The DMP method and the Jordan center method outperformed the others,
with estimated sources around 1–2 hops away from the true sources. The rumor
center method and the Gaussian method also showed good performances, with
estimated sources around 2–3 hops away from the true sources. The dynamic age
method had the worst performance.
The experiment results of the methods with propagation following different
propagation schemes on a small-world network are shown in Figs. 9.7b, 9.8b,
and 9.9b. The results are dramatically different from the results on the 4-regular

Fig. 9.9 The impact of propagation schemes: snowball scheme. (a) 4-regular tree, (b) Small-world
network

tree. From Fig. 9.7b we can see the Gaussian source estimator obtains the best
performance, followed by the DMP method. The rumor center method, the dynamic
age method and the Jordan center method perform no better than choosing sources
at random. From Fig. 9.8b, it is clear the Jordan center method, the DMP method and
the Gaussian method show similar performances. These three methods outperform
the others. From Fig. 9.9b we can see the Jordan center method outperforms the
others, with estimated sources around 1 hop away from the true sources. The sources
estimated using the DMP method are around 1–2 hops away from the true sources.
The Gaussian source estimator has the worst performance.
From the experiment results, we see the source identification methods are also
sensitive to propagation schemes. The methods of source identification show better
performance when propagation follows the snowball propagation scheme rather
than the random-walk or contact-process propagation schemes.

9.2.1.4 The Impact of Infection Probability

In this subsection, we analyze the impact of infection probability on the
accuracy of source identification. We vary the infection probability from 0.5 to 0.95.
The experiment results are shown in Fig. 9.10a, b. From these figures, we can
see that the rumor center method has similar performance when we change the
infection probability. The same holds for the dynamic age method,
the Jordan center method and the Gaussian method. The DMP method performs
best when the infection probability q is equal to 0.5, and its accuracy declines when
q increases to 0.95. Among the experiment results, the Jordan center method and the

Fig. 9.10 The impact of infection probability. (a) q = 0.5, (b) q = 0.95


Fig. 9.11 Sample topologies of two real-world networks. (a) Enron email network, (b) Power grid
network

DMP method outperform the other methods, with estimated sources around 1 hop
away from the true sources. The dynamic age method and the Gaussian method have
the worst performance.
From the experiment results, we can see only the DMP method is sensitive to the
infection probability and performs better when the infection probability is lower.
The other methods show only slight differences in their performance when applied
with various infection probabilities.

Fig. 9.12 Source identification methods applied on real networks. (a) Enron email, (b) Power grid

9.2.2 Comparison on Real-World Networks

In this subsection, we examine the methods of source identification on two real-world
networks. The first is an Enron email network [88]. This network has 143
nodes and 1246 edges; on average, each node has 8.71 edges. Therefore, the Enron
email network is a dense network. The second is a power grid network [4]. This
network has 4941 nodes and 6594 edges; on average, each node has 1.33 edges.
Therefore, the power grid network is a sparse network. Sample topologies of these
two real-world networks are shown in Fig. 9.11.
Figure 9.12a shows the frequency of error distance δ of different methods
on the Enron email network. We can see the rumor center method, the Jordan
center method and the dynamic age method outperform the others. The DMP
method has the worst performance. The Enron email network is small and dense,
so complete observation of this network is feasible, and
the identification accuracy is also acceptable. Figure 9.12b shows the experiment
results on the power grid network. It is clear the Jordan center method and the DMP
method outperform the others, with estimated sources around 1–2 hops away from
the true sources. The rumor center method and the Gaussian method show similar
performance, with estimated sources around 2–4 hops away from the true sources.
The dynamic age method has the worst performance.
From the experiment results, we can see the accuracies of the methods are greatly
different between these two real-world networks. For the Enron email network, the
rumor center method and the dynamic age method outperform the other methods,
while the DMP method has the worst performance. However, for the power grid
network, the DMP method and the Jordan center method have the best performance.

9.3 Summary

We summarize the source identification methods in this section. Based on
Sect. 9.1, it is clear that current methods rely on either topological centrality
measures or measures of the distance between the observations and mathematical
estimations of the propagation.
In Table 9.1, we summarize seven features of the methods discussed in this chapter.
A detailed summary of each feature is elaborated as follows:
1. Topology: As shown in Table 9.1, a significant portion of current methods focus
on tree-like topologies. These methods can deal with generic network topologies by
using the BFS technique to reconstruct generic networks into trees. According to
the comparative studies in Sect. 9.1, methods on different topologies show a great
variety of accuracy in seeking origins.
2. Observation: Based on the analysis in Sect. 9.1, the category of observation is
not a deterministic factor in the accuracy of source identification. The accuracy
of each method varies according to different conditions and scenarios. In the
real world, complete observation is generally difficult to achieve. Snapshot and
sensor observations are normally more realistic.

Table 9.1 Summary of current source identification methods

Method               Topology  Observation  Model    Number of sources  Infection probability  Time delay  Complexity
Single rumor center  Tree      Complete     SI       Single             HM/HT                  Constant    O(N^2)
Local rumor center   Tree      Complete     SI       Single             HM                     Constant    O(N^2)
Multi rumor centers  Tree      Complete     SI       Multiple           HM                     Constant    O(N^k)
Eigenvector center   Generic   Complete     SI       Multiple           HM                     Constant    O(N^3)
Jordan center        Tree      Snapshot     SI(R/S)  Single             HM/HT                  Constant    O(N^3)
DMP                  Generic   Snapshot     SIR      Single             HT                     Constant    O(t0 N^2 d)
Effective distance   Generic   Snapshot     SI       Single             HT                     Constant    O(N^3)
Gaussian             Tree      Sensor       SI       Single             HT                     Variable    O(N^3)
Monte Carlo          Generic   Sensor       SIR      Single             HT                     Variable    O(N log N/ε^2)
Four-metrics         Generic   Sensor       SI       Single             HT                     Variable    O(N^3)

Note: HM and HT represent homogeneous and heterogeneous, respectively

3. Model: The majority of methods employ the SI model to present the propagation
dynamics of risks. The SI model only considers the susceptible and infected
states of nodes, regardless of the recovery process. The extension to SIR/SIS
increases the complexity of source identification methods. The Jordan center and
Monte Carlo methods are based on SIR/SIS models. In particular, the Bayesian
source estimator can be used in scenarios with various propagation models, as the
belief propagation approach can estimate the probabilities of node states under
various conditions.
4. Source: Most methods focus on single source identification. The multi-rumor
center method and eigenvector center method can be used to identify multiple
sources. However, these two methods are too computationally expensive to be
implemented. In the real world, risks are normally distributed from multiple
sources. For example, attackers generally employ a botnet which contains
thousands of victims to help spread the computer virus [14, 61]. For source
identification, these victims are the propagation origins.
5. Probability: For simplicity, earlier methods consider the infection probabilities to
be identical across the edges of a network. Later, most methods were extended to
allow varied infection probabilities among different edges. Notably, this extension
makes source identification methods more realistic.
6. Time Delay: Only the methods under sensor observation consider time delays on
edges. The time delay of risks is an important factor in their propagation [38].
It is important to consider time delay in source identification techniques.
7. Complexity: Most current methods are too computationally expensive to quickly
capture the sources of propagation. The complexity ranges from O(N log N/ε^2)
to O(N^k). In fact, the complexity of a method dominates the speed of seeking
origins. Quickly identifying propagation sources is in most cases of great
significance in the real world, such as when capturing the culprits behind rumors.
Future work is needed to improve identification speed.
Part III
Critical Research Issues in Source
Identification
Chapter 10
Identifying Propagation Source
in Time-Varying Networks

Identifying the propagation sources of malicious attacks in complex networks plays
a critical role in limiting the damage they cause, through the timely quarantine
of the sources. However, the temporal variation in the topology of the underlying
networks and the ongoing dynamic processes challenge traditional source
identification techniques, which were designed for static networks. In this chapter,
we introduce an effective approach, inspired by criminology, to overcome these
challenges. For simplicity, we use rumor source identification to present the approach.

10.1 Introduction

Rumor spreading in social networks has long been a critical threat to our society
[143]. Nowadays, with the development of mobile devices and wireless techniques,
the temporal characteristic of social networks (time-varying social networks) has
deeply influenced the dynamic information diffusion process occurring on top of
them [151]. The ubiquity and easy access of time-varying social networks not only
promote the efficiency of information diffusion but also dramatically accelerate the
speed of rumor spreading [91, 170].
For either forensic or defensive purposes, it has always been a significant task
to identify the source of rumors in time-varying social networks [42]. However,
the existing techniques for rumor source identification generally require firm
connections between individuals (i.e., static networks), so that administrators can
trace back along the determined connections to reach the diffusion sources. For
example, many methods rely on identifying spanning trees in networks [160, 178],
then the roots of the spanning trees are regarded as the rumor sources. The firm
connections between users are the premise of constructing spanning trees in these
methods. Some other methods detect rumor sources by measuring node centralities,
such as degree, betweenness, closeness, and eigenvector centralities [146, 201]. The

© Springer Nature Switzerland AG 2019
J. Jiang et al., Malicious Attack Propagation and Source Identification,
Advances in Information Security 73, https://doi.org/10.1007/978-3-030-02179-5_10

individual who has the maximum centrality value is considered as the rumor source.
All of these centrality measures are based on static networks. Time-varying social
networks, where the involved users and interactions always change, have led to great
challenges to the traditional rumor source identification techniques.
In this chapter, a novel source identification method is proposed to overcome these
challenges. It consists of the following three steps: (1) To represent a time-varying
social network, we reduce it to a sequence of static networks, each aggregating
all edges and nodes present in a time-integrating window. This is the case, for
instance, for rumors spreading in Bluetooth networks, for which fine-grained
temporal resolution is not available and spreading can be studied through
different integrating windows t (e.g., t could be minutes, hours, days or even
months). In each integrating window, if users did not activate the Bluetooth on their
devices (i.e., offline), they would not receive or spread the rumors. Likewise, if they
moved out of the Bluetooth coverage of their communities (i.e., physical mobility),
they would not receive or spread the rumors. (2) Similar to the detective routine in criminology, a
small set of suspects will be identified by adopting a reverse dissemination process
to narrow down the scale of the source seeking area. The reverse dissemination
process distributes copies of rumors reversely from the users whose states have been
determined based on various observations upon the networks. The ones who can
simultaneously receive all copies of rumors from the infected users are supposed
to be the suspects of the real sources. (3) To determine the real source from the
suspects, we employ a microscopic rumor spreading model to analytically estimate
the probabilities of each user being in different states in each time window. Since
this model allows the time-varying connections among users, it can feature the
dynamics of each user. More specifically, assuming any suspect as the rumor source,
we can obtain the probabilities of the observed users to be in their observed states.
Then, for any suspect, we can calculate the maximum likelihood (ML) of obtaining
the observation. The one who can provide the maximum ML will be considered as
the real rumor source.

10.2 Time-Varying Social Networks

In this section, we introduce the primer for rumor source identification in time-
varying social networks, including the features of time-varying social networks, the
state transition of users when they hear a rumor, and the categorization of partial
observations in time-varying social networks.

10.2.1 Time-Varying Topology

The essence of social networks lies in its time-varying nature. For example, the
neighborhood of individuals moving over a geographic space evolves over time

Fig. 10.1 Example of rumor spreading in a time-varying network. The rumor spreader is located
on the black node, and the rumor can travel on the links depicted as line arrows in the time windows.
Dashed lines represent links that are present in the system in each time window

(i.e., physical mobility), and the interaction between the individuals appears and
disappears in online social networks (i.e., online/offline) [151]. Time-varying social
networks are defined by an ordered stream of interactions between individuals. In
other words, as time progresses, the interaction structure keeps changing. Examples
can be found in both face-to-face interaction networks [27], and online social
networks [170]. The temporal nature of such networks has a deep influence on
information spreading on top of them. Indeed, the spreading of rumors is affected
by duration, sequence, and concurrency of contacts among people.
Here, we reduce time-varying networks to a series of static networks by
introducing a time-integrating window. Each integrating window aggregates all
edges and nodes present in the corresponding time duration. In Fig. 10.1, we show
an example to illustrate the time-integrating windows. In the time window t − 1 (or,
at time t − 1), a rumor started to spread from node S who had interaction with five
neighbors in this time window. In the next time window t, nodes B, D and F were
successfully infected. In this time window, we notice that node O moved next to
B (i.e., physical mobility), and node G had no interaction with its neighbors (i.e.,
offline). Other examples of physical mobility or online/offline status of nodes can be
found in the time window t + 1.
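The reduction to a series of static networks can be sketched as follows (a minimal Python sketch, assuming the raw interaction stream is given as timestamped contacts `(t, u, v)`):

```python
from collections import defaultdict

def aggregate_windows(contacts, window):
    """Group timestamped contacts (t, u, v) into one static graph per window.

    Window k aggregates every edge whose timestamp falls in
    [k*window, (k+1)*window); returns {k: adjacency dict}.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    for t, u, v in contacts:
        k = t // window
        graphs[k][u].add(v)   # contacts are treated as undirected
        graphs[k][v].add(u)
    return {k: {u: sorted(nbrs) for u, nbrs in g.items()}
            for k, g in graphs.items()}
```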

10.2.2 Security States of Individuals

For the convenience of description, we borrow the notions from epidemiology to
describe the spreading of rumors in time-varying social networks [207]. We say a
user is infected when he/she accepts the rumors, and an infected user is recovered
if he/she abandons the rumors. In this chapter, we adopt the classic susceptible-
infected-recovered (SIR) scheme to present the infection dynamics of each user.

Fig. 10.2 State transition of a node in the rumor spreading model

Figure 10.2 shows the state transition graph of an arbitrary user in this model. Every
user is initially susceptible (Sus.). They can be infected (Inf.) by their neighbors
with probability v(i, t), and then recover (Rec.) with probability q(i). Rumors will
be spread out from infected users to their social neighbors until they get recovered.
There are also many other models of rumor propagation, including the SI, SIS and SIRS
models [117, 128]. In the present work, we adopt the SIR model because it can reflect
the state transition of users when they hear a rumor, from being susceptible to being
recovered. Generally, people will not believe a rumor again after they know the
truth. Therefore, recovered users will not change their states any more. For other
propagation models, readers can refer to Sect. 10.6 for further discussion.
To more precisely describe node states under different types of observations, we
introduce two sub-states of infected nodes: ‘contagious’ (Con.) and ‘misled’
(Mis.), see Fig. 10.2. An infected node first becomes contagious and then transits
to being misled. The Con. state describes newly infected nodes. More
specifically, a node being Con. at time t means this node was susceptible at time t − 1 but
becomes infected at time t. A misled node will stay infected until it recovers.
For instance, sensors can record the time at which they get infected, and the infection
time is crucial in detecting rumor sources because it reflects the infection trend and
speed of a rumor. Hence, the introduction of the contagious and misled states is intrinsic
to the rumor spreading framework.
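The state transitions of Fig. 10.2, including the contagious and misled sub-states, can be sketched as one synchronous update per time window. This is a minimal Python sketch in which the per-user probabilities v(i, t) and q(i) are simplified to the constants `infect_p` and `recover_p`.

```python
import random  # callers construct random.Random instances to pass as `rng`

SUS, CON, MIS, REC = 'Sus', 'Con', 'Mis', 'Rec'

def step(adj, state, infect_p, recover_p, rng):
    """One synchronous update of the SIR scheme with Con./Mis. sub-states."""
    nxt = dict(state)
    infected = {u for u, s in state.items() if s in (CON, MIS)}
    for u, s in state.items():
        if s == SUS:
            # each infected neighbor passes the rumor with probability infect_p
            if any(v in infected and rng.random() < infect_p
                   for v in adj.get(u, ())):
                nxt[u] = CON   # newly infected: contagious in this window
        elif s == CON:
            nxt[u] = MIS       # contagious nodes become misled next window
        elif s == MIS and rng.random() < recover_p:
            nxt[u] = REC       # recovered users never change state again
    return nxt
```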

10.2.3 Observations on Time-Varying Social Networks

Prior knowledge for source identification is provided by various types of partial
observations upon time-varying social networks. Following previous work on
static networks, we consider three categories of partial observations: wavefronts,
snapshots, and sensor observations. We denote the set of observed nodes as O =
{o1, o2, . . . , on}. Following the rumor spreading in Fig. 10.1, we explain each
type of partial observation as follows.

Fig. 10.3 Three types of observations in regards to the rumor spreading in Fig. 10.1. (a)
Wavefront; (b) Snapshot; (c) Sensor observation

Wavefront [24]: Given a rumor spreading incident, a wavefront provides partial
knowledge of the time-varying social network status. Only the users who are in the
wavefront of the spreading can be observed (i.e., all the contagious nodes in the
latest time window are observed). Figure 10.3a shows an example of the wavefront
in the rumor spreading in Fig. 10.1. We see that nodes C, E, I, K and O are in the
wavefront as they transit to being contagious at time t + 1.
Snapshot [115]: Given a rumor spreading incident, a snapshot also provides
partial knowledge of the time-varying social network status. In this case, only a
group of users can be observed in the latest time window when the snapshot is
taken. The states of the observed users can be susceptible, infected or recovered.
We use OS , OI and OR to denote the observed users who are susceptible, infected
or recovered, respectively. This type of observation is the most common one in our
daily life. Figure 10.3b shows an example of the snapshot in the rumor spreading in
Fig. 10.1. We see that OS = {N, Q, T , V }, OI = {F, I, K, O} and OR = ∅.
Sensor Observation [146]: Sensors are a group of pre-selected users in time-
varying social networks. The sensors can record the rumor spreading dynamics over
them, including the security states and the time window when they get infected
(more specifically, become contagious). We introduce OS and OI to denote the set
of susceptible and infected sensors, respectively. For each oi ∈ OI , the infection
time is denoted by ti . This type of observation is usually obtained from sensor
networks. Figure 10.3c shows an example of the sensor observations in the rumor
spreading in Fig. 10.1. In this case, OS = {N, P , T , V }, OI = {K, B}, and the
infection time of node K is t + 1, and node B is infected at time t.
We can see that these three types of partial observations provide three different
categories of partial knowledge of the time-varying social network status. Different
types of observations are suitable for different circumstances in real-world
applications. Readers can refer to [24, 146, 201] for further discussion on different
types of partial observations. The partial knowledge, together with the time-varying
characteristics of social networks, makes tracing back rumor sources much
more difficult.

10.3 Narrowing Down the Suspects

Current methods of source identification need to scan every node in the underlying
network. This exposes a bottleneck of identifying rumor sources: scalability. It is
necessary to narrow down a set of suspects, especially in large-scale networks. In
this section, we develop a reverse dissemination method to identify a small set of
suspects. The details of the method are presented in Sect. 10.3.1, and its efficiency
is evaluated in Sect. 10.3.2.

10.3.1 Reverse Dissemination Method

In this subsection, we first present the rationale of the reverse dissemination method.
Then, we show how to apply the reverse dissemination method into different types
of partial observations on networks.

10.3.1.1 Rationale

The rationale of the reverse dissemination method is to send copies of rumors
along the reversed dynamic connections from observed nodes to exhaust all possible
spreading paths leading to the observation. The node from which all the paths,
covering all the observed nodes’ states, originated is more likely to be a suspect.
The reverse dissemination method is inspired by the Jordan method [201], but
differs from it because our method is based on time-varying social networks
(involving the physical mobility and online/offline status of users) rather than
static networks. In Fig. 10.4, we show a simple example to illustrate the reverse
dissemination process. This example follows

Fig. 10.4 Illustration of the reverse dissemination process in regards to the wavefront observation
in Fig. 10.3a. (a) The observed nodes broadcast labeled copies of rumors to their neighbors in time
window t; (b) The neighbors who received labeled copies will relay them to their own neighbors
in time window t − 1

the rumor spreading in Fig. 10.1 and the wavefront observation in Fig. 10.3a. All
wavefront nodes OI = {E, C, I, K, O} observed in time window t + 1 are labeled
as black in Fig. 10.4a. The whole process is composed of two rounds of reverse
dissemination. In round 1 (Fig. 10.4a), all observed nodes broadcast labeled copies
reversely to their neighbors in time window t. For example, nodes S and O received
copies of node C (S, O ← C), and node D received copies of three observed
nodes C, I and K (D ← C, I, K). In round 2 (Fig. 10.4b), the neighbors who
have received labeled copies will relay them to other neighbors in time window
t − 1. In each round, the labels will be recorded in each relay node. We can see
from Fig. 10.4b that node S has received all copies from all the observed nodes
(S ← C, E, K, I, O). Then, node S is chosen to be a suspect.
We notice that the time at which each observed node starts its reverse
dissemination process varies across the different types of observations. For a wavefront,
since all the observed nodes are supposed to be contagious in the latest time window,
all the observed nodes simultaneously start their reverse dissemination
processes. For a snapshot, the observed nodes stay in their states in the latest time
window; therefore, the reverse dissemination processes also start simultaneously
from all the observed nodes. However, for a sensor observation, because
the infected sensors record their infection time, the starting time of reverse
dissemination for each sensor is determined by ti. More specifically, the latest
infected sensors start their reverse dissemination processes first, then the sensors
infected in the previous time window, and so on until the very first infected sensors.
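The reverse dissemination process can be sketched as follows (a minimal Python sketch; it assumes copies travel one reverse hop per time window and that `windows` lists the per-window adjacency dicts in chronological order, as in the wavefront case where all observed nodes start simultaneously):

```python
def reverse_disseminate(windows, observed):
    """Flood labeled rumor copies backwards through the time windows.

    `windows` lists per-window adjacency dicts in chronological order;
    `observed` holds the nodes observed as contagious in the latest window.
    Returns the nodes that collect copies from every observed node.
    """
    labels = {o: {o} for o in observed}   # labels[u]: copies that reached u
    for adj in reversed(windows):         # walk the windows backwards in time
        updates = {}
        for u, marks in labels.items():
            for v in adj.get(u, ()):
                updates.setdefault(v, set()).update(marks)
        for v, marks in updates.items():
            labels.setdefault(v, set()).update(marks)
    return {u for u, marks in labels.items() if marks == set(observed)}
```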

10.3.1.2 Wavefront

Given a reverse dissemination process starting from an observed node oi , we use


PC (u, t|oi ) to denote the probability of an arbitrary node u to be contagious after
time t, where t denotes the time span of the whole reverse dissemination process.
Let all observed nodes oi start their reverse dissemination processes in the latest
time window. To match the wavefront, it is expected that a suspect u can simultaneously
receive rumor copies from all oi ∈ O (i.e., the rumor copies sent from all observed
nodes can make node u become contagious simultaneously). Mathematically, we
identify those nodes that can provide the maximum likelihood, L(u, t), of being a
suspect receiving copies from all the observed nodes, as in

L(u, t) = Σ_{oi∈O} ln(PC(u, t|oi)).    (10.1)

For the convenience of computation, we adopt the logarithmic function ln(·) in
Eq. (10.1) to derive the maximum likelihood. We use U to denote the set of suspects.
The nodes that provide larger values of L(u, t) are recognized as members of the set U.
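As a concrete illustration, Eq. (10.1) scores a candidate suspect by summing log-probabilities over the observed nodes. A minimal Python sketch; the table `p_contagious` and both function names are hypothetical stand-ins for the P_C values produced by the model in Sect. 10.4.2:

```python
import math

def wavefront_likelihood(u, t, observed, p_contagious):
    """Eq. (10.1): L(u, t) = sum over o in O of ln P_C(u, t | o).

    p_contagious[(u, t, o)] is assumed to hold P_C(u, t | o), the
    probability that node u is contagious after a reverse dissemination
    of length t started from observed node o.
    """
    return sum(math.log(p_contagious[(u, t, o)]) for o in observed)

def top_suspects(nodes, t, observed, p_contagious, k):
    """Keep the k nodes with the largest likelihoods as the suspect set U."""
    scores = {u: wavefront_likelihood(u, t, observed, p_contagious)
              for u in nodes}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```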
124 10 Identifying Propagation Source in Time-Varying Networks

10.3.1.3 Snapshot

To match the snapshot observation (which includes susceptible, infected or recov-


ered nodes), it is expected that a suspect u needs to satisfy the following three
principles at time t. First, copies of rumors disseminated from observed susceptible
nodes oi ∈ OS cannot reach node u at time t (i.e., u is still susceptible). Second,
copies of rumors disseminated from observed infected nodes oj ∈ OI can reach
node u at time t (i.e., u becomes infected). Third, copies of rumors disseminated
from observed recovered nodes ok ∈ OR can arrive at node u before time t (i.e., u
becomes recovered). Again, we employ the maximum likelihood to capture such
nodes, as in
 
L(u, t) = Σ_{oi∈OS} ln(PS(u, t|oi)) + Σ_{oj∈OI} ln(PI(u, t|oj)) + Σ_{ok∈OR} ln(PR(u, t|ok)),    (10.2)

where PS (u, t|oi ), PI (u, t|oi ) and PR (u, t|oi ) denote the probabilities of u to be
susceptible, infected or recovered after time t, respectively, given that the reverse
dissemination started from oi .

10.3.1.4 Sensor

For sensor observations, according to our previous discussion, we let infected sensor
oi ∈ OI start to reversely disseminate copies of the rumor at time tˆi = T − ti ,
where T = max{ti |oi ∈ OI }. We also let the susceptible sensors oj ∈ OS start to
reversely disseminate copies of rumors at time t = 0. To match a sensor observation, it
is expected that a suspect u needs to satisfy the following two principles at time t. First,
copies of rumors disseminated from susceptible sensors oi ∈ OS cannot reach node
u at time t (i.e., node u is still susceptible). Second, copies of rumors disseminated
from all infected sensors oj ∈ OI can be received by node u at time t (i.e., node
u becomes contagious). Mathematically, we determine the suspects by computing
their maximum likelihood, as in

L(u, t) = Σ_{oi∈OI} ln(PC(u, t + t̂i|oi)) + Σ_{oj∈OS} ln(PS(u, t|oj)).    (10.3)

The values of PS (u, t|oi ), PC (u, t|oi ), PI (u, t|oi ) and PR (u, t|oi ) will be
calculated by the model introduced in Sect. 10.4.2. We summarize the reverse
dissemination method in Algorithm 10.1.

Algorithm 10.1: Reverse dissemination


Input: A set of observed nodes O = {o1 , o2 , . . . , on }, a set of infection times of the
observed nodes {t1 , t2 , . . . , tn }, a threshold α, and a threshold tmax .
Initialize: A set of suspects U = ∅, and t1 = . . . = tn = T if O is a snapshot/wavefront,
otherwise T = max{t1 , t2 , . . . , tn }.
for (t starts from 1 to a given maximum value tmax) do
    for (oi: i starts from 1 to n) do
        if (oi has not started to disseminate the rumor) then
            Start to propagate the rumor from user oi separately and independently
            at time t + T − ti.
    for (u: any node in the whole network) do
        if (user u received n separate rumors from O) then
            Compute the maximum likelihood L(u, t) for user u;
            Add user u into the set U.
    if (|U| ≥ αN) then
        Keep the first αN suspects with large maximum likelihoods in U, and
        delete all the other suspects.
        Stop.
Output: A set of suspects U .
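The control flow of Algorithm 10.1 can be sketched on a single static window as follows. This simplification replaces the probabilistic dissemination with deterministic one-hop reverse flooding and uses the round in which a node first collects copies from every observed node as a proxy for the likelihood L(u, t); the function name and this proxy score are illustrative assumptions, not the exact implementation:

```python
def reverse_dissemination(graph, observed, alpha, t_max):
    """Narrow the suspect set U to at most alpha * N nodes.

    graph: dict mapping each node to the set of its neighbours.
    Each observed node floods the network in reverse, one hop per round;
    a node that has received copies from all observed nodes becomes a
    suspect, and earlier qualifying rounds stand in for larger likelihoods.
    """
    reached = {o: {o} for o in observed}      # copies of o delivered so far
    suspects = {}                             # node -> round it qualified
    budget = max(1, int(alpha * len(graph)))
    for t in range(1, t_max + 1):
        for o in observed:                    # one reverse hop per round
            frontier = set()
            for v in reached[o]:
                frontier |= graph[v]
            reached[o] |= frontier
        for u in graph:
            if u not in suspects and all(u in reached[o] for o in observed):
                suspects[u] = t
        if len(suspects) >= budget:           # |U| >= alpha * N: stop
            break
    return sorted(suspects, key=suspects.get)[:budget]
```

On a toy topology in the spirit of Fig. 10.4, where the observed nodes all neighbour node S, the sketch returns S as the single suspect.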

Table 10.1 Comparison of data collected in the experiments


Dataset MIT Sigcom09 Email Facebook
Device Phone Phone Laptop Laptop
Network type Bluetooth Bluetooth WiFi WiFi
Duration (days) 246 5 14 6
# of devices 97 76 143 45,813
# of contacts 54,667 69,189 1246 264,004

10.3.2 Performance Evaluation

We evaluate the performance of the reverse dissemination method in real time-


varying social networks. Similar to Lokhov et al.’s work [108], we consider the
infection probabilities and recovery probabilities to be uniformly distributed in
(0,1), and the average infection and recovery probabilities are set to be 0.6 and 0.3.
We also use α to denote the ratio of suspects over all nodes, α = |U |/N, where N is
the number of all nodes in a time-varying social network. The value of α ranges from
5% to 100%. We randomly choose the real source in each of 100 runs per experiment;
the choice of 100 runs follows [207].
We consider four real time-varying social networks in Table 10.1: The MIT
reality [46] dataset captures communication from 97 subjects at MIT over the
course of the 2004–2005 academic year. The Sigcom09 [145] dataset contains
the traces of Bluetooth device proximity of 76 persons during the SIGCOMM 2009
conference in Barcelona, Spain. The Enron Email [164] dataset contains records of
email conversations from 143 users in 2001. The Facebook [171] dataset contains

Fig. 10.5 Accuracy of the reverse dissemination method in networks. (a) MIT; (b) Sigcom09; (c)
Enron Email; (d) Facebook

communications from 45,813 users between December 29, 2008 and January 3, 2009.
All of these datasets reflect the physical mobility and online/offline features of
time-varying social networks. According to the study in [151], an appropriate temporal
resolution Δt is important to correctly characterize the dynamical processes
on time-varying networks. Therefore, we need to be cautious when choosing the
time interval of size Δt. Furthermore, many social networks have been shown to be
small-world, i.e., the average distance l between any two nodes is small, generally l ≤ 6.
Previous extensive work shows that rumors can spread quickly in social networks,
generally within 6–10 time ticks of propagation (see [42]). Hence, for the datasets
used in this chapter, we uniformly divide each into 6–10 discrete time windows [151].
For other choices of temporal resolution, readers may refer to [151] for further discussion.
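The uniform division into discrete windows amounts to bucketing timestamped contacts. A small sketch; the contact-trace format, a list of `(u, v, timestamp)` tuples, is a hypothetical simplification of the datasets above:

```python
def split_into_windows(contacts, num_windows):
    """Reduce a time-varying network to a sequence of static graphs.

    contacts: list of (u, v, timestamp) tuples.
    Returns num_windows edge sets; window w holds every contact whose
    timestamp falls into the w-th uniform slice of the trace duration.
    """
    t_min = min(t for _, _, t in contacts)
    t_max = max(t for _, _, t in contacts)
    width = (t_max - t_min) / num_windows or 1  # guard zero-length traces
    windows = [set() for _ in range(num_windows)]
    for u, v, t in contacts:
        idx = min(int((t - t_min) / width), num_windows - 1)
        windows[idx].add((u, v))
    return windows
```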
Figure 10.5 shows the experiment results on the four real datasets. We find that the
proposed method works quite well in reducing the number of suspects. Especially
for sensor observations, the searching scale can be narrowed to 5% of all users for the MIT
dataset, 15% for the Sigcom09 dataset, and 20% for the Enron Email and Facebook
datasets. Under snapshot and wavefront observations, the number of suspects can be
reduced to 45% of all users in the MIT reality dataset, and to 20% of all users for the
Enron Email and Facebook datasets. The worst case occurred in the Sigcom09 dataset
with wavefronts, but our method still achieved a reduction of 35% in the total number
of users.
The experiment results on real time-varying social networks show that the
proposed method is efficient in narrowing down the suspects. Real-world social
networks usually have a large number of users. Our proposed method addresses
the scalability issue in source identification and is therefore of great significance.

10.4 Determining the Real Source

Another bottleneck of identifying rumor sources is to design a good measure to


specify the real source. Most of the existing methods are based on node centralities,
which ignore the propagation probabilities between nodes. Some other methods
consider BFS trees instead of the original networks. Both simplifications violate the
actual rumor spreading process. In this section, we adopt an innovative maximum-likelihood
based method to identify the real source from the suspects. A novel rumor spreading
model will also be introduced to model rumor spreading in time-varying social
networks.

10.4.1 A Maximum-Likelihood (ML) Based Method


10.4.1.1 Rationale

The key idea of the ML-based method is to expose the suspect from set U that
provides the largest maximum likelihood to match the observation. It is expected
that the real source will produce a rumor propagation that matches the observation,
both temporally and spatially, better than other suspects. Given an
observation O = {o1 , o2 , . . . , on } in a time-varying network, we let the spread
of rumors start from an arbitrary suspect u ∈ U from the time window that
is tu before the latest time window. For an arbitrary observed node oi , we use
PS (oi , tu |u) to denote the probability of oi being susceptible at time tu , given that the
spread of rumors starts from suspect u. Similarly, we have PC (oi , tu |u), PI (oi , tu |u)
and PR (oi , tu |u) representing the probabilities of oi being contagious, infected
and recovered at time tu , respectively. We use L̃(tu , u) to denote the maximum
likelihood of obtaining the observation when the rumor started from suspect u.
Among all the suspects in U , we can estimate the real source by choosing the
maximum value of the ML, as in

(u*, t*) = arg max_{u∈U} L̃(tu, u).    (10.4)

The result of Eq. (10.4) suggests that suspect u* can provide a rumor propagation that
matches the observation, both temporally and spatially, better than any other suspect.
We also have an estimation of infection scale I (t ∗ , u∗ ) as a byproduct, as in


I(t*, u*) = Σ_{i=1}^{N} PI(i, t*|u*).    (10.5)

Later, we can justify the effectiveness of the ML-based method by examining the
accuracy of t ∗ and I (t ∗ , u∗ ).
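The byproduct in Eq. (10.5) is simply a sum of per-node infection probabilities. A one-function sketch, assuming a hypothetical table `p_infected[(i, t)]` that holds the model's P_I(i, t|u*) values with the source conditioning folded in:

```python
def infection_scale(nodes, t_star, p_infected):
    """Eq. (10.5): expected number of infected nodes after the rumor has
    spread for t* time windows from the estimated source u*."""
    return sum(p_infected[(i, t_star)] for i in nodes)
```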

10.4.1.2 Wavefront

In a wavefront, all observed nodes are contagious in the time window when the
wavefront is captured. Supposing suspect u is the rumor source, the maximum

likelihood L̃(tu , u) of obtaining the wavefront O is the product of the probabilities


of any observed node oi ∈ O being contagious after time tu . We also adopt a
logarithmic function to present the computation of the maximum likelihood. Then,
we have L̃(tu , u) for a wavefront, as in

L̃(tu, u) = Σ_{oi∈O} ln(PC(oi, tu|u)).    (10.6)

10.4.1.3 Snapshot

In a snapshot, the observed nodes can be susceptible, infected or recovered in the


time window when the snapshot is taken. Supposing suspect u is the rumor source,
the maximum likelihood of obtaining the snapshot is the product of the probabilities
of any observed node oi ∈ O being in its observed state. Then, we have the
logarithmic form of the calculation for L̃(tu , u) in a snapshot, as in

L̃(tu, u) = Σ_{oi∈OS} ln(PS(oi, tu|u)) + Σ_{oj∈OI} ln(PI(oj, tu|u)) + Σ_{ok∈OR} ln(PR(ok, tu|u)).    (10.7)

10.4.1.4 Sensor

In a sensor observation, each infected sensor oi ∈ OI records its infection time


ti . Although the absolute time ti cannot directly suggest the spreading time of the
rumor, we can derive the relative infection time of each sensor. Supposing suspect u
is the rumor source, for an arbitrary infected sensor oi , its relative infection time is
t˜i = ti − t˜ + tu where t˜ = min{ti |oi ∈ OI }, and tu is obtained from Algorithm 10.1.
For suspect u ∈ U , the maximum likelihood L̃(tu , u) of obtaining the observation
is the product of the probability of any sensor oi to be in its observed state at time
t˜i . Then, we have the logarithmic form of the calculation for L̃(tu , u) in a sensor
observation, as in
 
L̃(tu, u) = Σ_{oi∈OI} ln(PC(oi, t̃i|u)) + Σ_{oj∈OS} ln(PS(oj, tu|u)).    (10.8)

Note that PS(u, t|oi), PC(u, t|oi), PI(u, t|oi), and PR(u, t|oi) can be calculated
in the rumor spreading model in Sect. 10.4.2. We summarize the method of
determining rumor sources in Algorithm 10.2.

Algorithm 10.2: Targeting the suspect


Input: A set of suspects U , a set of observed nodes O, and a threshold tmax .
Initialize: Lmax = 0, u∗ = ∅, t ∗ = 0.
for (ũ: any node in set U) do
    for (t starts from 1 to a given maximum value tmax) do
        Disseminate the rumor from suspect ũ.
        if (we can obtain the observation O) then
            Compute the maximum likelihood value L̃(t, ũ).
            if (L̃(t, ũ) > Lmax) then
                Lmax = L̃(t, ũ); u∗ = ũ; t∗ = t.
        if (L̃(t, ũ) < L̃(t − 1, ũ)) then
            Stop.

Output: The rumor source u∗ and propagation time t ∗ .
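The search in Algorithm 10.2 is a two-level loop with an early stop. In the sketch below, `likelihood(t, u)` is a hypothetical callback standing in for the computation of L̃(t, u) via Eq. (10.6), (10.7) or (10.8); the early stop is applied per suspect once the likelihood starts to decrease, which is one reading of the algorithm's last step:

```python
def target_suspect(suspects, t_max, likelihood):
    """Return (u*, t*) = argmax over u in U and t <= t_max of L~(t, u)."""
    best = float('-inf')
    u_star, t_star = None, 0
    for u in suspects:
        prev = float('-inf')
        for t in range(1, t_max + 1):
            l = likelihood(t, u)
            if l > best:
                best, u_star, t_star = l, u, t
            if l < prev:          # L~(t, u) < L~(t - 1, u): stop this suspect
                break
            prev = l
    return u_star, t_star
```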

10.4.2 Propagation Model

In this subsection, we introduce an analytical model to present the spreading


dynamics of rumors in time-varying social networks. The state transition of each
node follows the SIR scheme introduced in Sect. 10.2.2. For rumor spreading
processes among users, we use this model to calculate the probabilities of each user
in various states.
In the modeling, every user is initially susceptible. We use ηji(t) to denote the
spreading probability from user j to user i in time window t. Then, we can calculate
the probability of a susceptible user being infected by his/her infected neighbors
as in

v(i, t) = 1 − Π_{j∈Ni} [1 − ηji(t) · PI(j, t − 1)],    (10.9)

where Ni denotes the set of neighbors of user i. Then, we can compute the
probability of an arbitrary user to be susceptible at time t as in

PS (i, t) = [1 − v(i, t)] · PS (i, t − 1). (10.10)

Once a user gets infected, he/she becomes contagious. We then have the probability
that an arbitrary user is contagious at time t as in

PC (i, t) = v(i, t) · PS (i, t − 1). (10.11)



Since an infected user can be either contagious or misled, we can obtain the value
of PI (i, t) as in

PI (i, t) = PC (i, t) + (1 − qi (t)) · PI (i, t − 1). (10.12)

Then, the value of the PR (i, t) can be derived from

PR (i, t) = PR (i, t − 1) + qi (t) · PI (i, t − 1). (10.13)

This model analytically derives the probabilities of each user being in various states at an
arbitrary time t. These probabilities constitute the maximum likelihood L(u, t) of an
arbitrary user u being a suspect in time window t in Sect. 10.3.1. They also support
the calculation of the maximum likelihood L̃(t, u) of matching the observation in time
window t, given that the rumor source is the suspicious user u, in Sect. 10.4.1.
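Under the layout assumptions below (plain dictionaries for the per-window parameters and state probabilities), one discrete-time update of Eqs. (10.9), (10.10), (10.11), (10.12) and (10.13) can be sketched as:

```python
def step(neighbors, eta, q, P_S, P_I, P_R):
    """One time-window update of the per-node SIR state probabilities.

    neighbors[i]: iterable of i's neighbours in this window (the set Ni).
    eta[(j, i)]:  spreading probability from j to i in this window.
    q[i]:         recovery probability of i in this window.
    P_S, P_I, P_R: state probabilities from time window t - 1.
    Returns the new (P_S, P_I, P_R) together with P_C, the probability
    of becoming contagious in this window.
    """
    S, C, I, R = {}, {}, {}, {}
    for i in neighbors:
        prod = 1.0
        for j in neighbors[i]:
            prod *= 1.0 - eta[(j, i)] * P_I[j]   # Eq. (10.9): v(i, t)
        v = 1.0 - prod
        S[i] = (1.0 - v) * P_S[i]                # Eq. (10.10)
        C[i] = v * P_S[i]                        # Eq. (10.11)
        I[i] = C[i] + (1.0 - q[i]) * P_I[i]      # Eq. (10.12)
        R[i] = P_R[i] + q[i] * P_I[i]            # Eq. (10.13)
    return S, I, R, C
```

Note that S[i] + I[i] + R[i] remains 1 for every node after each update, as the SIR scheme requires.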

10.5 Evaluation

In this section, we evaluate the efficiency of our source identification method. The
experiment settings are the same as those presented in Sect. 10.3.2. Specifically,
we let the sampling ratio α range from 10% to 30%, as the reverse dissemination
method already achieves good performance with α in this range.

10.5.1 Accuracy of Rumor Source Identification

We evaluate the accuracy of our method in this subsection. We use δ to denote


the error distance between a real source and an estimated source. Ideally, we have
δ = 0 if our method accurately captures the real source. In practice, we expect that
our method can accurately capture the real source or a user very close to the real
source (i.e., δ is very small). As a user close to the real source usually shares similar
characteristics with the real source, quarantining or clarifying rumors at this user is
also very effective in diminishing the rumor [160].
Our method shows good performance in the four real time-varying social
networks. Figure 10.6 shows the frequency of the error distances (δ) in the MIT
reality dataset under different categories of observations. When the sampling ratio
α ≥ 20%, our method can identify the real sources with an accuracy of 78% for
the sensor observations, more than 60% for the snapshots, and around 36% for the
wavefronts. For the wavefronts, although our method cannot identify real sources
with very high accuracy, the estimated sources are very close to the real sources, and
are generally 0–2 hops away. Figure 10.7 shows the frequency of the error distances
δ in the Sigcom09 dataset. When the sampling ratio α ≥ 20%, the proposed method
can identify the real sources with an accuracy of more than 70% for the snapshots.

Fig. 10.6 The distribution of error distance (δ) in the MIT Reality dataset. (a) Sensor; (b)
Snapshot; (c) Wavefront

Fig. 10.7 The distribution of error distance (δ) in the Sigcom09 dataset. (a) Sensor; (b) Snapshot;
(c) Wavefront

Fig. 10.8 The distribution of error distance (δ) in the Enron Email dataset. (a) Sensor; (b)
Snapshot; (c) Wavefront

For the other two categories of observations, although our method cannot identify
real sources with very high accuracy, the estimated sources are very close to the
real sources, with an average of 1–2 hops away in the sensor observations, and 1–3
hops away for the wavefronts. Figure 10.8 shows the performance of our method
in the Enron Email dataset. When the sampling ratio α ≥ 20%, our method can

Fig. 10.9 The distribution of error distance (δ) in the Facebook dataset. (a) Sensor; (b) Snapshot;
(c) Wavefront

identify the real sources with an accuracy of 80% for the snapshots, and more than
45% for the wavefronts. The estimated sources are very close to the real sources,
with an average 1–3 hops away in the sensor observations. Figure 10.9 shows the
performance of our method in the Facebook dataset. Similarly, when the sampling
ratio α ≥ 20%, the proposed method can identify the real sources with an accuracy
of around 40% for the snapshots. The estimated sources are very close to the real
sources, with an average of 1–3 hops away from the real sources under the sensor
and wavefront observations.
Compared with previous work, our proposed method is superior because it can
work in time-varying social networks rather than only static networks. In around
80% of all experiment runs, our method accurately identifies the real source or an
individual very close to the real source. In contrast, the previous work of [178]
theoretically proved that its accuracy was at most 25% or 50% in tree-like networks,
with an average error distance of 3–4 hops.

10.5.2 Effectiveness Justification

We justify the effectiveness of our ML-based method from three aspects: the
correlation between the ML of the real sources and that of the estimated sources,
the accuracy of estimating rumor spreading time, and the accuracy of estimating
rumor infection scale.

10.5.2.1 Correlation Between Real Sources and Estimated Sources

We investigate the correlation between the real sources and the estimated sources by
examining the correlation between their maximum likelihood values. For different
types of observation, the maximum likelihood of an estimated source can be
obtained from Eqs. (10.6), (10.7), or (10.8), i.e., L̃(t*, u*). The maximum

Fig. 10.10 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the MIT reality dataset. (a) Sensor observation; (b) Snapshot observation; (c)
Wavefront observation

Fig. 10.11 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the Sigcom09 dataset. (a) Sensor observation; (b) Snapshot observation; (c)
Wavefront observation

likelihood of a real source is obtained by replacing u* and t* with the real source
and the real rumor spreading time, respectively. If the estimated source is in fact the
real source, their maximum likelihood values should be highly correlated.
The correlation results of the maximum likelihood values when α = 20% in
the four time-varying social networks are shown in Figs. 10.10, 10.11, 10.12,
and 10.13. We see that the maximum likelihood values of the real sources and those of the
estimated sources are highly correlated with each other. Their maximum likelihood
values approximately form linear relationships to each other. Figure 10.10 shows the
results in the MIT reality dataset. We can see that the maximum likelihood values of
the real sources and those of the estimated sources are highly correlated in both sensor
and snapshot observations. The worst results occurred in wavefront observations;
however, the majority of the correlation results still cluster along a line.
These exactly reflect the accuracy of identifying rumor sources in Fig. 10.6. The
results in the Sigcom09 dataset are shown in Fig. 10.11. We see that the maximum
likelihood values are highly correlated in both snapshot and wavefront observations.
The worst results occurred in sensor observations; however, the majority of the

Fig. 10.12 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the Enron Email dataset. (a) Sensor observation; (b) Snapshot observation;
(c) Wavefront observation

Fig. 10.13 The correlation between the maximum likelihood of the real sources and that of the
estimated sources in the Facebook dataset. (a) Sensor observation; (b) Snapshot observation;
(c) Wavefront observation

correlation results still cluster along a line. These exactly reflect the
accuracy of identifying rumor sources in Fig. 10.7. The results in the Enron Email
dataset are shown in Fig. 10.12. We see that the maximum likelihood values
are highly correlated in both snapshot and wavefront observations, and slightly
correlated in sensor observations. These exactly reflect the accuracy of identifying
rumor sources in Fig. 10.8. Similar results can be found in the Facebook dataset
in Fig. 10.13, which precisely reflects the accuracy of identifying rumor sources in
Fig. 10.9.
The strong correlation between the ML values of the real sources and those of the
estimated sources in time-varying social networks reflects the effectiveness of our
ML-based method.

10.5.2.2 Estimation of Spreading Time

As a byproduct, our ML-based method can also estimate the spreading time (in
Eq. (10.4)) of rumors. In order to justify the effectiveness of our proposed method,

Table 10.2 Accuracy of estimating rumor spreading time


Environment settings Estimated spreading time
Observation T MIT Sigcom09 Email Facebook
Sensor 2 2±0 2±0 2±0 1.787±0.411
4 4.145±0.545 3.936±0.384 4.152±0.503 3.690±0.486
6 6.229±0.856 5.978±0.488 6.121±0.479 5.720±0.604
Snapshot 2 1.877±0.525 2.200±1.212 2.212±0.781 2.170±0.761
4 3.918±0.862 3.920±0.723 3.893±0.733 4.050±0.716
6 6.183±1.523 6.125±1.330 5.658±1.114 5.650±1.266
Wavefront 2 2±0 2±0 2±0 1.977±0.261
4 4.117±0.686 4±0 3.984±0.590 4.072±0.652
6 6±0 5.680±1.096 5.907±0.640 5.868±0.864

we further investigate the effectiveness of this byproduct. We expect the estimate


can accurately expose the real spreading time of rumors. We let the real spreading
time vary from 2 to 6 in four real time-varying social networks. The experiment
results are shown in Table 10.2.
As shown in Table 10.2, we analyze the means and the standard deviations of the
estimated spreading time. We see that the means of the estimated spreading time are
very close to the real spreading time, and most results of the standard deviations
are smaller than 1. Especially when the spreading time T = 2, our ML-based
method in sensor observations and wavefront observations can accurately estimate
the spreading time in the MIT reality, Sigcom09 and Enron Email datasets. The
results are also quite accurate in the Facebook dataset. From Table 10.2, we can see
that our method can estimate the spreading time with extremely high accuracy in
wavefront observations, and relatively high accuracy in snapshot observations.
Both the means and standard deviations indicate that our method can estimate
the real spreading time with high accuracy. The accurate estimate of the spreading
time indicates that our method is effective in rumor source identification.

10.5.2.3 Estimation of Infection Scale

We further justify the effectiveness of our ML-based method by investigating


its accuracy in estimating the infection scale of rumors provided by the second
byproduct in Eq. (10.5). We expect that the ML-based method can accurately
estimate the infection scale of each propagation incident. In particular, in the experiments
we let the rumor spreading initiate from the node with the largest degree in each full
time-varying social network and spread for six time windows.
In Fig. 10.14, we show the real infection scales at each time tick, and also the
estimated infection scales in different types of observations. We can see that the
proposed method can provide a fairly accurate estimate of the infection scales of
rumors in the MIT reality dataset, the Sigcom09 dataset and the Facebook dataset in

Fig. 10.14 The accuracy of estimating infection scale in real networks. (a) MIT; (b) Sigcom09;
(c) Enron Email; (d) Facebook

different types of observations. As shown in Fig. 10.14c, the worst result occurred
in the Enron Email dataset after time tick 4. According to our investigation, this was
caused by a large number of infected nodes transitioning to the recovered state in the
SIR scheme, which leads to a fairly large uncertainty in the estimate.
To summarize, all of the above evaluations reflect the effectiveness of our method
from different aspects: the high correlation between the ML values of the real
sources and those of the estimated sources, the high accuracy in estimating the
spreading time of rumors, and the high accuracy in estimating the infection scale.

10.6 Summary

In this chapter, we explore the problem of rumor source identification in time-


varying social networks, which can be reduced to a series of static networks by
introducing a time-integrating window. In order to address the challenges posed
by time-varying social networks, we adopted two innovative methods. First, we
utilized a novel reverse dissemination method which can sharply narrow down the
scale of suspicious sources. This addresses the scalability issue in this research area
and therefore dramatically promotes the efficiency of rumor source identification.
Then, we introduced an analytical model for rumor spreading in time-varying
social networks. Based on this model, we calculated the maximum likelihood of
each suspect to determine the real source from the suspects. We conduct a series
of experiments to evaluate the efficiency of our method. The experiment results
indicate that our methods are efficient in identifying rumor sources in different types
of real time-varying social networks.
There is future work that can be done on identifying rumor sources in time-
varying networks. There are also many other models of rumor propagation, such
as the models in [117, 128]. These models can be basically divided into two
categories: the macroscopic models and the microscopic models. The macroscopic
models, which are based on differential equations, only provide the overall infection
trend of rumor propagation, such as the total number of infected nodes [207]. The
microscopic models, which are based on difference equations, not only provide

the overall infection status of rumor propagation, but they also can estimate the
probability of an arbitrary node being in an arbitrary state [146]. In the field of
identifying propagation sources, researchers generally choose microscopic models,
because the task requires estimating which specific node was the first to be infected.
To the best of our knowledge, there is so far no work based on macroscopic models
for identifying rumor sources in social networks. Future work may also investigate
combining microscopic and macroscopic models, or even adopting the mesoscopic
models [118, 124], to estimate both the rumour sources and the trend of the
propagation. There are also many other microscopic models other than the SIR
model adopted in this chapter, such as the SI, SIS, and SIRS models [146, 201].
As we discussed in Sect. 10.2.2, people generally will not believe the rumor again
after they know the truth, i.e., after they recover, they will not transition to other
states. Thus, the SIR model can reflect the state transitions of people when they hear
a rumor. We also evaluate the performance of the proposed method on the SI model.
Since the performance of our method on the SI model is similar to that on the SIR
model, we only present the results on the SIR model in this chapter.
Chapter 11
Identifying Multiple Propagation Sources

The global diffusion of epidemics, computer viruses and rumors causes great
damage to our society. One critical issue is to identify the multiple diffusion sources
so as to quarantine them in a timely manner. However, most methods proposed so far are unsuitable
for diffusion with multiple sources because of the high computational cost and
the complex spatiotemporal diffusion processes. In this chapter, we introduce an
effective method to identify multiple diffusion sources, which can address three
main issues in this area: (1) How many sources are there? (2) Where did the diffusion
emerge? (3) When did the diffusion break out? For simplicity, we use rumor source
identification to present the approach.

11.1 Introduction

With the rapid urbanization and advancements in communication technologies, the


world has become more interconnected. This not only makes our daily life more
convenient, but also has made us vulnerable to new types of diffusion risks. These
diffusion risks arise in many different contexts. For instance, infectious diseases,
such as SARS [123], H1N1 [62] or Ebola [168], have spread geographically and
killed hundreds of thousands or even millions of people. Computer viruses, like
Cryptolocker and Alureon, cause a good share of cyber-security incidents [119]. In
the real world, rumors often emerge from multiple sources and spread incredibly
fast in complex networks [62, 123, 142, 168]. After the initial outbreak of rumor
diffusion, the following three issues often attract people’s attention: (1) How many
sources are there? (2) Where did the diffusion emerge? and (3) When did the
diffusion break out?
In the past few years, researchers have proposed a series of methods to
identify rumor diffusion sources in networks. These methods mainly focus on the
identification of a single diffusion source in networks. For example, Shah and Zaman

© Springer Nature Switzerland AG 2019 139


J. Jiang et al., Malicious Attack Propagation and Source
Identification, Advances in Information Security 73,
https://doi.org/10.1007/978-3-030-02179-5_11

[160] proposed a rumor-center method to identify rumor sources in networks under


complete observations. Essentially, they consider the rumor source as the node that
has the maximum number of distinct propagation paths to all the infected nodes.
Zhu and Ying [203] proposed a Jordan-center method, which utilizes a sample
path based approach, to detect diffusion sources in tree networks with snapshot
observations. Luo et al. [115] derived the Jordan-center method based on a different
approach. Shah and Zaman [161] proved that, even in tree networks, the rumor
source identification problem is a #P-complete problem, which is at least as hard
as the corresponding NP problem. The problem becomes even harder for generic
networks. However, due to the extreme complexity of the spatiotemporal rumor
propagation process and the underlying network structure, few existing methods
have been proposed for identifying multiple diffusion sources. Luo et al. [114] proposed
a multi-rumor-center method to identify multiple rumor sources in tree-structured
networks. The computational complexity of this method is O(n^k), where n is the
number of infected nodes and k is the number of sources. It is too computationally
expensive to be applied in large-scale networks with multiple diffusion sources.
Chen et al. [32] extended the Jordan-center method from single source detection to
the identification of multiple sources in tree networks. However, the topologies of
real-world networks are far more complex than trees. Fioriti et al. [58] introduced
a dynamic age method to identify multiple diffusion sources in general networks.
They claimed that the ‘oldest’ nodes, those associated with the largest
eigenvalues of the adjacency matrix, were the sources of the diffusion. Similar work
to this technique can be found in [148]. However, an essential prerequisite of these
methods is that the number of sources must be known in advance.
In this chapter, we introduce a novel method, K-center, to identify multiple
rumor sources in general networks. In the real world, the rumor diffusion processes
in networks are spatiotemporally complex because of the combined multi-scale
nature and intrinsic heterogeneity of the networks. To have a clear understanding
of the complex diffusion processes, we adopt a measure, effective distance, recently
proposed by Brockmann and Helbing [24]. The concept of effective distance
reflects the idea that a small propagation probability between neighboring nodes
is effectively equivalent to a large distance between them, and vice versa. By using
effective distance, the complex spatiotemporal diffusion processes can be reduced
to homogeneous wave propagation patterns [24]. Moreover, the relative arrival time
of diffusion arriving at a node is independent of diffusion parameters but linear
with the effective distance between the source and the node of interest. For multi-
source diffusion, we obtain the same linear correlation between the relative arrival
time and the effective distance of any infected node. Thereby, supposing that any
node can be infected very quickly, an arbitrary node is more likely to be infected
by its closest source in terms of effective distance. Therefore, to identify multiple
diffusion sources, we need to partition the infection graph so as to minimize the sum
of effective distances between any infected node and the corresponding partition
center. The final partition centers are viewed as diffusion sources.
The contribution of this part of the work is threefold, corresponding to the three key
issues of the rumor source identification problem:

• We propose a fast method to identify multiple rumor diffusion sources. Based on


this method, we can determine where the diffusion emerged. We prove that the
proposed method is convergent and the computational complexity is O(mn log α),
where α = α(m, n) is the slowly growing inverse-Ackermann function, n is the
number of infected nodes, and m is the number of edges connecting them.
• According to the topological positions of the detected rumor sources, we derive
an efficient algorithm to estimate the spreading time of the diffusion.
• When the number of sources is unknown, we develop an intuitive and effective
approach that can estimate the number of diffusion sources with high accuracy.

11.2 Preliminaries

In this section, we introduce preliminary knowledge used in this chapter, including


the analytic epidemic model [207] and the concept of effective distance [24]. For
convenience, we borrow notions from the area of epidemics to represent the states
of nodes in a network [186]. A node being infected stands for a person getting
infected by a disease, viruses having compromised a computer, or a user believing
a rumor. Readers can derive analogous meanings for a node being susceptible or
recovered.

11.2.1 The Epidemic Model

We adopt the classic susceptible-infected (SI) model to present the diffusion


dynamics of each node. Figure 11.1 shows the state transition graph of a node in
this model.
As shown in Fig. 11.1, every node is initially susceptible (Sus.). An arbitrary
susceptible node i can be infected (Inf.) by its already-infected neighbors with
probability v(i, t) at time t. Therefore, we can compute the probability of node i
to be susceptible at time t as in

PS (i, t) = [1 − v(i, t)] · PS (i, t − 1). (11.1)

Then, we can obtain the probability of node i to be infected at time t as in

PI (i, t) = v(i, t) · PS (i, t − 1) + PI (i, t − 1). (11.2)

Fig. 11.1 The state transition


graph of a node in the SI
model

We use ηj i to denote the propagation probability from node j to its neighboring


node i. Then, we can calculate the probability of node i being infected by its
neighbors as in

v(i, t) = 1 − ∏_{j ∈ Ni} [1 − ηji · PI (j, t − 1)], (11.3)

where Ni denotes the set of neighbors of node i. This model analytically derives
the probability of each node being in each state at an arbitrary time. To address real
problems, the length of each time tick depends on the real environment: it can be
1 min, 1 h or 1 day. We also need to set the propagation probability ηij between
nodes properly.
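A minimal sketch of the recursions in Eqs. (11.1)-(11.3) follows; the three-node chain, its propagation probabilities and the initial source are illustrative assumptions, not data from this chapter.

```python
# Sketch of the analytic SI recursions, Eqs. (11.1)-(11.3). The
# three-node chain, its propagation probabilities and the initial
# source are illustrative assumptions.

def si_step(P_S, P_I, neighbors, eta):
    """Advance the susceptible/infected probabilities by one time tick."""
    v = {}
    for i in neighbors:
        prod = 1.0
        for j in neighbors[i]:
            prod *= 1.0 - eta[(j, i)] * P_I[j]        # product in Eq. (11.3)
        v[i] = 1.0 - prod                             # infection probability v(i, t)
    new_S = {i: (1.0 - v[i]) * P_S[i] for i in P_S}   # Eq. (11.1)
    new_I = {i: v[i] * P_S[i] + P_I[i] for i in P_I}  # Eq. (11.2)
    return new_S, new_I

# Chain a - b - c; the diffusion starts at node 'a' at time 0.
neighbors = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
eta = {('a', 'b'): 0.8, ('b', 'a'): 0.8, ('b', 'c'): 0.1, ('c', 'b'): 0.1}
P_S = {'a': 0.0, 'b': 1.0, 'c': 1.0}
P_I = {'a': 1.0, 'b': 0.0, 'c': 0.0}
for t in range(3):
    P_S, P_I = si_step(P_S, P_I, neighbors, eta)
```

After a few ticks, the infection probability of node b (reached through a high-probability edge) far exceeds that of node c, which mirrors the effective-distance intuition developed next.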

11.2.2 The Effective Distance

Brockmann and Helbing [24] recently proposed a new measure, effective distance,
which can disclose the hidden pattern geometry of complex diffusion. The effective
distance from a node i to a neighboring node j is defined as

e(i, j) = 1 − log ηij , (11.4)

where ηij is again the propagation probability from i to j . This concept reflects
the idea that a small propagation probability from i to j is effectively equivalent
to a large distance between them, and vice versa. To illustrate this measure, a
simple example is shown in Fig. 11.2. For instance, the propagation probability
is 0.8 between node S and A, and is only 0.1 between S and B (see Fig. 11.2a).
Correspondingly, the effective distance between S and A is 1.22 which is much less
than that between S and B (see Fig. 11.2b).
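The conversion in Eq. (11.4) is a one-liner; the sketch below reuses the probabilities 0.8 and 0.1 from the example of Fig. 11.2, and the quoted value 1.22 corresponds to the natural logarithm.

```python
import math

def effective_distance(eta):
    """Effective distance of one edge with propagation probability eta,
    Eq. (11.4), using the natural logarithm."""
    return 1.0 - math.log(eta)

d_SA = effective_distance(0.8)  # high probability -> short effective distance (~1.22)
d_SB = effective_distance(0.1)  # low probability  -> long effective distance (~3.30)
```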
Based on the effective distances between neighboring nodes, the length λ(Γ) of a
path Γ = {u1, . . . , uL} is defined as the sum of the effective distances along the edges of
the path. Moreover, the effective distance from an arbitrary node i to another node j
is defined as the length of the shortest path in terms of effective distance from node
i to node j, i.e.,

d(i, j) = min_Γ λ(Γ). (11.5)

From the perspective of a diffusion source s, the set of shortest paths in terms
of effective distance to all the other nodes constitutes a shortest-path tree rooted at s.
Brockmann and Helbing showed that the diffusion process initiated from node s on
the original network can be represented as wave patterns on this shortest-path tree.
In addition, they conclude that the relative arrival time of the diffusion arriving
at a node is independent of diffusion parameters and is linear with the effective
distance from the source to the node of interest.

Fig. 11.2 An example of altering an infection graph using effective distance. (a) An example
infection graph with source S. The weight on each edge is the propagation probability. The two
dotted circles represent the first-order and second-order neighbors of source S. The colors indicate the
infection order of nodes, e.g., nodes A, C, D and F are infected after the first time tick. Notice that
the diffusion process is spatiotemporally complex. (b) The altered infection graph. The weight on
each edge is the effective distance between the corresponding end nodes. Notice that the effective
distances from source S to the infected nodes can accurately reflect their infection orders

In this chapter, we will alter the original network by utilizing effective distance
through converting the propagation probability on each edge to the corresponding
effective distance. Then, by using the linear relationship between the relative arrival
time and the effective distance of any infected node, we derive a novel method to
identify multiple diffusion sources.
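Altering the network and evaluating Eq. (11.5) amounts to a shortest-path computation over the converted edge weights. A minimal sketch using Dijkstra's algorithm; the toy three-node graph is an assumption for illustration.

```python
import heapq
import math

def effective_distances_from(source, prob):
    """Dijkstra over edge weights e(i, j) = 1 - log(eta_ij), giving the
    shortest-path effective distances of Eq. (11.5).
    `prob` maps each node to {neighbor: propagation probability}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                               # stale heap entry
        for v, eta in prob[u].items():
            nd = d + 1.0 - math.log(eta)           # add the edge's effective distance
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy graph: S reaches B directly with low probability, or via A with
# two high-probability hops; the latter route is effectively "closer".
prob = {'S': {'A': 0.8, 'B': 0.1},
        'A': {'S': 0.8, 'B': 0.9},
        'B': {'S': 0.1, 'A': 0.9}}
dist = effective_distances_from('S', prob)
```

Here the two-hop route S-A-B is shorter in effective distance than the one-hop, low-probability edge S-B, which is exactly the hidden geometry that effective distance exposes.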

11.3 Problem Formulation

Before we present the problem formulation derived in this chapter, we first show
an alternative expression of an arbitrary infection graph by using effective distance
(see again Fig. 11.2). Figure 11.2a shows an example of an infection graph with
diffusion source S. The colors indicate the infection order of nodes (e.g., nodes A,
C, D and F were infected after the first time tick T = 1, similarly for the other
nodes). Notice that the diffusion process is spatiotemporally complex, because the
first-order neighbors of source S can be infected after the second time tick (e.g.,
node E) or even the third time tick (e.g., node B), similarly for the second-order
and third-order neighbors. We then alter the infection graph by replacing the weight
on each edge with the effective distance between the corresponding end nodes (see
Fig. 11.2b). We notice that the effective distances from source S to all the infected
nodes can accurately reflect the infection order of them. This exactly shows that the
relative arrival time of an arbitrary node getting infected is linear with the effective
distance between the source and the node of interest.
Suppose that at time T = 0, there are k (≥ 1) sources, S* = {s1, . . . , sk}, starting
the diffusion simultaneously [58, 115]. Several time ticks after the diffusion started,
we obtain n infected nodes. These nodes form a connected infection graph Gn, and
each source si has its infection region Ci (⊆ Gn). Let C* = ∪_{i=1}^{k} Ci be a partition
of the infection graph such that Ci ∩ Cj = ∅ for i ≠ j. Each partition Ci is a
connected subgraph in Gn and consists of the nodes whose infection can be traced
back to the source node si . For an arbitrary infected node vj ∈ Ci , suppose it can
be infected in the shortest time, then according to our previous analysis, it will have
shorter effective distance to source si than to any other source. Therefore, we need
to divide the infection graph Gn into k partitions so that each infected node belongs
to the partition with the shortest effective distance to the partition center. The final
partition centers are considered as the diffusion sources.
Given an infection graph Gn , from the above analysis, we know that our goal is
to identify a set of diffusion sources S ∗ and the corresponding partition C ∗ of the
infection graph Gn . To be precise, we aim to find a partition C ∗ of Gn , to minimize
the following objective function,


min_{C*} f = Σ_{i=1}^{k} Σ_{vj ∈ Ci} d(vj, si), (11.6)

where node vj belongs to partition Ci associated with source si , and d(vj , si ) is the
shortest-path distance in terms of effective distance between vj and si .
Equation (11.6) is the proposed formulation of the multi-source identification
problem. Since we need to find the k centers of the diffusion from Eq. (11.6),
we name the proposed method for solving the multi-source identification problem
the K-center method, which we will detail in the following section.

11.4 The K-Center Method

In this section, we propose a K-center method to identify multiple diffusion sources


and the corresponding infection regions in general networks. We first introduce
a method for network partition and then derive the K-center method from it. Next,
according to the estimated sources, we derive an algorithm to predict the spreading
time of the diffusion. Finally, we present a heuristic algorithm to estimate the
diffusion sources when the number of sources is unknown.

11.4.1 Network Partitioning with Multiple Sources

Given an infection network Gn and a set of sources S* = {s1, . . . , sk}, network
partition refers to the division of the network into k partitions with si (i ∈ {1, 2, . . . , k})
as the partition centers. According to our previous analysis in Sect. 11.3, an arbitrary
node vj ∈ Gn should be classified into partition Ci associated with source si, such
that

d(vj, si) = min_{sl ∈ S*} d(vj, sl). (11.7)

In essence, for an arbitrary node vj ∈ Gn , it needs to be associated to source si that


is the nearest source to vj . This is similar to the Capacity Constrained Network-
Voronoi Diagram (CCNVD) problem [195]. Given a graph and a set of service
centers, the CCNVD partitions the graph into a set of contiguous service areas that
meet service center capacities and minimize the sum of the distances (min-sum)
from graph nodes to allotted service centers. The CCNVD problem is important for
critical societal applications such as assigning evacuees to shelters and assigning
patients to hospitals.
In this chapter, to satisfy Eq. (11.7), we utilize the Voronoi strategy to partition
the altered infection graph obtained from Sect. 11.3. The detailed Voronoi partition
process is shown in Algorithm 11.1. Future work may use community structure
for network partition. Current methods for detecting community structure include
strategies based on betweenness [72], information theory [152], and modularity
optimization [132].
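Algorithm 11.1 reduces to a nearest-source assignment once effective distances are available. A minimal sketch, assuming the shortest-path effective distances d(v, s) have been precomputed into a dictionary; the node names and distance values are hypothetical.

```python
# Sketch of Algorithm 11.1: assign every infected node to its nearest
# source in terms of effective distance, Eq. (11.7). The node names and
# distance values below are hypothetical.

def voronoi_partition(nodes, sources, d):
    """`d[(v, s)]` is the shortest-path effective distance from v to s."""
    partition = {s: [] for s in sources}
    for v in nodes:
        nearest = min(sources, key=lambda s: d[(v, s)])
        partition[nearest].append(v)
    return partition

nodes = ['v1', 'v2', 'v3', 'v4']
sources = ['s1', 's2']
d = {('v1', 's1'): 1.2, ('v1', 's2'): 4.0,
     ('v2', 's1'): 2.5, ('v2', 's2'): 3.1,
     ('v3', 's1'): 3.8, ('v3', 's2'): 1.4,
     ('v4', 's1'): 5.0, ('v4', 's2'): 2.2}
part = voronoi_partition(nodes, sources, d)
```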

11.4.2 Identifying Diffusion Sources and Regions

In this subsection, we present the K-center method to identify multiple diffusion


sources. According to the objective function in Eq. (11.6), we need to find a partition
C ∗ of the altered infection graph Gn , which can minimize the sum of the effective

Algorithm 11.1: Network partition

Input: A set of partition centers S = {si | i = 1, . . . , k} in an infection graph Gn.
Initialize: k partitions: C1 = {s1}, . . . , Ck = {sk}.
for (j from 1 to n) do
    Find the nearest source to node vj as follows,

        si = argmin_{sl ∈ S} d(vj, sl). (11.8)

    Put node vj into partition Ci.

Output: A partition of Gn: C* = ∪_{i=1}^{k} Ci.

Algorithm 11.2: K-center to identify multiple sources

Input: An infection graph Gn and the number of sources k.
Initialize: a positive integer L (the maximum number of iterations), and randomly
choose a set of sources S^(0) = {s1^(0), . . . , sk^(0)} ⊆ Gn.
for (l from 1 to L) do
    Use Algorithm 11.1 to partition Gn with partition centers S^(l−1), and obtain a partition:

        C^(l) = ∪_{i=1}^{k} Ci^(l). (11.9)

    Find the new center in each partition Ci^(l) as follows,

        si^(l) = argmin_{vj ∈ Ci^(l)} Σ_{vx ∈ Ci^(l)} d(vj, vx), i = 1, . . . , k. (11.10)

    if (S^(l) = {s1^(l), . . . , sk^(l)} is the same as S^(l−1)) then
        Stop.

Output: A set of estimated sources S^(l) = {s1^(l), . . . , sk^(l)}.

distances between each infected node and its corresponding partition center. From
the previous subsection, if we randomly choose a set of sources S, Voronoi partition
can split the network into subnets such that each node is associated with its nearest
source. Thus, Voronoi partition can find a locally optimal partition of Gn with a fixed
set of sources S. However, to optimize the partition C*, we need to adjust the
center of each partition so as to minimize the objective function in Eq. (11.6). In
this chapter, we adjust the center of each partition by choosing as the new center the
node that has the minimum sum of effective distances to all the other nodes in the
partition. Therefore, we call this method the K-center method. This is similar to
the rumor-center method and the Jordan-center method that consider rumor centers
or Jordan centers as the diffusion sources. As the name suggests, the K-center
method is more specific to the multi-source identification. The detailed process of
the K-center method is shown in Algorithm 11.2.
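The alternation in Algorithm 11.2 can be sketched as follows, assuming a precomputed table of pairwise shortest-path effective distances; the unit-weight path graph at the end is a toy example, with the number of sources k given by len(init).

```python
# Sketch of Algorithm 11.2 (K-center), assuming a precomputed dictionary
# d[(u, v)] of pairwise shortest-path effective distances. The toy path
# graph below is illustrative, not data from the chapter.

def k_center(nodes, d, init, max_iter=50):
    """Alternate Voronoi assignment and recentring until the centers stop
    changing (Theorem 11.1 guarantees this terminates)."""
    centers = list(init)
    for _ in range(max_iter):
        # Step 1: assign every node to its nearest center (Algorithm 11.1).
        parts = {c: [] for c in centers}
        for v in nodes:
            parts[min(centers, key=lambda c: d[(v, c)])].append(v)
        # Step 2: recentre each partition at the node minimising the sum
        # of effective distances inside the partition, Eq. (11.10).
        new_centers = [min(p, key=lambda u: sum(d[(v, u)] for v in p))
                       for p in parts.values()]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return centers

# Path graph 0-1-2-3-4-5 with unit effective distance per edge.
nodes = list(range(6))
d = {(u, v): abs(u - v) for u in nodes for v in nodes}
est = k_center(nodes, d, init=[0, 5])
```

Starting from the two endpoints, the centers move inward to nodes 1 and 4, each minimizing the within-partition distance sum, after which the partition no longer changes.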
The following two theorems show the convergence of the proposed K-center
method and its computational complexity.
Theorem 11.1 The objective function in Eq. (11.6) is monotonically decreasing in
iterations. Therefore, the K-center method is convergent.
Proof Suppose that at iteration t, S^t = {s1^t, . . . , sk^t} are the estimated sources. We
then use Algorithm 11.1 to partition the infection graph Gn as C^t = ∪_{i=1}^{k} Ci^t. Thus,
the objective function at iteration t becomes

f^t = Σ_{i=1}^{k} Σ_{vj ∈ Ci^t} d(vj, si^t). (11.11)

At the next iteration t + 1, according to the K-center method, we recalculate the
center of each partition Ci^t and obtain S^{t+1} = {s1^{t+1}, . . . , sk^{t+1}}, such that

Σ_{vj ∈ Ci^t} d(vj, si^{t+1}) ≤ Σ_{vj ∈ Ci^t} d(vj, si^t). (11.12)

Then, the objective function becomes

f̃^t = Σ_{i=1}^{k} Σ_{vj ∈ Ci^t} d(vj, si^{t+1}). (11.13)

From Eqs. (11.11) and (11.12), we notice that

f̃^t ≤ f^t. (11.14)

We then re-partition the infection graph Gn with centers S^{t+1} = {s1^{t+1}, . . . , sk^{t+1}}
such that each infected node vj ∈ Gn is associated to its nearest center si^{t+1},
and obtain a new partition C^{t+1} = ∪_{i=1}^{k} Ci^{t+1} of Gn. Thus, the objective function at
iteration t + 1 becomes

f^{t+1} = Σ_{i=1}^{k} Σ_{vj ∈ Ci^{t+1}} d(vj, si^{t+1}). (11.15)

Since each node is classified to its nearest center si^{t+1}, we see that

f^{t+1} ≤ f̃^t. (11.16)

From Eqs. (11.14) and (11.16), we have

f^{t+1} ≤ f̃^t ≤ f^t. (11.17)

Therefore, the objective function in Eq. (11.6) is monotonically decreasing, i.e., the
K-center method is convergent.
Theorem 11.2 Given an infection graph Gn with n nodes and m edges, the
computational complexity of the K-center method is O(mn log α), where α = α(m, n) is
the very slowly growing inverse-Ackermann function [144].
Proof From Algorithm 11.2, we know that the main computational cost of the K-center
method stems from the calculation of the shortest paths between node pairs in the
altered infection graph Gn. The other computation in this algorithm can be treated as
a constant. In this chapter, we adopt the Pettie-Ramachandran algorithm [144]
to compute all-pairs shortest paths in Gn. The computational complexity of this
algorithm is O(mn log α), where α = α(m, n) is the very slowly growing
inverse-Ackermann function [144]. Therefore, we have proved the theorem.
According to Theorem 11.1, we know that the proposed K-center method is well
defined. We notice that the rationale of the K-center method is similar to that of the
K-means algorithm in the data-mining field. As with the K-means algorithm, there
is no guarantee that a global minimum of the objective function will be reached.
From Theorem 11.2, we see that the computational complexity of the K-center
method is much less than that of the method in [115] with O(n^k), and much less than
that of the method in [58] with O(n^3). In addition, the proposed method can be applied to
general networks, whereas the other methods mainly focus on trees. Comparatively,
the proposed method is more efficient and practical in identifying multiple diffusion
sources in large networks.

11.4.3 Predicting Spreading Time

Given an infection graph Gn, we can obtain a partition C* = ∪_{i=1}^{k} Ci of Gn and


the corresponding partition centers S ∗ by using the proposed K-center method.
According to the SI model in Sect. 11.2.1, the spreading time of diffusion can be
estimated by the total number of time ticks of the diffusion. Then, we can predict the
spreading time based on the hops between the source and the infected nodes in each
partition. For an arbitrary source si associated with partition Ci and an arbitrary node
vj ∈ Ci , we introduce h(si , vj ) to denote the minimum number of hops between si
and vj . Therefore, the spreading time in each partition can be estimated as in

ti = max{h(si , vj )|vj ∈ Ci }, i ∈ {1, . . . , k}. (11.18)

Then, the spreading time of the whole diffusion is as in

T = max{ti |i = 1, . . . , k}. (11.19)

The hop-based spreading time T simplifies the modeling process. In the real
world, the spreading times of different paths with the same number of hops may vary
from each other. We have addressed this temporal problem of the SI model in another
chapter [186]. In this field, the majority of current modeling is based on spreading
hops [176]. To be consistent with previous work, we adopt the simplified hop-based
SI model to study the source identification problem.
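Eqs. (11.18) and (11.19) reduce to one breadth-first search per partition. A minimal sketch; the toy infection graph and its two partitions are assumptions for illustration.

```python
from collections import deque

def hops_from(source, adj):
    """Minimum hop counts h(source, v) via breadth-first search."""
    h = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in h:
                h[v] = h[u] + 1
                q.append(v)
    return h

def spreading_time(partitions, adj):
    """T = max_i t_i with t_i = max_{v in C_i} h(s_i, v), Eqs. (11.18)-(11.19).
    `partitions` maps each detected source to the nodes of its partition."""
    t = []
    for source, members in partitions.items():
        h = hops_from(source, adj)
        t.append(max(h[v] for v in members))   # Eq. (11.18)
    return max(t)                              # Eq. (11.19)

# Toy connected infection graph with two partitions centred at 'a' and 'x'.
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'x'],
       'x': ['c', 'y'], 'y': ['x']}
T = spreading_time({'a': ['a', 'b', 'c'], 'x': ['x', 'y']}, adj)
```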

11.4.4 Unknown Number of Diffusion Sources

In most practical applications, the number of diffusion sources is unknown. In this


subsection, we present a heuristic algorithm that allows us to estimate the number
of diffusion sources.
From Sect. 11.4.3, we know that if the number of sources k is given, we can
estimate the spreading time T (k) using Eq. (11.19). To estimate the number of
diffusion sources, we let k start from 1 and compute the spreading time T (1) .
Then, we increase the number of sources k by 1 in each iteration and compute the
corresponding spreading time T^(k) until we find T^(k) = T^(k+1), i.e., the spreading
time of the diffusion stays the same when the number of sources increases from k
to k + 1. In other words, k and k + 1 sources lead to the same infection graph Gn.
We then choose the number of diffusion
sources as k (or k + 1). The detailed process of the K-center method with unknown
number of sources is shown in Algorithm 11.3. We evaluate this heuristic algorithm
in Sect. 11.5.2.
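The stopping rule of Algorithm 11.3 can be sketched independently of the underlying detector. Here estimate_T is a hypothetical callable standing in for a run of Algorithm 11.2 followed by Eq. (11.19), and the toy spreading times are invented.

```python
# Sketch of the stopping rule in Algorithm 11.3. `estimate_T` is a
# hypothetical callable: given k, it should run the K-center detector
# with k sources and return the predicted spreading time of Eq. (11.19).

def estimate_source_count(estimate_T, k_max=20):
    """Increase k until the predicted spreading time stops changing,
    i.e. T(k-1) == T(k); return k-1 as the estimated source count."""
    prev_T = None
    for k in range(1, k_max + 1):
        T = estimate_T(k)
        if T == prev_T:
            return k - 1
        prev_T = T
    return k_max

# Invented stand-in: the predicted time shrinks until the true count (3).
toy_T = {1: 9, 2: 6, 3: 4, 4: 4}
k_hat = estimate_source_count(lambda k: toy_T[k])
```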
To address real problems, we first need to obtain the underlying network over
which the real diffusion spreads. Second, we need to measure the propagation
probability on each edge of the network. Third, according to the SI model in
Sect. 11.2.1, we need to specify the length of one time tick of the diffusion properly.
All of this information is crucial to source identification; however, it requires
considerable effort to obtain.

11.5 Evaluation

In this section, we evaluate the proposed K-center method in three real network
topologies: the North American Power Grid [180], the Yeast protein-protein inter-
action network [82], and the Facebook network [171]. The Facebook network
topology is crawled from December 29th, 2008 to January 3rd, 2009. The basic
statistics of these networks are shown in Table 11.1, and their degree distributions

Algorithm 11.3: K-center identification with unknown number of sources

Input: An infection graph Gn.
Initialize: the number of sources k = 1, and set T^(0) = 0.
while (true) do
    Use Algorithm 11.2 to identify a set of k sources in Gn: S^(k) = {s1, . . . , sk}.
    Calculate the spreading time T^(k) by Eq. (11.19).
    if (T^(k) = T^(k−1)) then
        Stop.
    Update k = k + 1.

Output: A set of k estimated sources S^(k) = {s1, . . . , sk}.

Table 11.1 Statistics of the datasets collected in experiments

Dataset          Power grid   Yeast    Facebook
# nodes          4941         2361     45,813
# edges          13,188       13,554   370,532
Average degree   2.67         5.74     8.09
Maximum degree   19           64       223

Fig. 11.3 Degree distribution. (a) Power grid; (b) Yeast; (c) Facebook

are shown in Fig. 11.3. We adopt the classic SI model, and suppose all infections
are independent of each other. In simulations, we typically set the propagation
probability on each edge, ηij, uniformly distributed in (0, 1). As previous work
[186, 207] has proven that the distribution of the propagation probability does not
affect the accuracy of the SI model, a uniform distribution is sufficient to evaluate the
performance of the proposed method. Similar propagation probability settings can be
found in [116, 203] and [32]. We randomly choose a set of sources S*, and let the
number of diffusion sources |S*| range from 2 to 5. For each type of network and
each number of diffusion sources, we perform 100 runs; this number follows the
discussion in previous work [207]. The implementation is in C++ and Matlab 2012b.
We firstly show the convergence of the proposed method. Figure 11.4 shows the
objective function values in iterations when the number of sources is 2 in the three
real network topologies. It can be seen that the objective function is monotonically
decreasing over iterations. Similar results are found when we choose different
numbers of sources. This, therefore, justifies Theorem 11.1 in Sect. 11.4.2.

Fig. 11.4 The monotonically decreasing of the objective functions. (a) Power grid; (b) Yeast; (c)
Facebook

11.5.1 Accuracy of Identifying Rumor Sources

We compare the performance of the proposed K-center method with two competing
methods: the dynamic age method [58] and the multi-rumor-center method [115].
To quantify the performance of each method, we first match the estimated sources
Ŝ = {ŝ1, . . . , ŝk} with the real sources S* = {s1, . . . , sk} so that the sum of the
error distances between each estimated source and its match is minimized [160].
The average error distance is then given by

Δ = (1 / |S*|) Σ_{i=1}^{|S*|} h(si, ŝi). (11.20)

We expect that our method can accurately capture the real sources or at least a set
of sources very close to the real sources (i.e., Δ is as small as possible).
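Computing this metric requires first matching estimated to real sources so that the total error is minimal; for the small source counts used here (|S*| ≤ 5) a brute-force search over permutations suffices. The hop distances in the sketch are hypothetical.

```python
from itertools import permutations

def average_error_distance(real, estimated, h):
    """Average error distance of Eq. (11.20): match estimated sources to
    real ones so the total hop distance is minimal, then average.
    `h[(s, s_hat)]` is the minimum hop count between the two nodes."""
    best = min(sum(h[(s, e)] for s, e in zip(real, perm))
               for perm in permutations(estimated))
    return best / len(real)

# Hypothetical hop distances between two real and two estimated sources.
h = {('s1', 'e1'): 1, ('s1', 'e2'): 4,
     ('s2', 'e1'): 3, ('s2', 'e2'): 2}
delta = average_error_distance(['s1', 's2'], ['e1', 'e2'], h)
```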
The average error distances for the three real network topologies are provided in
Table 11.2. From this table we can see that the proposed method outperforms the
other two methods, in that the estimated sources are closer to the real sources. For
a clearer comparison between our proposed method and the other two methods,
we show the histograms of the average error distances (Δ) in Figs. 11.5 and 11.6,
when |S ∗ | = 2 or 3, respectively. We can see that the proposed K-center method
outperforms the others. When |S*| = 2, the estimated sources are very close to
the real sources in the Power Grid, with average error distances of generally
1–2 hops. However, the average error distances are around 3–4 hops when using
the multi-rumor-center method, and around 3–5 hops when using the dynamic age
method. For the Yeast network, the diffusion sources estimated by the proposed
method are on average 2–3 hops away from the real sources. However, the

Table 11.2 Accuracy of multi-source identification


Experiment settings Average error distance Δ
Network |S ∗ | MRC Dynamic age K-center Infection percentage %
Power grid 2 3.135 3.610 1.750 96.290
3 4.246 4.726 2.670 83.237
4 5.331 6.027 3.240 78.322
5 6.388 7.117 3.418 72.903
Yeast 2 2.700 3.175 2.680 89.606
3 3.520 3.146 2.733 74.762
4 3.525 3.077 2.962 70.599
5 3.474 3.050 2.874 68.563
Facebook 2 3.433 3.950 3.215 81.776
3 4.667 4.763 4.073 76.654
4 5.120 5.762 4.137 69.762
5 5.832 6.701 4.290 63.723

Fig. 11.5 Histogram of the average error distances (Δ) in various networks when |S*| = 2. (a)
Power grid; (b) Yeast; (c) Facebook

Fig. 11.6 Histogram of the average error distances (Δ) in various networks when |S*| = 3. (a)
Power grid; (b) Yeast; (c) Facebook

sources estimated by using the multi-rumor-center method are on average 2–4 hops
away from the real sources, and on average 3–4 hops away when using the dynamic
age method. For the Facebook network, the proposed method can estimate the
diffusion sources with an average of 2–3 hops away from the real sources. However,
the estimated sources are on average 3–4 hops away from the real sources when
using the other two methods. Similarly, when |S*| = 3, the diffusion sources estimated
by the proposed method are much closer to the real sources in these real networks.
We have compared the performance of our method with two competing
methods. From the experiment results (Figs. 11.5 and 11.6, and Table 11.2), we
see that our proposed method is superior to previous work. Around 80% of all
experiment runs identify nodes on average 2–3 hops away from the real sources
when there are two diffusion sources. Moreover, when there are three diffusion
sources, around 80% of all experiment runs likewise identify nodes
on average 3 hops away from the real sources.

11.5.2 Estimation of Source Number and Spreading Time

In this subsection, we evaluate the performance of the proposed method in estimat-


ing the number of sources and predicting diffusion spreading time.
Table 11.3 shows the means and standard deviations of the estimated spreading time in
the three real networks when we vary the real spreading time T from 4 to 6 and the
number of sources |S ∗ | from 2 to 3. Notice that the means of the estimated time are
very close to the real spreading time under different experiment settings, and most
results of the standard deviations are smaller than 1. This indicates that our method
can estimate the real spreading time with high accuracy.
Figure 11.7 shows the results in estimating the number of diffusion sources
in different networks. We let the number of sources, |S ∗ |, range from 1 to 3. In
Fig. 11.7, the horizontal axis indicates the estimated number of sources and the

Table 11.3 Accuracy of spreading time estimation


Experiment settings Estimated spreading time
Network |S ∗ | T =4 T =5 T =6
Power grid 2 4.020 ± 0.910 5.050 ± 1.256 5.627 ± 1.212
3 4.085 ± 0.805 5.051 ± 0.934 6.006 ± 1.123
Yeast 2 4.600 ± 0.710 5.130 ± 0.469 5.494 ± 0.578
3 4.534 ± 0.427 5.050 ± 0.408 5.447 ± 0.396
Facebook 2 4.380 ± 1.170 5.246 ± 0.517 5.853 ± 0.645
3 4.417 ± 0.736 5.378 ± 0.645 5.738 ± 0.467

Fig. 11.7 Estimate of the number of sources. (a) Yeast; (b) Power grid; (c) Facebook

vertical axis indicates the percentage of experiment runs estimating the
corresponding number of sources. For the Yeast network, we see that 70% of experiment runs
can accurately estimate the number of sources when |S ∗ | = 1. More than 80% of
experiment runs can accurately estimate the number of sources when |S ∗ | = 2, and
around 60% when |S ∗ | = 3. For the Power Grid network, it can be seen that around
50% of the total experiment runs can accurately detect the number of sources when
|S ∗ | ranges from 1 to 3. The accuracy is about 68% on Facebook when |S ∗ | ranges
from 1 to 3.
The high accuracy in estimating both the spreading time and the number of
diffusion sources reflects the efficiency of our method from different angles.

11.5.3 Effectiveness Justification

We justify the effectiveness of the proposed K-center method from two different
aspects. Firstly, we examine the correlation between the objective function values
in Eq. (11.6) of the estimated sources and those of the real sources. If they
are highly correlated with each other, the objective function in Eq. (11.6) will
accurately describe the multi-source identification problem. Secondly, at each time
tick, we examine the average effective distances from the newly infected nodes to
their corresponding diffusion sources. The linear correlation between the average
effective distances and the spreading time will justify the effectiveness of using
effective distance in estimating multiple diffusion sources.

11.5.3.1 Correlation Between Real Sources and Estimated Sources

We investigate the correlation between the estimated sources and the real sources
by examining the correlation of their objective function values in Eq. (11.6). If the
estimated sources are exactly the real sources, their objective function values f
should present high correlations.
Figures 11.8 and 11.9 show the correlation results of the objective function
values when |S*| is 2 or 3, respectively. We can see that their objective function
values approximately form linear relationships. This means that the real sources and
the estimated sources are highly correlated with each other. The worst results occur
in Figs. 11.8a and 11.9a in the Power Grid network. However, the majority of the
correlation results in these two figures still tend to be clustered in a line. The strong
correlation between the real sources and estimated sources reflects the effectiveness
of the proposed method.

11.5.3.2 Average Effective Distance at Each Time Tick

We further investigate the correlation between the relative arrival time of nodes
getting infected and the average effective distance from them to their corresponding
sources. The experiment results in different networks when |S ∗ | is 2 or 3 are shown
in Figs. 11.10 and 11.11, respectively.
As shown in Figs. 11.10 and 11.11, the effective distances from the nodes
infected at each time tick to their corresponding sources are indicated by blue
circles, and the average effective distance at each time tick is indicated by a
red square. It can be seen that the average effective distance grows linearly
with the relative arrival time. This justifies that the proposed K-center
method is well-founded.
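This linearity check can be sketched directly in code (a minimal sketch with toy data; the helper names and the toy values are ours, not from the text):

```python
from collections import defaultdict

def avg_distance_per_tick(infect_time, eff_dist):
    """Group infected nodes by relative arrival tick and average their
    effective distances to the corresponding sources."""
    buckets = defaultdict(list)
    for node, t in infect_time.items():
        buckets[t].append(eff_dist[node])
    return {t: sum(ds) / len(ds) for t, ds in sorted(buckets.items())}

def pearson(x, y):
    """Correlation coefficient used to quantify the linearity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Toy data: effective distances grow roughly linearly with the arrival tick.
times = {'a': 1, 'b': 1, 'c': 2, 'd': 3}
dists = {'a': 0.9, 'b': 1.1, 'c': 2.0, 'd': 3.1}
avg = avg_distance_per_tick(times, dists)
r = pearson(list(avg.keys()), list(avg.values()))
```

A correlation coefficient r close to 1 over the per-tick averages corresponds to the clustered red squares in the figures.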

Fig. 11.8 The correlation between the objective function of the estimated sources and that of the
real sources when |S ∗ | = 2. (a) Power grid; (b) Yeast; (c) Facebook

Fig. 11.9 The correlation between the objective function of the estimated sources and that of the
real sources when |S ∗ | = 3. (a) Power grid; (b) Yeast; (c) Facebook

Fig. 11.10 The effective distances between the nodes infected at each time tick and their
corresponding sources when |S ∗ | = 2. (a) Power grid; (b) Yeast; (c) Facebook

Fig. 11.11 The effective distances between the nodes infected at each time tick and their
corresponding sources when |S ∗ | = 3. (a) Power grid; (b) Yeast; (c) Facebook

11.6 Summary

In this chapter, we studied the problem of identifying multiple rumor sources
in complex networks. Few current techniques can detect multiple sources in
complex networks. We used effective distance to transform the original network in
order to obtain a clearer view of the complex diffusion pattern. Based on the
altered network, we derived a succinct formulation of the problem of identifying
multiple rumor sources. We then proposed a novel method that can detect the
positions of the multiple rumor sources, estimate the number of sources, and predict
the spreading time of the diffusion. Experiment results on various real network
topologies show that the proposed method outperforms other competing
methods, which justifies its effectiveness and efficiency.
The identification of multiple rumor sources is a significant but difficult task.
In this chapter, we have adopted the SI model with the knowledge of which nodes
are infected and their connections. There are also some other models, such as the
SIS and SIR. These models may conceal the infection history of the nodes that have
been recovered. Therefore, the proposed method, which requires a complete
observation of the network, will not work under the SIS or SIR model. According to our study, we may need
other techniques, e.g., the network completion, to cope with the SIS and SIR models.
Future research includes the use of different models. Moreover, in the real world, we
may only obtain partial observations of a network. Thus, future work includes multi-
source identification with partial observations. We may also need to take community
structures into account to more accurately identify multiple diffusion sources.
Chapter 12
Identifying Propagation Source
in Large-Scale Networks

The global diffusion of epidemics, rumors and computer viruses causes great
damage to our society. It is critical to identify the diffusion sources and promptly
quarantine them. However, one critical issue with current methods is that they are
unsuitable for large-scale networks due to the high computational cost and
the complex spatiotemporal diffusion processes. In this chapter, we introduce a
community structure based approach to efficiently identify diffusion sources in large
networks.

12.1 Introduction

With rapid urbanization progress worldwide, the world is becoming increasingly
interconnected. This brings us great convenience in daily communication but also
enables rumors to diffuse all around the world. For example, infectious diseases,
such as H1N1 or Ebola, can spread geographically and affect the lives of tens of
thousands or even millions of people [62, 168]. Computer viruses, like Cryptolocker,
can quickly spread among the Internet and cause numerous cyber-security incidents
[17]. ZDNet reported that, by 22 Dec 2013, CryptoLocker had infected around
250,000 victims, demanding an average $300 payout, which has left a trail of
millions in laundered Bitcoin [19]. Rumors started by a few individuals can
spread incredibly fast in online social networks [42]. On 7 Sep 2014, the rumor of
clicking fake URL links promising nude photos of Hollywood celebrities caused
distributed denial of service (DDoS) attacks and sent the Internet into meltdown
in New Zealand [142]. Therefore, it is of great significance to identify diffusion
sources, so that we can save lives from disease, reduce cybercrime and diminish
rumor damages. The huge scale of the underlying networks and the complex
spatiotemporal rumor diffusion make it difficult to develop effective strategies to

© Springer Nature Switzerland AG 2019
J. Jiang et al., Malicious Attack Propagation and Source Identification,
Advances in Information Security 73, https://doi.org/10.1007/978-3-030-02179-5_12

quickly and accurately identify rumor sources and therefore eliminate the socio-
economic impact of dangerous rumors.
In the past few years, researchers have proposed a series of methods to identify
diffusion sources in networks. However, those methods either have high
computational complexity or have relatively low complexity but apply only to particularly
structured networks (e.g., trees and regular networks). For example, the initial
methods of rumor source identification are designed for tree networks, including
the rumor center method [161] and the Jordan center method [201]. Even in tree
networks, the problem of identifying diffusion sources is proven to be #P-complete
[161]. Later, the tree constraint was relaxed through heuristic strategies that
require complete or snapshot observations, including Bayesian inference [7, 146],
spectral techniques [58], and centralities methods [34]. Most of them are based
on scanning the whole network. However, real networks are far more complex
than tree networks and it is impractical to scan the whole network to locate the
diffusion source, especially for large networks. Recently, Pinto et al. [146] proposed
to identify rumor sources based on sensor observations. The proposed Gaussian
method chooses sensors randomly or sets up sensors on high-degree nodes. In fact,
the selection of sensors is crucial in identifying rumor sources since well chosen
sensors can reflect the spreading direction and speed of the diffusion. Seo et al.
[158] compared different strategies of choosing sensors, and concluded that high-
betweenness or high-degree sensors are more efficient in identifying rumor sources. They
proposed a Four-metric source estimator, which is also based on scanning the whole
network and views the diffusion source as the node that not only reaches the
infected sensors with the minimum sum of distances but is also the furthest away
from the non-infected sensors. In a nutshell, current methods are not suitable for
large-scale networks due to the expensive computational complexity and the large
scale of real networks. Readers could refer to [87] for a detailed survey in this area.
In this chapter, we propose a community structure based approach to identify
diffusion sources in large-scale networks. It not only addresses the scalability issue
in this area, but also shows significant advantages. Firstly, to effectively set up sparse
sensors, we detect the community structure of a network and choose the community
bridge nodes as sensors. According to the earliest infected bridge sensors, we can
easily determine the very first infected community where the diffusion started and
spread out to the rest of the network. Consequently, this narrows the suspicious
sources down to the very first infected community. Therefore, this overcomes
the scalability issue of current methods. According to a fundamental property of
communities that links inside are much denser than those connecting outside nodes,
bridge sensors will be very sparse. Secondly, to accurately locate the diffusion
source from the first infected community, we use the intrinsic property of the
diffusion source that the relative infection time of any node is linear with its effective
distance from the source. The effective distance between any pair of nodes is based
on not only the number of hops but also the propagation probabilities along the paths
between them [24]. It reflects the idea that a small propagation probability between
nodes is effectively equivalent to a large distance between them, and vice versa.
Finally, we use correlation coefficient to measure the degree of linear dependence

between the relative infection time and effective distances for each suspect, and
consider the one that has the largest correlation coefficient as the diffusion source.
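For concreteness, the effective-distance computation can be sketched as a Dijkstra search over transformed link lengths; we assume the common form in which a link with propagation probability p gets effective length 1 − log p, in the spirit of [24] (the graph representation below is ours):

```python
import heapq
from math import log

def effective_distance(graph, source):
    """Single-source shortest paths under effective link lengths,
    where a link with propagation probability p has length 1 - log(p).
    `graph` maps node -> {neighbor: probability}; returns node -> distance."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry
        for v, p in graph[u].items():
            nd = d + 1.0 - log(p)   # small p -> large effective length
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy chain: s --(p=1.0)-- a --(p=0.5)-- b
g = {'s': {'a': 1.0}, 'a': {'s': 1.0, 'b': 0.5}, 'b': {'a': 0.5}}
d = effective_distance(g, 's')
```

Note how the low-probability link a-b contributes more than one unit of effective length, capturing the equivalence between small propagation probability and large distance.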
The main contribution of this chapter is threefold.
• We address the scalability issue in source identification problems. Instead of
randomly choosing sensors or setting up high centrality nodes as sensors in
previous methods, we assign sensors on community bridges. According to the
infection time of bridge sensors, we can easily narrow the suspicious sources
down to the very first infected community.
• We propose a novel method which can efficiently locate diffusion sources from
the suspects. Here, we use the intrinsic property of the real diffusion source that
the effective distance to any node is linear with the relative infection time of
that node. The effective distance makes full use of the propagation probability
and the number of hops between node pairs, which dramatically enhances the
effectiveness of our method.
• We evaluate our method in two large networks collected from Twitter. The exper-
iment results show significant advantages of our method in identifying diffusion
sources in large networks. Especially, when the average size of communities
shrinks, the accuracy of our method increases dramatically.

12.2 Community Structure

In general, communities are groups of nodes sharing common properties or
corresponding to functional units within a networked system. Many networks of interest,
including social networks, computer networks, and transportation networks, are
found to divide naturally into communities, where the links inside are much denser
than those connecting this set and the rest of the network [72]. Recent research
results show that community structure can dramatically affect the behavior of
dynamical processes of complex networks [132].
Past work on methods for discovering communities in networks divides into two
principal lines of research, both with long histories. The first, generally called hard
partitioning, assumes that the communities of a complex network are disjoint: each
node is placed in only one community and no communities overlap. Algorithms
include division based on betweenness [72], information theory [152], modularity
optimization [132], and others [155]. However, many real networks are
characterized by well-defined statistics of overlapping communities [139]. For example,
in collaboration networks an author might work with researchers in many groups,
and in biological networks a protein might interact with many groups of proteins.
Algorithms in detecting overlapping communities include techniques based on
k-clique [139], Link Clustering [3], and some others [138]. Figure 12.1 shows
two examples to illustrate separated communities and overlapping communities. To
demonstrate the robustness of the results across different types of communities, we

Fig. 12.1 Illustration of network communities and community bridges. (a) Separated communi-
ties. Community bridges are the nodes associated with between-community edges, e.g., nodes A
and D connecting the blue community and the green community. (b) Overlapping communities.
Community bridges are not only the nodes associated with between-community edges but also the
nodes shared by different communities, e.g., nodes H , I and J shared by the green community and
the yellow community

will apply both separated and overlapping community detection methods on various
real networks: Infomap [153] for separated communities and Link Clustering [3]
for overlapping communities.
In this chapter, we will use community structures of networks to effectively
assign sensors. More specifically, we set sensors on community bridges. Community
bridges are nodes shared by two or more different communities or associated with
inter-community links (See Fig. 12.1). This is fundamentally different from previous
methods which choose high centrality nodes as sensors or even randomly set up
sensors.
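As an illustration, bridge nodes can be extracted from a node-to-communities assignment covering both cases (a sketch; the toy edge list and membership map below are ours, standing in for the output of a community detection algorithm):

```python
def community_bridges(edges, membership):
    """Return bridge nodes: endpoints of inter-community edges plus
    nodes assigned to more than one community (the overlapping case).
    `membership` maps node -> set of community labels."""
    bridges = set()
    for u, v in edges:
        if membership[u].isdisjoint(membership[v]):
            bridges.update((u, v))   # endpoints of a between-community edge
    for node, comms in membership.items():
        if len(comms) > 1:
            bridges.add(node)        # node shared by several communities
    return bridges

# Toy network: A and D sit on a between-community edge; H is shared.
edges = [('A', 'B'), ('A', 'D'), ('D', 'E')]
membership = {'A': {'blue'}, 'B': {'blue'}, 'D': {'green'},
              'E': {'green'}, 'H': {'green', 'yellow'}}
bridges = sorted(community_bridges(edges, membership))
```

The same routine therefore serves both a hard partition (every membership set has one element) and an overlapping cover.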

12.3 Community-Based Method

In this section, we first introduce an effective strategy to set up sensors. Then,
we derive an efficient method to detect sources according to sensors’ sparse
observations. Finally, we analyze the computational complexity of our method and
compare it with that of current methods.

12.3.1 Assigning Sensors

To identify diffusion sources under sensor observations, it is critical to assign
sensors properly. To effectively set up sensors in a network, we need to choose
the nodes that are most important in diffusion processes. From Sect. 12.2, we

see that community bridges play a crucial role in transmitting information from
one community to another. They can reflect the spreading direction and speed of
diffusion. Thus, we choose community bridges as sensors.
To assign sensors on community bridges, we first need to detect the community
structure of a network. According to Sect. 12.2, community structures can generally
be divided into two categories: separated communities and overlapping communi-
ties. For separated communities, community bridges are the nodes associated with
the inter-community edges. For example in Fig. 12.1a, the green community and
red community are connected by bridges E and F , and the bridges B, H , C and G
connect the blue community and the red community. For overlapping communities,
community bridges correspond not only to the nodes associated with the inter-
community edges, but also the nodes shared by different communities. For example
in Fig. 12.1b, the green community and the yellow community are connected by
shared bridges H , I and J , and bridges G, C and D connect the green community
and the purple community.
When we assign sensors on community bridges, we need to pay attention to the
number of sensors. The more sensors we set up, the more information we will collect
from them. However, in the real world, setting up more sensors will require more
money to buy equipment and more labor to maintain them. Generally, we can control
the number of bridges by regulating the average size of communities. The larger
the average size of communities, the smaller the number of community bridges,
and vice versa. Here are two extreme examples to explain this. (1) If we divide a
network into two communities, with one node as the first community and all
the remaining nodes as the second community, the number of bridge nodes will be
d + 1, where d is the degree of the node in the first community. (2) If we set every
single node as a community, the number of bridges will be the number of nodes in
the whole network. Furthermore, the number of bridges will be very small because
of the intrinsic property of communities that the links between communities are much
sparser than those within communities. In Sect. 12.4.2, we will analyze in detail the
influence of the average size of communities in detecting diffusion sources.
Compared with the existing sensor selection methods, which randomly choose
sensors or select high centrality nodes as sensors, the proposed community structure
based sensor selection method can additionally reflect the diffusion direction and
speed. We will compare different sensor-selection methods in various real networks
in Sect. 12.4.4.

12.3.2 Community Structure Based Approach

The proposed community structure based approach consists of two steps. In the first
step, we determine the very first infected communities. Given a diffusion process
running for some time in a network, we obtain sparse observations from the sensors
assigned by the scheme in the previous subsection. Assume there are k sensors
having been infected, denoted as O = {o1 , . . . , ok }, and {t1 , . . . , tk } represents the

time at which the infection arrives at these sensors. Then, according to the first
infected sensor(s), we can determine which community started the diffusion since
the diffusion has to go through community bridges to infect other communities. For
example in Fig. 12.1a, if sensors {H, F, E, B} are observed as infected and node
H is the first infected one, we can determine that the diffusion started from the red
community. In Fig. 12.1b, if sensors {K, F, H, J, G} are observed infected and node
K is the first infected one, we can determine that the diffusion could have started
from the blue community or the yellow community. We denote the set of nodes in
the first infected communities as

U = {u1 , u2 , . . . , um }. (12.1)

Since we do not have an absolute time reference, we have knowledge only about
the relative infection time. Choosing an arbitrary infected sensor, say o1 , as the
reference node, we can obtain the relative infection time of all the infected sensors
as in

τ = {0, t2 − t1 , . . . , tk − t1 }. (12.2)

In the second step, we investigate each suspect in the set U and identify the real
diffusion source. According to the properties of effective distance, we know that
the relative infection time of any infected node is linear with its effective distance
from the real diffusion source. Therefore, to identify the diffusion source, we aim to
find the suspect with the best linear correlation between sensors’ relative infection
time and their effective distances from this suspect. Here, we use the correlation
coefficient, which is widely used as a measure of the degree of linear dependence
between two variables [104]. The correlation coefficient between two vectors x =
{x1 , x2 , . . . , xn } and y = {y1 , y2 , . . . , yn } is defined as,
$$e = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}, \qquad (12.3)$$

where x̄ and ȳ are the means of xi and yi , respectively. The correlation coefficient
ranges from -1 to 1. A value of 1 implies that a linear equation describes the
relationship between x and y perfectly, with all data points lying on a line for which
y increases as x increases. A value of -1 implies that all data points lie on a line
for which y decreases as x increases. A value of 0 implies that there is no linear
correlation between the variables. Therefore, we need to find a suspect with the
maximum correlation coefficient. More precisely, we aim to find a suspect in U
to maximize Eq. (12.3) in terms of the relative infection time of sensors and their
effective distance from the suspect. The detailed process of the proposed approach
is given in Algorithm 12.1.
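Eq. (12.3) transcribes directly into code (a sketch; the function name is ours):

```python
def correlation(x, y):
    """Pearson correlation coefficient of two equal-length vectors, Eq. (12.3)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = (sum((a - xbar) ** 2 for a in x) ** 0.5 *
           sum((b - ybar) ** 2 for b in y) ** 0.5)
    return num / den

# A perfectly increasing linear relation gives 1, a decreasing one gives -1.
e_up = correlation([1, 2, 3, 4], [10, 20, 30, 40])
e_down = correlation([1, 2, 3, 4], [40, 30, 20, 10])
```

In the source-identification setting, x plays the role of the sensors' relative infection times and y the effective distances from a candidate source.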
Compared with current methods of identifying diffusion sources, the proposed
approach is superior, as many of the existing methods ignore the propagation

Algorithm 12.1: Community structure based approach


Assigning Sensors: Detect community structure of a given network, and then set the
community bridges as sensors.
Input: A set of infected sensors O = {o1 , o2 , . . . , ok }, and their infection times
T = {t1 , t2 , . . . , tk }.
Initialization: The optimal diffusion source s ∗ = ∅, and the optimal correlation coefficient
e∗ = −∞.
Step 1: Find the earliest infected sensor. Without loss of generality, we assume o1 is the first
infected sensor. Choose o1 as the reference, and calculate the relative infection time of all
the infected sensors, denoted as

τ = {0, t2 − t1 , t3 − t1 , . . . , tk − t1 }.

Find the communities that contain sensor o1 , and combine the nodes in these communities,
denoted as

U = {u1 , u2 , . . . , um }.

Step 2: Calculate the correlation coefficient for each node in U and find the one which has
the largest correlation coefficient as follows.
for (each ui in U ) do
Compute the effective distance between ui and any infected sensor oj , denoted as

γ = [D(ui , o1 ), D(ui , o2 ), . . . , D(ui , ok )]. (12.4)

Compute the correlation coefficient between τ and γ ,


$$e = \frac{\sum_{j=1}^{k}(\tau_j-\bar{\tau})(\gamma_j-\bar{\gamma})}{\sqrt{\sum_{j=1}^{k}(\tau_j-\bar{\tau})^2}\,\sqrt{\sum_{j=1}^{k}(\gamma_j-\bar{\gamma})^2}}, \qquad (12.5)$$

where τ̄ is the mean of τ , and γ̄ is the mean of γ .


if (e > e∗ ) then
Set e∗ = e, and s ∗ = ui .
Output: The estimated optimal diffusion source s ∗ .

probabilities [87]. The proposed method utilizes the effective distance between
nodes which precisely reflects not only the propagation probability but also the
number of hops between nodes. This makes our algorithm more accurate and
effective. The comparison of our method with many competing methods is shown
in Sect. 12.4.4.
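The two steps of Algorithm 12.1 can be sketched as follows (hypothetical names throughout; the `eff_dist` callback stands in for the effective-distance computation of Eq. (12.4)):

```python
def pearson(x, y):
    """Correlation coefficient of Eq. (12.5)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def locate_source(suspects, times, eff_dist):
    """Community structure based estimator (sketch of Algorithm 12.1).
    suspects: nodes of the first infected community (the set U)
    times:    infection times [t1, ..., tk] of the infected sensors
    eff_dist: eff_dist(u) -> effective distances from u to those sensors"""
    t0 = min(times)
    tau = [t - t0 for t in times]        # relative infection times, Eq. (12.2)
    best, best_e = None, float('-inf')
    for u in suspects:
        e = pearson(tau, eff_dist(u))    # linearity score of suspect u
        if e > best_e:
            best, best_e = u, e
    return best

# Toy run: suspect 'u1' reproduces the sensors' arrival order perfectly.
dists = {'u1': [1.0, 2.0, 3.0], 'u2': [3.0, 1.0, 2.0]}
s = locate_source(['u1', 'u2'], [0, 1, 2], lambda u: dists[u])
```

Since the Pearson coefficient is invariant to a shift of the time axis, any infected sensor can serve as the time reference without changing the estimate.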

12.3.3 Computational Complexity

In this subsection, we analyze the computational complexity of the proposed method
and compare it with other existing methods of identifying diffusion sources based on
sensor observations, including the Gaussian method [146], the Monte Carlo method
[2], and the Four-metric method [158].

From Algorithm 12.1, we see that the computation of our method is dominated
by Step 2 of calculating the correlation coefficient e for each suspect ui in the
very first infected community U . More specifically, the majority of computation
is in the calculation of effective distance between ui and any infected sensor oj
(∈ {o1 , o2 , . . . , ok }). Here, we use Dijkstra’s algorithm [63] to compute the shortest
paths (i.e., the effective distances) to all infected sensors from each ui . Dijkstra’s
algorithm requires O(M + N logN) computations to find the shortest paths from
one node to every other node in a network, where M is the number of edges and N
is the number of nodes in the network. However, in Algorithm 12.1, we only need to
calculate the effective distance between each suspect ui and any infected sensor oj ,
i.e., [D(ui , o1 ), D(ui , o2 ), . . . , D(ui , ok )] in Eq. (12.4). Therefore, the complexity
will be far less than O(M + N logN). Suppose the average size of communities in
the network is m and the number of infected sensors is k. Then, the computational
complexity of the proposed method is far less than O(L(M + N logN )), where
L = min{k, m}. Thus, if the average community size is smaller, it requires less time
to identify the diffusion source.
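Since only the distances from a suspect to the k infected sensors are needed, Dijkstra's search can terminate as soon as every sensor has been settled; this is where the saving over a full single-source sweep comes from. A sketch (the 1 − log p link length is an assumption carried over from the effective-distance definition):

```python
import heapq
from math import log

def distances_to_sensors(graph, source, sensors):
    """Dijkstra from `source`, stopping once all sensors are settled.
    `graph` maps node -> {neighbor: propagation probability}."""
    targets = set(sensors)
    dist = {source: 0.0}
    done = {}
    heap = [(0.0, source)]
    while heap and targets:             # early exit when all sensors reached
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                     # stale heap entry
        if u in targets:
            done[u] = d
            targets.discard(u)
        for v, p in graph[u].items():
            nd = d + 1.0 - log(p)
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return [done[o] for o in sensors]

# Toy path s - a - b with p = 1 on every link, sensors at a and b.
g = {'s': {'a': 1.0}, 'a': {'s': 1.0, 'b': 1.0}, 'b': {'a': 1.0}}
d_ab = distances_to_sensors(g, 's', ['a', 'b'])
```

When the sensors lie inside or near the first infected community, the search settles them long before the whole network is scanned.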
Current methods are far more complex than the proposed method. They need to
scan the whole network, and calculate the shortest path from each sensor to any
other node. For example, the computational complexity of the Gaussian method
[146] is O(N 3 ) since it requires constructing the BFS tree rooted at each node in
a network, and it also needs to calculate the inverse of the covariance matrix for
each BFS tree. The computational complexity of the Monte Carlo method [2] is
O(k(M + N logN )/ε²), where k is the number of infected sensors. The majority
of the computation is in calculating the shortest paths from an arbitrary node i to
all the sensors in order to sample the infection time of all sensors assuming that
node i is the diffusion source. By the central limit theorem, O(1/ε²) samples are
needed to achieve an error of o(ε). For the Four-metric method [158], the majority of
the computation is also in computing the lengths of the shortest paths from each
node to all the sensors, both infected and non-infected. Thus, the computational
complexity of this method is O(n(M + N logN )), where n is the number of
sensors.
Compared with these existing methods of identifying diffusion sources based
on sensor observations, we see that the computational complexity of the proposed
method is much less than that of current methods. Furthermore, the proposed
method takes advantages of the relative infection time of sensors, the propagation
probabilities and the number of hops between nodes. However, the existing methods
either require the generation of the infection time of each sensor (e.g., the Gaussian
and Monte Carlo methods) or ignore the propagation probabilities (e.g., the Four-
metric method) between nodes. Thus, the proposed method is superior and is able
to work in large networks.

12.4 Evaluation

The proposed community structure based approach is evaluated in two real-world
large networks, the Retweet network and the Mention network collected
from Twitter, which were also used in the work of [188]. These two networks
were constructed from the tweets collected by using the Twitter streaming API
between Mar 24 and Apr 25, 2012. The basic statistics of these two networks are
listed in Table 12.1. In these two networks, only reciprocal communications are
kept as network edges, as bi-directional communications reflect more stable and
reliable social connections. Figure 12.2 shows the degree distribution of these two
networks. We can see that the node degrees of these two networks follow a power-
law distribution. The number of contacts (retweets or mentions) between any two
neighbors is set as the weight of the edge between them. Based on the number of
contacts between any two neighbors, we also generate the propagation probability
between them, as in
 
$$p_{ij} = \min\left(1, \frac{z}{2\mu}\right), \qquad (12.6)$$

where z is the number of contacts between nodes i and j , and μ is the median of
contacts between neighbors.
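Eq. (12.6) can be computed directly from the contact counts (a sketch; the toy counts are ours, and `statistics.median` supplies μ):

```python
from statistics import median

def propagation_probabilities(contacts):
    """Edge propagation probabilities per Eq. (12.6): p = min(1, z / (2*mu)),
    where z is the contact count of the edge and mu the median contact count."""
    mu = median(contacts.values())
    return {edge: min(1.0, z / (2 * mu)) for edge, z in contacts.items()}

# Toy contact counts (e.g., retweets) between neighbouring users.
contacts = {('i', 'j'): 1, ('j', 'k'): 2, ('k', 'l'): 8}
p = propagation_probabilities(contacts)
```

With a median contact count of 2, the heavily used edge ('k', 'l') saturates at probability 1, while the rarely used edge ('i', 'j') gets a small probability.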
To demonstrate the robustness of the proposed method across different types
of community structure, we apply separated (Infomap [153]) and overlapping
(Link Clustering [139]) community detection methods on these two networks. The
Infomap method shows communities of a network in a hierarchical structure from
which we can choose different levels of communities. In each level, the number
of communities will be different. The deeper the level is, the more communities
we will obtain and the smaller the average size of each community will be. In our
experiments, we typically choose the second, third and fourth-level communities,
denoted by β = 2, 3 and 4. On the other hand, we can adjust the parameter α in the
Link Clustering method to regulate the number of communities of a network. The
larger α is, the more communities we obtain, and similar to the previous method, the
smaller the average size of the communities will be. We typically set α = 0.10, 0.15
and 0.20 in our experiments. In each experiment, we randomly choose a diffusion
source in each of 100 runs. The number of 100 runs follows the discussion in
the previous work of [208]. The implementation is conducted in C++.

Table 12.1 Statistics of two large networks in experiments

Dataset           Mention     Retweet
# nodes           300,197     374,829
# edges           1,048,818   598,487
Average degree    3.49        1.60
Maximum degree    124         178

Fig. 12.2 Degree distribution of the two large networks. (a) The mention network; (b) The retweet
network

Fig. 12.3 The accuracy of the proposed method in identifying diffusion sources. (a) and (c) show
the accuracy of our method in the mention network and the retweet network having overlapping-
community structure with parameter α ∈ {0.10, 0.15, 0.20}. (b) and (d) show the accuracy of our
method in these networks having separated-community structure with parameter β ∈ {2, 3, 4}

12.4.1 Identifying Diffusion Sources in Large Networks

We show the accuracy of the proposed method in identifying diffusion sources in
this subsection. We use δ to denote the error distance (i.e., the number of hops)
between a real diffusion source and an estimated source. Ideally, we have δ = 0
if our method accurately captures the real source. In practice, we expect that our
method can accurately capture the real source or a node very close to the real source.
Figure 12.3 shows the accuracy of our method in the Mention network and the
Retweet network associated with overlapping-community structure and separated-
community structure. Overall, we see that the proposed method performs very well
in these two large networks with the majority of the experiments able to precisely
identify the diffusion sources. Especially with a large α or β, the proposed method
performs better in identifying diffusion sources, as the number of communities
becomes larger, and the average size of communities becomes smaller. Figure 12.3a
shows the experiment results in the Mention network with overlapping-community

structure. When α is 0.10, around 48% of the experiment runs can accurately
identify the diffusion sources. When α increases to 0.15 (equivalently, the average
size of communities becomes small), the accuracy of our method increases to
about 57%. When α increases to 0.20, more than 83% of the experiment runs can
accurately identify the real sources. Figure 12.3b shows the experiment results in
the Mention network with separated-community structure. Similar to the results in
the overlapping-community structure, when β is 2, around 52% of the experiment
runs can precisely identify the real diffusion sources. When β increases to 3 (i.e., the
average size of communities becomes small), the accuracy of our method increases
to around 70%. When β increases to 4, our method achieves an accuracy of around
98%, which means only a few runs could not identify the real sources. Similar
results can be found in the Retweet network in Figs. 12.3c, d.
Furthermore, we notice that the average distance between the estimated sources
and the real sources is very small. For both networks, from Fig. 12.3 we see that the
average error distance is within 1–2 hops. That is to say, even when the proposed
method does not accurately identify the real source, it is on average within a radius
of 1–2 hops from the estimated source. In addition, from Fig. 12.3 we see that
the maximum error distance is also very small (on average 5 hops). Compared
with the existing methods, which have low accuracy and expensive computational
complexity [87], the proposed method shows significantly higher performance in
identifying diffusion sources in large networks.
To summarize, our method performs very well in large networks associated with
either overlapping or separated community structures. Especially, when a network is
associated with a small average community size, our method can accurately identify
diffusion sources.

12.4.2 Influence of the Average Community Size

From the previous subsection, we notice that the accuracy of the proposed method
increases when the parameter α or β becomes large. Equivalently, the performance
of the proposed method improves when the average size of communities becomes
small. In order to analyze the influence of the average size of communities in
the accuracy of our method, we investigate the number of communities, bridges
and suspects when we change the parameters in the separated-community and
overlapping-community detection methods. More specifically, we let the parameter
β range from 2 to 4 for the Infomap method of detecting separated community
structure, and we let the parameter α range from 0.10 to 0.20 for the Link Clustering
method of detecting overlapping community structure.
The distribution of the community sizes of the previous two networks under
different parameter settings is shown in Fig. 12.4. Overall, we can see that the
community sizes follow a power-law distribution, i.e., a few communities are
significantly larger while the majority are small.
Furthermore, the number of communities decreases when the parameter α or β

Fig. 12.4 Community size distribution under different parameter settings. (a) and (b) show the
community size distribution in the mention network and the retweet network having separated
community structure with β ∈ {2, 3, 4}; (c) and (d) show the community size distribution
in the mention network and the retweet network having overlapping community structure with
α ∈ {0.10, 0.15, 0.20}

Table 12.2 Statistics of network communities and accuracy of our method

                           Infomap                       Link clustering
Experiment settings        β = 2    β = 3    β = 4       α = 0.10   α = 0.15   α = 0.20
Retweet  # communities     852      2684     21,300      2332       7166       34,559
         # bridges         8,422    13,078   36,558      7470       12,281     76,537
         # suspects        2158     925      153         5256       1380       157
         Error distance    1.77     1.36     0.48        1.83       1.18       0.96
Mention  # communities     588      5001     32,355      2754       7812       21,187
         # bridges         10,165   18,974   57,335      8306       14,072     81,287
         # suspects        3525     1169     318         6132       1247       297
         Error distance    1.37     0.81     0.50        1.26       0.69       0.58

becomes smaller (compare the density of blue and green dots in Fig. 12.4). The
detailed statistics of the community structures of these two networks derived by
setting different parameters are shown in Table 12.2. For the Retweet network, when
β = 2, there are 852 communities, 8422 bridge nodes, an average of 2158 suspects,
and the average error distance between the estimated sources and the real sources
is 1.77. When β increases to 3, the average error distance decreases to 1.36 and the
number of suspects shrinks to 925, while the number of communities and bridges
increases. When β increase to 4, the average error distance decreases to 0.48 and
the number of suspects shrinks to 153, while the number of communities bridges
becomes larger. We notice that when the parameter β becomes large, the number
of communities rises, which leads to a decrease in the average size of communities.
Consequently, more bridges are needed to connect communities. We then can obtain
more information from bridge sensors. Thus, we see that the average error distance
between the real sources and the estimated sources becomes smaller. Similar results
can be found in both networks with overlapping-community structure. When the
parameter α increases, the number of bridges increases, and therefore, the average
error distance decreases.

Fig. 12.5 The influence of the ratio of infected sensors on the accuracy of our method in the two
real networks

In the real world, setting up and maintaining sensors costs considerable money and
effort. Hence, we need to choose as few sensors as possible and start identifying
diffusion sources when only some sensors are infected. Here, we select a
moderate-size set of sensors and then analyze the accuracy of our method when
only a small ratio of sensors are infected (see Fig. 12.5). More specifically, we
choose β = 2 for the Infomap method and α = 0.10 for the Link Clustering
method. Figure 12.5 shows the average error distance between the real sources and
the estimated sources when the ratio of infected sensors ranges from 10% to 100%.
We see that when more than 30% of sensors are infected, our method can identify a
node on average less than 2 hops away from the real source. When more than
50% of sensors are infected, the average error distance between the real source
and the estimated source is approximately 1. Therefore, the proposed method can
identify diffusion sources with high accuracy even if only a small ratio of sensors
are infected.
From Figs. 12.4, 12.5, and Table 12.2, we see that the performance of the
proposed method improves when the average community size becomes smaller.
Even if the average community size is large and only a small ratio of sensors are
infected, our method can still accurately identify the real diffusion source or a node
very close to the real diffusion source.

12.4.3 Effectiveness Justification

In the second step of the proposed method, we utilize the linear correlation between
the relative infection time of any sensor and its effective distance from the diffusion
source. The suspect with the highest correlation coefficient is considered as the
diffusion source. In order to justify the effectiveness of the proposed method, we
examine the relationship between the relative infection time of any infected node
and its effective distance from the diffusion source, especially when the diffusion
starts from sources of different degrees.
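The scoring step can be sketched as follows: given each sensor's relative infection time and, for each suspect, the sensors' effective distances from that suspect, we compute the Pearson correlation coefficient and pick the suspect with the highest value. The arrays below are toy data for illustration only, not effective distances computed via Eq. (11.5).

```python
# Sketch of the suspect-scoring step: the suspect whose effective-distance
# profile correlates most linearly with the sensors' relative infection
# times is chosen as the estimated source.
import numpy as np

def rank_suspects(infection_times, eff_dist):
    """infection_times: (n_sensors,) relative infection times.
    eff_dist: (n_suspects, n_sensors) effective distance from each
    suspect to each sensor. Returns (best suspect index, scores)."""
    t = infection_times - infection_times.min()
    scores = []
    for d in eff_dist:
        # Pearson correlation between infection time and distance
        r = np.corrcoef(t, d)[0, 1]
        scores.append(r)
    return int(np.argmax(scores)), scores

times = np.array([1.0, 2.0, 3.0, 4.0])
dists = np.array([
    [1.1, 2.0, 2.9, 4.2],   # suspect 0: nearly linear in time
    [3.0, 1.0, 4.0, 2.0],   # suspect 1: uncorrelated
])
best, scores = rank_suspects(times, dists)
print(best)   # suspect 0 has the strongest linear relation
```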
In the previous two networks, we let the diffusion start from a small, moderate
and large degree source respectively, and compare the correlation coefficient of the

Fig. 12.6 Justification of our method on the mention network. (a) Linear correlation between the
relative infection time of sensors and their average effective distance from the diffusion source.
Specifically, we let the diffusion start from sources with different degrees: small degree, moderate
degree and large degree. (b), (c) and (d) show the correlation coefficient value for each suspect

Fig. 12.7 Justification of our method on the retweet network. (a) Linear correlation between the
relative infection time of sensors and their average effective distance from the diffusion source.
(b), (c) and (d) show the correlation coefficient value for each suspect

real source and that of all the suspects. Figure 12.6 shows the experiment results on
the Mention network. From Fig. 12.6a, we can see that, with the diffusion starting
from sources of different degrees, the relative infection time of infected nodes is
linear with their average effective distance from the diffusion source. We notice that
when the diffusion starts from a large-degree source, the scatter plot begins to curve
at time tick 15. According to our investigation, almost all of the nodes
have been infected by time tick 15. In the remaining time, only the nodes
that escaped infection before time tick 15 can become infected. However, their
minimum number of hops from the diffusion source is fixed. Therefore, according to
Eq. (11.5), their effective distance from the diffusion source will be relatively short.
In Fig. 12.6b, c, and d, we show the correlation coefficient of all suspects. It can
be seen that the real sources (see the red dots) have a high correlation coefficient
whenever the diffusion starts from a source of small, moderate, or large degree.
Figure 12.7 shows the experiment results on the Retweet network. Similar to the

results on the Mention network, the relative infection time of any infected node is
linearly related to its effective distance from diffusion sources of different degrees. When
the diffusion starts from a large-degree source, the relation starts to curve towards
the end. This is because almost all of the nodes have been infected by time tick 13.
Figure 12.7b, c, and d show the correlation coefficients of all suspects. We can see
that the real sources all have high correlation coefficients.
To summarize, the linear correlation between relative infection time of nodes and
their average effective distance from diffusion source justifies the effectiveness of
our proposed method.

12.4.4 Comparison with Current Methods

In this section, we compare the proposed community-structure-based approach with
three competing methods that identify diffusion sources in networks based on
sensor observations. They include:
• The Gaussian method [146],
• The Monte Carlo method [2], and
• The Four-metric method [158].
According to Sect. 12.1, these methods are all susceptible to the scalability issue
because they need to scan every node in a network, which leads to very high
computational complexity (see Sect. 12.3.3). In particular, the Gaussian method,
which is designed for tree networks, needs to construct the BFS tree rooted at
each node in a general network and the inverse of the covariance matrix for each
BFS tree. Therefore, these methods are too computationally expensive to be applied
in large networks. In addition, among these methods only the Four-metric method
investigated and compared different sensor selection methods. Both the Gaussian
method and the Monte Carlo method set up sensors on high degree nodes or even
randomly choose nodes as sensors.
In the following, we first choose four relatively small networks to compare
the performance of the proposed method to that of the three methods. Then, we
introduce two well studied methods to select sensors for the three methods. Finally,
we present the detailed comparison results.

12.4.4.1 Four Relatively Small Networks

In order to compare with the three competing methods, we choose four relatively
small networks:
• The Western U.S. Power Grid network [180],
• The Yeast protein interaction (PPI) network [82],
• Mention: the network of political mentions between Twitter users [35], and
• Retweet: the network of political retweets between Twitter users [35].

Table 12.3 Statistics of the four relatively small networks in the experiments

Dataset                   Mention  Retweet  Power grid  Yeast
# nodes                   7175     18,470   4941        2361
# edges                   28,473   121,043  13,188      13,554
Average degree            3.97     6.55     2.67        5.74
Maximum degree            425      1017     19          64
# communities (β = 2)     636      1462     37          2271
# bridges (β = 2)         2340     5207     228         798
# communities (α = 0.10)  1343     2528     322         297
# bridges (α = 0.10)      2245     4966     689         513

Fig. 12.8 Degree distribution of the four networks. (a) Political mention; (b) Political retweet; (c)
Power grid; (d) Yeast PPI network

The political communication dataset describes two networks of political communication
between users of the Twitter social media platform (mention and retweet)
in the six weeks prior to the 2010 U.S. Congressional midterm elections. We denote
the network of political retweets as Political Retweet, and the network of political
mentions as Political Mention. Statistics of these networks are given in Table 12.3.

12.4.4.2 Sensor-Selection Methods

Researchers in [158] investigated various strategies to select sensors.
They conclude that high-degree or high-betweenness sensors are more efficient in
identifying rumor sources than sensors selected by other strategies. Therefore, we utilize these
two strategies to set up sensors for the three competing methods.
• High-degree sensors [34]: we sort the nodes according to their degree and choose
the high-degree nodes as sensors.
• High-betweenness sensors [132]: we sort the nodes according to their betweenness
centrality value, and choose the high-betweenness nodes as sensors.
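As a sketch, assuming networkx and a stand-in scale-free graph, the two selection strategies amount to sorting nodes by degree or by betweenness centrality and taking the top k; the function names and graph here are illustrative, not from the cited works.

```python
# Sketch of the two baseline sensor-selection strategies:
# top-k nodes by degree, and top-k nodes by betweenness centrality.
import networkx as nx

def high_degree_sensors(G, k):
    # G.degree(n) returns the degree of node n
    return sorted(G.nodes(), key=G.degree, reverse=True)[:k]

def high_betweenness_sensors(G, k):
    bc = nx.betweenness_centrality(G)
    return sorted(bc, key=bc.get, reverse=True)[:k]

G = nx.barabasi_albert_graph(200, 2, seed=1)  # scale-free stand-in
print(high_degree_sensors(G, 5))
print(high_betweenness_sensors(G, 5))
```

On scale-free graphs the two lists overlap heavily, which is consistent with the degree-betweenness correlation discussed later in this section.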
Figure 12.8 shows the degree distributions of the four networks. We can see that
the two political communication networks show power-law degree distributions, and
the Power Grid and the Yeast PPI network tend to be exponential.

Fig. 12.9 Betweenness distribution of the four networks. (a) Political mention; (b) Political
retweet; (c) Power grid; (d) Yeast PPI network

The betweenness distributions of these networks are shown in Fig. 12.9. Correspondingly, the
betweenness distributions also show the scale-free or exponential phenomenon of
these networks.
We use the above high degree or betweenness strategies to set up sensors for the
existing methods. We let the number of sensors account for no more than 50% of the
total number of nodes in each network. For the proposed community structure based
method, in order to select fewer sensors, we typically set α = 0.10 for the Link
Clustering method in detecting overlapping community structures, and set β = 2
for the Infomap method in detecting separated community structures. The number of
communities and bridges of these four networks under different experiment settings
are shown in Table 12.3. We see that the number of communities is very small and
the number of sensors accounts for less than 30% of the number of nodes in each
network.

12.4.4.3 Comparison Results

In the experiments, the diffusion probability is chosen uniformly from (0, 1), and the
diffusion process propagates t time steps where t is uniformly chosen from [8, 10].
We use detection rate to measure the accuracy of identifying diffusion sources. The
detection rate is defined as the fraction of experiments that accurately identify the
real diffusion sources. The higher the detection rate, the better the performance.
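A minimal sketch of this protocol, assuming an SI-style spread on a toy graph and a placeholder estimator (the chapter's community-based method, or any of the baselines, would take the estimator's place):

```python
# Sketch of the experimental protocol: run an SI-style diffusion with a
# uniformly drawn infection probability for t in [8, 10] steps, estimate
# a source, and report the detection rate over many runs.
import random
import networkx as nx

def simulate_si(G, source, steps, p):
    infected = {source}
    for _ in range(steps):
        new = set()
        for u in infected:
            for v in G.neighbors(u):
                if v not in infected and random.random() < p:
                    new.add(v)
        infected |= new
    return infected

def detection_rate(G, estimator, runs=100):
    hits = 0
    for _ in range(runs):
        src = random.choice(list(G.nodes()))
        p = random.random()          # diffusion probability in (0, 1)
        t = random.randint(8, 10)    # propagation time in [8, 10]
        infected = simulate_si(G, src, t, p)
        if estimator(G, infected) == src:
            hits += 1
    return hits / runs

# Placeholder estimator; a real source-identification method goes here.
def highest_degree_infected(G, infected):
    return max(infected, key=G.degree)

G = nx.karate_club_graph()
print(detection_rate(G, highest_degree_infected, runs=50))
```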
We first compare the proposed method to the existing methods associated with
high degree sensors. More specifically, we utilized the Infomap method [153] to
detect separated community structure of each network in this group of experiments.
The experiment results are shown in Fig. 12.10. We can see that the detection rate
of the proposed method is higher than that of the existing methods in each of the
networks. For the two political networks (see Fig. 12.10a, b), 30% of experiment
runs accurately identify the diffusion sources, and more than 90% of experiment
runs identify a node within 2 hops of the real source. In nearly 100% of runs,
the real source is within 3 hops of the estimated source. Furthermore, the
average error distance from the real sources to the estimated sources is very small.
However, for the existing methods, only a few experiment runs can accurately identify

Fig. 12.10 Comparison of the proposed method with other methods in the accuracy of identifying
diffusion sources when setting sensors at high-degree nodes in four moderate-scale networks. (a)
Political mention; (b) Political retweet; (c) Power grid; (d) Yeast

Fig. 12.11 Comparison of the proposed method with other methods in the accuracy of identifying
diffusion sources when setting sensors at high-betweenness nodes in four moderate-scale networks.
(a) Political mention; (b) Political retweet; (c) Power grid; (d) Yeast

the real diffusion sources. Similar results can be found in the Yeast PPI network (see
Fig. 12.10d). More than 90% of the experiment runs can accurately identify the real
sources by using the proposed method, while few experiment runs can accurately
identify the real sources by using the existing methods. The average error distance
is much larger compared with that of the proposed method. The proposed method
also outperforms the existing methods in the Power Grid network (see Fig. 12.10c).
We then compare the proposed method to the existing methods associated with
high-betweenness sensors. In this group of comparisons, we utilized the Link
Clustering method [140] in detecting the overlapping community structure of each
network. The experiment results on the four networks are shown in Fig. 12.11.
Similar to the results in Fig. 12.10, the detection rate of the proposed method is
higher than that of the existing methods in each of the four networks. For the
two political networks, more than 50% of experiment runs accurately identified
the real diffusion sources in the political Mention network, and more than 70%
of experiment runs accurately identified the real diffusion sources in the political
Retweet network. However, for the existing methods, few of the experiment runs
accurately identified the diffusion sources. Furthermore, the average error distance
is larger compared with that of the proposed method. Similar results can be found in
the Power Grid network and the Yeast PPI network. By using the proposed method,
more than 45% of experiment runs accurately identified the diffusion sources in the
Power Grid network, and more than 60% for the Yeast PPI network. However, for

the existing methods, few of the experiment runs accurately identified the diffusion
sources. Furthermore, the average error distance between the estimated sources and the
real sources is larger compared with that of the proposed method.
From Figs. 12.10 and 12.11, we see that the existing methods show different
performances in the two different sensor selection methods. For example in
Fig. 12.10d, the Monte Carlo method outperforms the Four-metric method with high
degree sensor selection method. However, in Fig. 12.11d, the Four-metric method
outperforms the Monte Carlo method with the high betweenness sensor selection
method, and the Gaussian method shows similar performances. In order to see the
impact of using different sensor selection methods on the existing methods, we show
in Fig. 12.12 the correlation between nodes’ degree and their average betweenness
of the four networks. As we can see, nodes with high degree tend to have
high betweenness. However, this is not always the case, as there are also some
high-degree nodes with low betweenness, especially in the political Retweet network and the Power
Grid network. This explains why the existing methods show different performances
in different sensor selection methods in Figs. 12.10 and 12.11.
Figure 12.13 shows the linear relation between relative infection time of nodes
and their average effective distance from the diffusion sources in the four relatively

Fig. 12.12 The relationship between degree and the average betweenness at each degree of the
four networks. (a) Political mention; (b) Political retweet; (c) Power grid; (d) Yeast PPI network

Fig. 12.13 Linear correlation between relative infection time and average effective distance for
the four relatively small networks

small networks. We can see that relative infection time is linear with the average
effective distance in all these networks. Similar to the results in Figs. 12.6a
and 12.7a, the scatter plot curves towards the end because almost all of the nodes
have been infected by then. The linear correlation in these networks again justifies
the effectiveness of the proposed method.
To summarize, we see that the proposed community structure based method
outperforms the existing methods in identifying diffusion sources based on sen-
sor observations in various networks. The majority of the experiment runs can
accurately identify the real diffusion source or a node that is close to the real
source. However, the existing methods show low performance, and the average error
distance between the estimated sources and the real diffusion sources is very large.

12.5 Summary

In this chapter, we introduced an efficient method to identify diffusion sources
based on community structures in large-scale networks. To address the scalability
issue in the source identification problems, we first detect community structure
of the network and find bridges, which we assign as sensors. According to the
infection time of the sensors, we can easily determine from which community
the diffusion broke out. This method dramatically narrows down the scale of the
search for diffusion sources and therefore addresses the scalability issue in this area.
Then we proposed a novel method to locate the real diffusion source from the first
infected community, and considered the suspect with the highest correlation coefficient
as the real source. This method allows us to consider only sources inside the
suspicious communities, rather than the whole network, which means the method
can be applied just as efficiently to large networks as small ones. Experiments on
large networks and comparison with many competitive methods show significant
advantages of the proposed method.
Chapter 13
Future Directions and Conclusion

While previous chapters provide a thorough description of malicious attack
propagation and source identification, many interesting and promising issues remain
unexplored. The development of online social networks provides great opportunities
for research on restraining malicious attacks and identifying attack sources, but also
presents a challenge in effective utilization of the large volume of data. There are
still other topics that need to be considered in malicious attack propagation and
source identification, and we consider a few directions that are worthy of future
attention.

13.1 Continuous Time-Varying Networks

In Chap. 10, we introduced an effective method to identify the propagation source
of malicious attacks in time-varying networks by utilizing discrete time-integrating
windows to express time-varying networks. The size of the time window could be
minutes, hours, days or even months. This may lead to new ideas of identifying
propagation sources in continuous time windows.
In the real world, many complex networks—human contact network, online
social networks, transportation network, computer networks, to just name a few—
present continuous time-varying topologies. For example, on online social network
websites, users continuously publish posts and comment on posts, which is an
essential part of many social networking websites and forums. In many cases
the data are recorded on a continuous time scale. The approach proposed in this
book analyses discrete time windows, dividing the entire time duration into
several even intervals. This greatly simplifies time-varying networks but also
loses some latent features of continuous time windows. Designing methods to detect
the propagation sources of malicious attacks in continuous time windows is a new
direction for future research.

© Springer Nature Switzerland AG 2019 179


J. Jiang et al., Malicious Attack Propagation and Source
Identification, Advances in Information Security 73,
https://doi.org/10.1007/978-3-030-02179-5_13

13.2 Multiple Attacks on One Network

In Chap. 11, we introduced an efficient method to identify multiple attack sources in
complex networks. We considered multiple sources spreading one malicious attack.
In the real world, however, several different malicious attacks often spread
simultaneously in one network. These attacks may reinforce one another and enhance
each other's spread. Therefore, identifying the multiple sources of multiple
malicious attacks is of great significance.
Current research on source identification only considers the diffusion of one
malicious attack. However, real-world events are generally more complicated. For example,
a rumor starting in March 2008, claiming that Obama was born in Kenya before
being flown to Hawaii, was spread on social network websites, and another rumor
about his religion circulated there as well; both implied that Obama should be
disqualified from the presidency. Rumors about the same event sometimes support
each other, thus attracting more and more attention from the general
public and finally misleading people. Therefore, how to identify the sources of multiple
malicious attacks in complex networks is a good topic for future research.

13.3 Interconnected Networks

The diffusion of malicious attacks is a complex process in the real world. It
may involve multiple interconnected networks to spread information. For example,
people may hear rumors from online social networks, such as Facebook or Twitter.
They can also receive rumors from other media.
Current research on propagation source identification only considers malicious
attack spreading in a single network. However, real-world networks are often
interconnected or even interdependent. For example, in online social networks, a
user could have a Facebook account and also have a Twitter account. After the
user received a rumor on Facebook, he/she could also post the rumor on his/her
Twitter account. Thus, the rumor will successfully spread from Facebook to Twitter.
However, detecting malicious attack sources in interconnected networks is still
an open issue. Identifying malicious attack sources in interconnected
networks would be much more realistic than methods that consider only a single network.

13.4 Conclusion

This book attempted a comprehensive treatment of malicious attack propagation and
source identification. We provided an overview of the huge literature on two major
directions: analyzing the propagation of malicious attacks and identifying the source
of the propagation. For malicious attack propagation, we have identified different

methods for measuring the influence of network hosts and different approaches for
restraining the propagation of malicious attacks. For identifying the propagation
source of malicious attacks, we discussed current methods with regard to three
different categories of observations on the propagation, and analyzed their pros and
cons based on real-world datasets. Furthermore, we discussed three critical
research issues about propagation source identification: identifying propagation
source in time-varying networks, identifying multiple propagation sources, and
identifying propagation source in large-scale networks. For each research issue, we
introduced one representative state-of-the-art method.
Malicious attack propagation and source identification still hold much unexplored
potential, and the literature summarized in this book can be a starting point for
exploring new challenges in the future. Our goal is to give an overview of existing
work on malicious attack propagation and source identification and to show its
usefulness to newcomers as well as practitioners in various fields. We also hope
the overview can help avoid redundant, ad hoc effort from both researchers
and industry.
References

1. The CAIDA AS relationships dataset, Aug. 30, 2009.


2. A. Agaskar and Y. M. Lu. A fast monte carlo algorithm for source localization on graphs. In
SPIE Optical Engineering and Applications. International Society for Optics and Photonics,
2013.
3. Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multiscale complexity in
networks. Nature, 466(7307):761–764, 2010.
4. R. Albert, I. Albert, and G. L. Nakarado. Structural vulnerability of the north american power
grid. Physical review E, 69(2):025103, 2004.
5. R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of modern
physics, 74(1):47, 2002.
6. K. Alintanahin. Cryptolocker: Its spam and zeus/zbot connection, October 21 2013.
7. F. Altarelli, A. Braunstein, L. Dall'Asta, A. Lage-Castellanos, and R. Zecchina. Bayesian
inference of epidemics on networks via belief propagation. Physical review letters,
112(11):118701, 2014.
8. L. A. N. Amaral, A. Scala, M. Barthelemy, and H. E. Stanley. Classes of small-world
networks. Proceedings of the National Academy of Sciences, 97(21):11149–11152, 2000.
9. C. Anagnostopoulos, S. Hadjiefthymiades, and E. Zervas. Information dissemination between
mobile nodes for collaborative context awareness. Mobile Computing, IEEE Transactions on,
10(12):1710–1725, 2011.
10. R. M. Anderson, R. M. May, and B. Anderson. Infectious diseases of humans: dynamics and
control, volume 28. Wiley Online Library, 1992.
11. N. Antulov-Fantulin, A. Lančić, T. Šmuc, H. Štefančić, and M. Šikić. Identification of patient
zero in static and temporal networks: Robustness and limitations. Physical review letters,
114(24):248701, 2015.
12. N. T. Bailey et al. The mathematical theory of infectious diseases and its applications. Charles
Griffin & Company Ltd, 5a Crendon Street, High Wycombe, Bucks HP13 6LE., 1975.
13. E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone’s an influencer:
Quantifying influence on twitter. In Proceedings of the Fourth ACM International Conference
on Web Search and Data Mining, WSDM ’11, pages 65–74, New York, NY, USA, 2011.
ACM.
14. E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone’s an influencer:
Quantifying influence on twitter. In Proceedings of the Fourth ACM International Conference
on Web Search and Data Mining, WSDM ’11, pages 65–74, 2011.


15. A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. science,
286(5439):509–512, 1999.
16. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. science,
286(5439):509–512, 1999.
17. A. Beuhring and K. Salous. Beyond blacklisting: Cyberdefense in the era of advanced
persistent threats. Security & Privacy, IEEE, 12(5):90–93, 2014.
18. S. Bhagat, A. Goyal, and L. V. Lakshmanan. Maximizing product adoption in social networks.
In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining,
WSDM ’12, pages 603–612. ACM, 2012.
19. V. Blue. Cryptolocker's crimewave: A trail of millions in laundered bitcoin. [Online]
December 22, 2013. [Cited: January 22, 2014].
20. P. Bonacich. Factoring and weighting approaches to status scores and clique identification.
Journal of Mathematical Sociology, 2(1):113–120, 1972.
21. P. Bonacich. Power and centrality: A family of measures. American journal of sociology,
pages 1170–1182, 1987.
22. Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu. Design and analysis of a social
botnet. Computer Networks, 57(2):556–578, 2013.
23. A. Braunstein and A. Ingrosso. Inference of causality in epidemics on temporal contact
networks. Scientific reports, 6:27538, 2016.
24. D. Brockmann and D. Helbing. The hidden geometry of complex, network-driven contagion
phenomena. Science, 342(6164):1337–1342, 2013.
25. C. Budak, D. Agrawal, and A. El Abbadi. Limiting the spread of misinformation in social
networks. In Proceedings of the 20th international conference on World wide web, WWW
’11, pages 665–674. ACM, 2011.
26. S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir. From the cover: A model of
internet topology using k-shell decomposition. PNAS, Proceedings of the National Academy
of Sciences, 104(27):11150–11154, 2007.
27. C. Cattuto, W. Van den Broeck, A. Barrat, V. Colizza, J.-F. Pinton, and A. Vespignani.
Dynamics of person-to-person interactions from distributed rfid sensor networks. PloS one,
5(7):e11596, 2010.
28. CFinder. Clusters and communities, 2013.
29. D. Chakrabarti, J. Leskovec, C. Faloutsos, S. Madden, C. Guestrin, and M. Faloutsos.
Information survival threshold in sensor and p2p networks. In INFOCOM 2007. 26th IEEE
International Conference on Computer Communications. IEEE, pages 1316–1324, 2007.
30. W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’09, pages 199–208. ACM, 2009.
31. Y. Chen, G. Paul, S. Havlin, F. Liljeros, and H. E. Stanley. Finding a better immunization
strategy. Phys. Rev. Lett., 101:058701, Jul 2008.
32. Z. Chen, K. Zhu, and L. Ying. Detecting multiple information sources in networks under the
sir model. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on,
pages 1–4. IEEE, 2014.
33. A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large
networks. Phys. Rev. E, 70:066111, Dec 2004.
34. C. H. Comin and L. da Fontoura Costa. Identifying the starting point of a spreading process
in complex networks. Phys. Rev. E, 84:056105, Nov 2011.
35. M. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, F. Menczer, and A. Flammini.
Political polarization on twitter. In ICWSM, 2011.
36. K. L. Cooke and P. Van Den Driessche. Analysis of an seirs epidemic model with two delays.
Journal of Mathematical Biology, 35(2):240–260, 1996.
37. G. Cowan. Statistical data analysis. Oxford university press, 1998.
38. D. Dagon, C. C. Zou, and W. Lee. Modeling botnet propagation using time zones. In NDSS,
volume 6, pages 2–13, 2006.
39. D. J. Daley and D. G. Kendall. Epidemics and rumours. Nature, 204:1118, 1964.

40. C. I. Del Genio, T. Gross, and K. E. Bassler. All scale-free networks are sparse. Phys. Rev.
Lett., 107:178701, Oct 2011.
41. Z. Dezső and A.-L. Barabási. Halting viruses in scale-free networks. Phys. Rev. E, 65:055103,
May 2002.
42. B. Doerr, M. Fouz, and T. Friedrich. Why rumors spread so quickly in social networks.
Commun. ACM, 55(6):70–75, June 2012.
43. P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings
of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD ’01, pages 57–66. ACM, 2001.
44. W. Dong, W. Zhang, and C. W. Tan. Rooting out the rumor culprit from suspects. In
Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages
2671–2675. IEEE, 2013.
45. S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Structure of growing networks with
preferential linking. Physical review letters, 85(21):4633, 2000.
46. N. Eagle and A. Pentland. Reality mining: sensing complex social systems. Personal and
ubiquitous computing, 10(4):255–268, 2006.
47. D. Easley and J. Kleinberg. Networks, crowds, and markets: Reasoning about a highly
connected world. Cambridge University Press, 2010.
48. H. Ebel, L.-I. Mielsch, and S. Bornholdt. Scale-free topology of e-mail networks. Phys. Rev.
E, 66:035103, Sep 2002.
49. H. Ebel, L.-I. Mielsch, and S. Bornholdt. Scale-free topology of e-mail networks. Phys. Rev.
E, 66:035103, Sep 2002.
50. C. Economics. Malware report: The economic impact of viruses, spyware, adware, botnets,
and other malicious code. Irvine, CA: Computer Economics, 2007.
51. Economist. A thing of threads and patches. Economist, August 25, 2012.
52. P. Erdős. Graph theory and probability. Canad. J. Math., 11:34–38, 1959.
53. P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
54. ESET. Virus radar, November 2014.
55. M. R. Faghani and U. T. Nguyen. Modeling the propagation of Trojan malware in online social
networks. arXiv preprint arXiv:1708.00969, 2017.
56. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet
topology. In Proceedings of the conference on Applications, technologies, architectures, and
protocols for computer communication, SIGCOMM ’99, pages 251–262. ACM, 1999.
57. X. Fan and Y. Xiang. Modeling the propagation of peer-to-peer worms. Future Generation
Computer Systems, 26(8):1433–1443, 2010.
58. V. Fioriti, M. Chinnici, and J. Palomo. Predicting the sources of an outbreak with a spectral
technique. Applied Mathematical Sciences, 8(135):6775–6782, 2014.
59. S. Fortunato. Community detection in graphs. Physics reports, 486(3):75–174, 2010.
60. S. Fortunato, A. Flammini, and F. Menczer. Scale-free network growth by ranking. Physical
review letters, 96(21):218701, 2006.
61. M. Fossi and J. Blackbird. Symantec internet security threat report 2010. Technical report,
Symantec Corporation, March, 2011.
62. C. Fraser, C. A. Donnelly, S. Cauchemez, W. P. Hanage, M. D. Van Kerkhove, T. D.
Hollingsworth, J. Griffin, R. F. Baggaley, H. E. Jenkins, E. J. Lyons, et al. Pandemic potential
of a strain of influenza A (H1N1): early findings. Science, 324(5934):1557–1561, 2009.
63. M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network
optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987.
64. L. C. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40:35–
41, 1977.
65. L. C. Freeman. Centrality in social networks conceptual clarification. Social networks,
1(3):215–239, 1978.
66. L. C. Freeman. Centrality in social networks: conceptual clarification. Social Networks,
1:215–239, 1979.
67. L. C. Freeman, S. P. Borgatti, and D. R. White. Centrality in valued graphs: a measure of
betweenness based on network flow. Social Networks, 13:141–154, 1991.
68. L. Fu, Z. Shen, W.-X. Wang, Y. Fan, and Z. Di. Multi-source localization on complex
networks with limited observers. EPL (Europhysics Letters), 113(1):18006, 2016.
69. C. Gao and J. Liu. Modeling and restraining mobile virus propagation. Mobile Computing,
IEEE Transactions on, 12(3):529–541, 2013.
70. C. Gao, J. Liu, and N. Zhong. Network immunization and virus propagation in email
networks: experimental evaluation and analysis. Knowledge and Information Systems,
27:253–279, 2011.
71. C. Gao, J. Liu, and N. Zhong. Network immunization with distributed autonomy-oriented
entities. Parallel and Distributed Systems, IEEE Transactions on, 22(7):1222–1229, 2011.
72. M. Girvan and M. E. Newman. Community structure in social and biological networks.
Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
73. W. Goffman and V. Newill. Generalization of epidemic theory. Nature, 204(4955):225–228,
1964.
74. N. Z. Gong, A. Talwalkar, L. Mackey, L. Huang, E. C. R. Shin, E. Stefanov, E. Shi, and
D. Song. Joint link prediction and attribute inference using a social-attribute network. ACM
Transactions on Intelligent Systems and Technology (ACM TIST), 2013. Accepted.
75. P. D. Grünwald. The minimum description length principle. MIT press, 2007.
76. S. L. Hakimi, M. L. Labbé, and E. Schmeichel. The voronoi partition of a network and its
implications in location theory. ORSA journal on computing, 4(4):412–417, 1992.
77. F. Harary. Graph theory. Addison-Wesley, 1969.
78. H. W. Hethcote. The mathematics of infectious diseases. SIAM review, 42(4):599–653, 2000.
79. P. W. Holland and S. Leinhardt. Transitivity in structural models of small groups. Comparative
group studies, 2(2):107–124, 1971.
80. P. Holme, B. J. Kim, C. N. Yoon, and S. K. Han. Attack vulnerability of complex networks.
Phys. Rev. E, 65:056109, May 2002.
81. Computer Security Institute. The fifteenth annual CSI computer crime and security survey.
Monroe, WA: Computer Security Institute, 2010.
82. H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein
networks. Nature, 411(6833):41–42, May 2001.
83. H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein
networks. Nature, 411(6833):41–42, 2001.
84. J. Jiang, W. Sheng, S. Yu, Y. Xiang, and W. Zhou. Rumor source identification in social net-
works with time-varying topology. IEEE Transactions on Dependable and Secure Computing,
2016.
85. J. Jiang, S. Wen, S. Yu, Y. Xiang, and W. Zhou. K-center: An approach on the multi-
source identification of information diffusion. Information Forensics and Security, IEEE
Transactions on, 17 August 2015.
86. J. Jiang, S. Wen, S. Yu, Y. Xiang, and W. Zhou. K-center: An approach on the multi-source
identification of information diffusion. IEEE Transactions on Information Forensics and
Security, 10(12):2616–2626, 2015.
87. J. Jiang, S. Wen, S. Yu, Y. Xiang, and W. Zhou. Identifying propagation sources in networks:
State-of-the-art and comparative studies. IEEE Communications Surveys and Tutorials,
accepted, in press.
88. J. Shetty and J. Adibi. The enron email dataset database schema and brief statistical report.
Technical report, University of Southern California, 2009.
89. N. Karamchandani and M. Franceschetti. Rumor source detection under probabilistic sam-
pling. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on,
pages 2184–2188, 2013.
90. B. Karrer and M. E. J. Newman. Message passing approach for general epidemic models.
Phys. Rev. E, 82:016101, Jul 2010.
91. M. Karsai, N. Perra, and A. Vespignani. Time varying networks and the weakness of strong
ties. Scientific reports, 4, 2014.
92. L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43,
1953.
93. M. J. Keeling and K. T. Eames. Networks and epidemic models. Journal of the Royal Society
Interface, 2(4):295–307, 2005.
94. M. J. Keeling and P. Rohani. Modeling infectious diseases in humans and animals. Princeton
University Press, 2008.
95. D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social
network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’03, pages 137–146, 2003.
96. M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley, and H. Makse.
Identification of influential spreaders in complex networks. Nature Physics, 6(11):888–893,
Aug 2010.
97. J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins. The web as a
graph: measurements, models, and methods. In International Computing and Combinatorics
Conference, pages 1–17. Springer, 1999.
98. D. Koschützki, K. A. Lehmann, L. Peeters, S. Richter, D. Tenfelde-Podehl, and O. Zlotowski.
Centrality indices. In Network analysis, pages 16–61. Springer, 2005.
99. P. L. Krapivsky and S. Redner. Organization of growing random networks. Physical Review
E, 63(6):066123, 2001.
100. M. J. Krasnow. Hacking, malware, and social engineering—definitions of and statistics about
cyber threats contributing to breaches. Expert Commentary: Cyber and Privacy Risk and
Insurance, January 2012.
101. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic
models for the web graph. In Foundations of Computer Science, 2000. Proceedings. 41st
Annual Symposium on, pages 57–65. IEEE, 2000.
102. H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media?
In WWW ’10: Proceedings of the 19th international conference on World wide web, pages
591–600. ACM, 2010.
103. Kaspersky Labs. Facebook malware poses as flash update, infects 110k users, February 2015.
104. I. Lawrence and K. Lin. A concordance correlation coefficient to evaluate reproducibility.
Biometrics, pages 255–268, 1989.
105. B. Li. An in-depth look into malicious browser extensions, October 2014.
106. F. Li, Y. Yang, and J. Wu. CPMC: An efficient proximity malware coping scheme in
smartphone-based mobile networks. In INFOCOM, 2010 Proceedings IEEE, pages 1–9,
2010.
107. Y. Li, W. Chen, Y. Wang, and Z.-L. Zhang. Influence diffusion dynamics and influence
maximization in social networks with friend and foe relationships. In Proceedings of the sixth
ACM international conference on Web search and data mining, WSDM ’13, pages 657–666.
ACM, 2013.
108. Y. Li, P. Hui, D. Jin, L. Su, and L. Zeng. Optimal distributed malware defense in mobile
networks with heterogeneous devices. Mobile Computing, IEEE Transactions on, 2013.
Accepted.
109. Y. Li, B. Zhao, and J.-S. Lui. On modeling product advertisement in large-scale online social
networks. Networking, IEEE/ACM Transactions on, 20(5):1412–1425, 2012.
110. Y. Y. Liu, J. J. Slotine, and A.-L. Barabási. Controllability of complex networks. Nature,
473:167–173, 2011.
111. A. Y. Lokhov, M. Mézard, H. Ohta, and L. Zdeborová. Inferring the origin of an epidemic
with dynamic message-passing algorithm. arXiv preprint arXiv:1303.5315, 2013.
112. A. Louni and K. Subbalakshmi. A two-stage algorithm to estimate the source of information
diffusion in social media networks. In Computer Communications Workshops (INFOCOM
WKSHPS), 2014 IEEE Conference on, pages 329–333. IEEE, 2014.
113. R. D. Luce and A. D. Perry. A method of matrix analysis of group structure. Psychometrika,
14(2):95–116, 1949.
114. W. Luo and W. P. Tay. Finding an infection source under the SIS model. In Acoustics, Speech
and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2930–2934,
2013.
115. W. Luo, W. P. Tay, and M. Leng. Identifying infection sources and regions in large networks.
Signal Processing, IEEE Transactions on, 61(11):2850–2865, 2013.
116. W. Luo, W. P. Tay, and M. Leng. How to identify an infection source with limited
observations. IEEE Journal of Selected Topics in Signal Processing, 8(4):586–597, 2014.
117. W. Luo, W. P. Tay, and M. Leng. Rumor spreading and source identification: A hide and seek
game. arXiv preprint arXiv:1504.04796, 2015.
118. Y. Ma, X. Jiang, M. Li, X. Shen, Q. Guo, Y. Lei, and Z. Zheng. Identify the diversity
of mesoscopic structures in networks: A mixed random walk approach. EPL (Europhysics
Letters), 104(1):18006, 2013.
119. D. MacRae. 5 viruses to be on the alert for in 2014.
120. H. E. Marano. Our brain’s negative bias. Technical report, Psychology Today, June 20, 2003.
121. M. Mathioudakis, F. Bonchi, C. Castillo, A. Gionis, and A. Ukkonen. Sparsification of
influence networks. In Proceedings of the 17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’11, pages 529–537. ACM, 2011.
122. R. M. May and A. L. Lloyd. Infection dynamics on scale-free networks. Phys. Rev. E,
64:066112, Nov 2001.
123. A. R. McLean, R. M. May, J. Pattison, R. A. Weiss, et al. SARS: A case study in emerging
infections. Oxford University Press, 2005.
124. S. Meloni, A. Arenas, S. Gómez, J. Borge-Holthoefer, and Y. Moreno. Modeling epidemic
spreading in complex networks: concurrency and traffic. In Handbook of Optimization in
Complex Networks, pages 435–462. Springer, 2012.
125. S. Milgram. The small world problem. Psychology today, 2(1):60–67, 1967.
126. D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver. Inside the slammer
worm. IEEE Security and Privacy, 1(4):33–39, July 2003.
127. D. Moore, C. Shannon, et al. Code-red: a case study on the spread and victims of an internet
worm. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment, pages
273–284. ACM, 2002.
128. Y. Moreno, M. Nekovee, and A. F. Pacheco. Dynamics of rumor spreading in complex
networks. Physical Review E, 69(6):066130, 2004.
129. T. Nepusz and T. Vicsek. Controlling edge dynamics in complex networks. Nature, 8:568–
573, 2012.
130. NetMiner4. Premier software for network analysis, 2013.
131. M. E. Newman. The structure and function of complex networks. SIAM review, 45(2):167–
256, 2003.
132. M. E. Newman. A measure of betweenness centrality based on random walks. Social
networks, 27(1):39–54, 2005.
133. M. E. Newman. The mathematics of networks. The new palgrave encyclopedia of economics,
2:1–12, 2008.
134. M. E. Newman and J. Park. Why social networks are different from other types of networks.
Physical Review E, 68(3):036122, 2003.
135. M. E. Newman, D. J. Watts, and S. H. Strogatz. Random graph models of social networks.
Proceedings of the National Academy of Sciences, 99(suppl 1):2566–2572, 2002.
136. M. E. J. Newman. Networks: An Introduction, chapter 17 Epidemics on networks, pages 700–
750. Oxford University Press, 2010.
137. M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks.
Phys. Rev. E, 69:026113, Feb 2004.
138. N. P. Nguyen, T. N. Dinh, S. Tokala, and M. T. Thai. Overlapping communities in dynamic
networks: their detection and mobile applications. In Proceedings of the 17th annual
international conference on Mobile computing and networking, MobiCom ’11, pages 85–96.
ACM, 2011.
139. G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure
of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
140. G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure
of complex networks in nature and society. Nature, 435:814–818, 2005.
141. R. A. Pande. Using plant epidemiological methods to track computer network worms. PhD
thesis, Virginia Tech, 2004.
142. C. Pash. The lure of naked Hollywood star photos sent the internet into meltdown in New
Zealand. Business Insider Australia, September 7, 2014.
143. F. Peter. 'Bogus' AP tweet about explosion at the White House wipes billions off US markets,
April 23, 2013. Washington.
144. S. Pettie and V. Ramachandran. A shortest path algorithm for real-weighted undirected
graphs. SIAM Journal on Computing, 34(6):1398–1431, 2005.
145. A.-K. Pietilainen. CRAWDAD data set thlab/sigcomm2009 (v. 2012-07-15). Downloaded
from http://crawdad.org/thlab/sigcomm2009/, July 2012.
146. P. C. Pinto, P. Thiran, and M. Vetterli. Locating the source of diffusion in large-scale networks.
Phys. Rev. Lett., 109:068702, Aug 2012.
147. B. A. Prakash, J. Vreeken, and C. Faloutsos. Spotting culprits in epidemics: How many and
which ones? In Proceedings of the 2012 IEEE 12th International Conference on Data Mining,
ICDM ’12, pages 11–20, Washington, DC, USA, 2012. IEEE Computer Society.
148. B. A. Prakash, J. Vreeken, and C. Faloutsos. Efficiently spotting the starting points of an
epidemic in a large graph. Knowledge and Information Systems, 38(1):35–59, 2014.
149. A. Rapoport. Spread of information through a population with socio-structural bias: I.
Assumption of transitivity. The bulletin of mathematical biophysics, 15(4):523–533, 1953.
150. J. G. Restrepo, E. Ott, and B. R. Hunt. Characterizing the dynamical importance of network
nodes and links. Phys. Rev. Lett., 97:094102, Sep 2006.
151. B. Ribeiro, N. Perra, and A. Baronchelli. Quantifying the effect of temporal resolution on
time-varying networks. Scientific reports, 3, 2013.
152. M. Rosvall and C. T. Bergstrom. An information-theoretic framework for resolving com-
munity structure in complex networks. Proceedings of the National Academy of Sciences,
104(18):7327–7331, 2007.
153. M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal
community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123,
2008.
154. G. Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, 1966.
155. M. Sales-Pardo, R. Guimera, A. A. Moreira, and L. A. N. Amaral. Extracting the hierar-
chical organization of complex systems. Proceedings of the National Academy of Sciences,
104(39):15224–15229, 2007.
156. S. Savage, D. Wetherall, A. Karlin, and T. Anderson. Practical network support for ip
traceback. ACM SIGCOMM Computer Communication Review, 30(4):295–306, 2000.
157. V. Sekar, Y. Xie, D. A. Maltz, M. K. Reiter, and H. Zhang. Toward a framework for internet
forensic analysis. In ACM HotNets-III, 2004.
158. E. Seo, P. Mohapatra, and T. Abdelzaher. Identifying rumors and their sources in social
networks. In SPIE Defense, Security, and Sensing, volume 8389, 2012.
159. M. A. Serrano and M. Boguñá. Clustering in complex networks. II. Percolation properties.
Phys. Rev. E, 74:056115, Nov 2006.
160. D. Shah and T. Zaman. Detecting sources of computer viruses in networks: Theory and exper-
iment. In Proceedings of the ACM SIGMETRICS International Conference on Measurement
and Modeling of Computer Systems, SIGMETRICS ’10, pages 203–214. ACM, 2010.
161. D. Shah and T. Zaman. Rumors in a network: Who’s the culprit? IEEE Transactions on
information theory, 57(8):5163–5181, 2011.
162. D. Shah and T. Zaman. Rumor centrality: A universal source detector. SIGMETRICS Perform.
Eval. Rev., 40(1):199–210, June 2012.
163. Z. Shen, S. Cao, W.-X. Wang, Z. Di, and H. E. Stanley. Locating the source of diffusion in
complex networks by time-reversal backward spreading. Physical Review E, 93(3):032301,
2016.
164. J. Shetty and J. Adibi. The enron email dataset database schema and brief statistical report.
Information Sciences Institute Technical Report, University of Southern California, 4, 2004.
165. S. Shirazipourazad, B. Bogard, H. Vachhani, A. Sen, and P. Horn. Influence propagation in
adversarial setting: how to defeat competition with least amount of investment. In Proceedings
of the 21st ACM international conference on Information and knowledge management, CIKM
’12, pages 585–594. ACM, 2012.
166. L.-P. Song, Z. Jin, and G.-Q. Sun. Modeling and analyzing of botnet interactions. Physica A:
Statistical Mechanics and its Applications, 390(2):347–358, 2011.
167. Symantec. The 2012 norton cybercrime report. Mountain View, CA: Symantec, 2012.
168. WHO Ebola Response Team. Ebola virus disease in West Africa – the first 9 months of the
epidemic and forward projections. N Engl J Med, 371(16):1481–95, 2014.
169. K. Thomas and D. M. Nicol. The koobface botnet and the rise of social malware. In Malicious
and Unwanted Software (MALWARE), 2010 5th International Conference on, pages 63–70.
IEEE, 2010.
170. M. P. Viana, D. R. Amancio, and L. d. F. Costa. On time-varying collaboration networks.
Journal of Informetrics, 7(2):371–378, 2013.
171. B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi. On the evolution of user interaction
in facebook. In Proceedings of the 2nd ACM workshop on Online social networks, WOSN
’09, pages 37–42, 2009.
172. B. Vladimir and M. Andrej. Pajek: analysis and visualization of large networks. In GRAPH
DRAWING SOFTWARE, pages 77–103. Springer, 2003.
173. M. Vojnovic, V. Gupta, T. Karagiannis, and C. Gkantsidis. Sampling strategies for epidemic-
style information dissemination. Networking, IEEE/ACM Transactions on, 18(4):1013–1025,
2010.
174. K. Wakita and T. Tsurumi. Finding community structure in mega-scale social networks:
[extended abstract]. In Proceedings of the 16th international conference on World Wide Web,
WWW ’07, pages 1275–1276, 2007.
175. Y. Wang, G. Cong, G. Song, and K. Xie. Community-based greedy algorithm for mining
top-k influential nodes in mobile social networks. In Proceedings of the 16th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 1039–
1048. ACM, 2010.
176. Y. Wang, S. Wen, Y. Xiang, and W. Zhou. Modeling the propagation of worms in networks:
A survey. Communications Surveys Tutorials, IEEE, PP(99):1–19, 2013.
177. Y. Wang, S. Wen, Y. Xiang, and W. Zhou. Modeling the propagation of worms in networks:
A survey. Communications Surveys Tutorials, IEEE, 16(2):942–960, Second 2014.
178. Z. Wang, W. Dong, W. Zhang, and C. W. Tan. Rumor source detection with multiple
observations: Fundamental limits and algorithms. In The 2014 ACM International Conference
on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, pages 1–13. ACM,
2014.
179. D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature,
393(6684):440–442, 1998.
180. D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature,
393(6684):440–442, 1998.
181. N. Weaver, V. Paxson, S. Staniford, and R. Cunningham. A taxonomy of computer worms. In
Proceedings of the 2003 ACM Workshop on Rapid Malcode, WORM ’03, pages 11–18, 2003.
182. S. Wen, J. Jiang, B. Liu, Y. Xiang, and W. Zhou. Using epidemic betweenness to measure
the influence of users in complex networks. Journal of Network and Computer Applications,
78:288–299, 2017.
183. S. Wen, J. Jiang, Y. Xiang, S. Yu, and W. Zhou. Are the popular users always important for
the information dissemination in online social networks? Network, IEEE, pages 1–3, October
2014.
184. S. Wen, J. Jiang, Y. Xiang, S. Yu, W. Zhou, and W. Jia. To shut them up or to clarify:
restraining the spread of rumors in online social networks. Parallel and Distributed Systems,
IEEE Transactions on, 25(12):3306–3316, 2014.
185. S. Wen, W. Zhou, Y. Wang, W. Zhou, and Y. Xiang. Locating defense positions for thwarting
the propagation of topological worms. Communications Letters, IEEE, 16(4):560–563, 2012.
186. S. Wen, W. Zhou, J. Zhang, Y. Xiang, W. Zhou, and W. Jia. Modeling propagation dynamics of
social network worms. Parallel and Distributed Systems, IEEE Transactions on, 24(8):1633–
1643, 2013.
187. S. Wen, W. Zhou, J. Zhang, Y. Xiang, W. Zhou, W. Jia, and C. Zou. Modeling and analysis
on the propagation dynamics of modern email malware. Dependable and Secure Computing,
IEEE Transactions on, 11(4):361–374, July 2014.
188. L. Weng, F. Menczer, and Y.-Y. Ahn. Virality prediction and community structure in social
networks. Scientific reports, 3, 2013.
189. P. Wood and G. Egan. Symantec internet security threat report 2011. Technical report,
Symantec Corporation, April, 2012.
190. Y. Xiang, X. Fan, and W. T. Zhu. Propagation of active worms: a survey. International journal
of computer systems science & engineering, 24(3):157–172, 2009.
191. Y. Xie, V. Sekar, D. A. Maltz, M. K. Reiter, and H. Zhang. Worm origin identification using
random moonwalks. In Security and Privacy, 2005 IEEE Symposium on, pages 242–256.
IEEE, 2005.
192. G. Yan, G. Chen, S. Eidenbenz, and N. Li. Malware propagation in online social networks:
nature, dynamics, and defense implications. In Proceedings of the 6th ACM Symposium on
Information, Computer and Communications Security, ASIACCS’11, pages 196–206, 2011.
193. G. Yan and S. Eidenbenz. Modeling propagation dynamics of bluetooth worms (extended
version). Mobile Computing, IEEE Transactions on, 8(3):353–368, 2009.
194. Y. Yan, Y. Qian, H. Sharif, and D. Tipper. A survey on smart grid communication infrastruc-
tures: Motivations, requirements and challenges. Communications Surveys Tutorials, IEEE,
15(1):5–20, First 2013.
195. K. Yang, A. H. Shekhar, D. Oliver, and S. Shekhar. Capacity-constrained network-voronoi
diagram: a summary of results. In International Symposium on Spatial and Temporal
Databases, pages 56–73. Springer, 2013.
196. Y. Yao, X. Luo, F. Gao, and S. Ai. Research of a potential worm propagation model based on
pure p2p principle. In Communication Technology, 2006. ICCT’06. International Conference
on, pages 1–4. IEEE, 2006.
197. W. Zang, P. Zhang, C. Zhou, and L. Guo. Discovering multiple diffusion source nodes in
social networks. Procedia Computer Science, 29:443–452, 2014.
198. Y. Zhou and X. Jiang. Dissecting Android malware: Characterization and evolution. In
Security and Privacy (SP), 2012 IEEE Symposium on, pages 95–109. IEEE, 2012.
199. G.-M. Zhu, H. Yang, R. Yang, J. Ren, B. Li, and Y.-C. Lai. Uncovering evolutionary ages of
nodes in complex networks. The European Physical Journal B, 85(3):1–6, 2012.
200. K. Zhu and L. Ying. Information source detection in the SIR model: A sample path based
approach. arXiv preprint arXiv:1206.5421, 2012.
201. K. Zhu and L. Ying. Information source detection in the SIR model: A sample path based
approach. In Information Theory and Applications Workshop (ITA), pages 1–9, 2013.
202. K. Zhu and L. Ying. A robust information source estimator with sparse observations.
Computational Social Networks, 1(1):1, 2014.
203. K. Zhu and L. Ying. Information source detection in the SIR model: a sample-path-based
approach. IEEE/ACM Transactions on Networking, 24(1):408–421, 2016.
204. Y. Zhu, B. Xu, X. Shi, and Y. Wang. A survey of social-based routing in delay tolerant
networks: Positive and negative social effects. Communications Surveys Tutorials, IEEE,
15(1):387–401, Jan 2013.
205. Z. Zhu, G. Lu, Y. Chen, Z. Fu, P. Roberts, and K. Han. Botnet research survey. In Computer
Software and Applications, 2008. COMPSAC ’08. 32nd Annual IEEE International, pages
967–972, July 2008.
206. C. C. Zou, W. Gong, and D. Towsley. Code red worm propagation modeling and analysis. In
Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS
’02, pages 138–147, 2002.
207. C. C. Zou, D. Towsley, and W. Gong. Modeling and simulation study of the propagation and
defense of internet e-mail worms. IEEE Transactions on dependable and secure computing,
4(2):105–118, 2007.
208. C. C. Zou, D. Towsley, and W. Gong. Modeling and simulation study of the propagation and
defense of internet e-mail worms. IEEE Transactions on Dependable and Secure Computing,
4(2):105–118, 2007.
