
Efficient Network Seeding under Variable Node Cost and Limited Budget for Social Networks

Journal Pre-proof


R.C. de Souza, D.R. Figueiredo, A.A. de A. Rocha, A. Ziviani

PII: S0020-0255(19)31075-8
DOI: https://doi.org/10.1016/j.ins.2019.11.029
Reference: INS 15021

To appear in: Information Sciences

Received date: 25 March 2018


Revised date: 24 July 2019
Accepted date: 16 November 2019

Please cite this article as: R.C. de Souza, D.R. Figueiredo, A.A. de A. Rocha, A. Ziviani, Efficient Net-
work Seeding under Variable Node Cost and Limited Budget for Social Networks, Information Sciences
(2019), doi: https://doi.org/10.1016/j.ins.2019.11.029


© 2019 Published by Elsevier Inc.


Efficient Network Seeding under Variable Node Cost
and Limited Budget for Social Networks✩

R. C. de Souza^b, D. R. Figueiredo^b, A. A. de A. Rocha^a, A. Ziviani^c

^a Computing Institute, UFF, Brazil
^b Systems Engineering and Computer Science Dept, UFRJ, Brazil
^c National Laboratory for Scientific Computing (LNCC), Brazil

Abstract
The efficiency of information diffusion on networks depends strongly on both the network structure and the set of early spreaders. Moreover, in various realistic scenarios, seeding different nodes incurs different costs, as in the case of viral marketing, where costs often correlate with local network structure. The budgeted influence maximization (BIM) problem consists in determining a seed set whose diffusion maximizes the total number of influenced nodes, provided that the seeding cost is within a given budget. We investigate efficient seeding strategies for the BIM problem under the deterministic fixed threshold diffusion model. In particular, we introduce the concept of surrounding sets: relatively cheap seeds neighboring expensive, structurally-privileged nodes, which then become spreaders at lower costs. Numerical experiments with several real networks indicate that our method outperforms strategies that seed nodes based on their influence/cost ratios. A key insight from our evaluation is that larger diffusion is generally attained from surrounding sets that consider the two-hop neighborhood of influential nodes, as opposed to their immediate neighbors only.
Keywords: network seeding, influence maximization, seeding strategies,
variable node cost, social network

1. Introduction

Spreading phenomena on real and online social networks, such as the diffusion of content or diseases among individuals, receive ever-growing attention from both academia and industry. Understanding how a diffusion can either last long, reaching large fractions of network nodes, or die out quickly after a negligible spread, is a fundamental step in designing more effective diffusion processes. These include more efficient marketing campaigns among users of online social networks and more effective disease spreading prevention among individuals in society [13, 36].

✩ This work was funded in part by research project grants from CNPq, FAPERJ, and FAPESP.
Email addresses: rcsouza@cos.ufrj.br (R. C. de Souza), daniel@land.ufrj.br (D. R. Figueiredo), arocha@ic.uff.br (A. A. de A. Rocha), ziviani@lncc.br (A. Ziviani)
Preprint submitted to Elsevier November 18, 2019

Intuitively, the initial spreaders (i.e., the seeds) play a fundamental role in the diffusion, as the more central ones are likely to maximize some aspect of it. In this sense, the influence maximization (IM) problem consists in identifying the seed set whose diffusion maximizes the expected total number of active nodes [17]. This expected value is called the influence of a seed set. The classical constraint in IM problems is the size of the seed set, which means that (i) all nodes have identical costs for being seeded, independently of their role in the network; and (ii) there exists an implicit budget (given as some multiple of this unit cost) restricting the number of seeds.
Assuming identical node costs is, however, inadequate in many scenarios.
Examples include viral marketing campaigns in online social networks: popu-
lar individuals (celebrities) on Twitter are paid differently to tweet sponsored
marketing messages [20]; popular Instagram users are paid differently to post
photos alongside sponsored products [6]; and similarly—through contracts of
largely different values—Facebook pays companies and celebrities to sustain
live video streaming content, raising users’ exposure to advertisements [34].
Scenarios where nodes have different initial costs have been recently formal-
ized as the budgeted influence maximization (BIM) problem [31] which asks,
given the cost of each node and a fixed budget, what is the most influential seed
set? The budget here represents the resources made available to construct the
set of early spreaders. For example, in a viral marketing campaign, where early
spreaders may charge very different amounts, the budget is the total resources
available for hiring those seeds. Therefore, the cost of the seed set must be
within the budget.
Clearly, a seeding strategy in BIM must consider not only each node’s es-
timated influence, but also its cost towards the budget. Is it better to seed
more nodes at the lowest cost possible? Or fewer nodes at a much higher cost?
Moreover, the correlation between node influence and node cost can play a fun-
damental role, as those more popular individuals are likely to charge higher
amounts for promoting products within their social circles.
In this paper, we investigate efficient seeding strategies for BIM under a
correlated node cost model (i.e., higher centrality, higher cost) and the linear
threshold model. Our main contributions are summarized as follows:
• We propose a seeding strategy, called Node Surround, which consists
of targeting the cheapest neighbors of central, expensive (or even cost-
prohibitive) nodes, leveraging their higher spreading potential at much
lower costs when compared to their direct seeding. We show that, as the
network threshold increases, this approach outperforms state-of-the-art
BIM strategies.
• We show how the classical fraction of activated nodes, a broadly adopted metric, may lead to misinterpretations of the effectiveness of a strategy. Unlike in unit-cost IM, different BIM strategies receiving the same initial budget may still yield seed sets of very different sizes, ranging from a few key nodes up to a large fraction of the network. By considering solely their diffusion power (a metric defined in Section 5.1), we capture the real benefit (activated non-seeds) of an investment (budget). Diffusion power (DP) is a fundamental metric to properly assess BIM strategies. It carries the Outward Influence [32] concept, originally proposed for IM, over to BIM. Indeed, we show that ignoring the seeds (paid influencers) when measuring a strategy’s performance eliminates potentially large assessment distortions.
• We propose a flexible, single-parameter model for node cost, which corre-
lates cost and network centrality. To the best of our knowledge, this is the
first model to admit non-linear relations between cost and local structure.
It also captures the common real-world scenario wherein node-costs across
a network may differ from one another by many orders of magnitude, thus
being more relevant in practice.
We empirically evaluate different seeding strategies on 7 real networks, varying parameters for node cost, budget, and activation threshold. We show that even when the cost of the most central nodes is larger than the budget, effective propagation can still be induced. Moreover, we show that, as the network threshold increases, cost-aware seeding strategies that select nodes by their marginal-gain/cost ratios stop triggering effective diffusion at much lower thresholds than strategies that surround central nodes. This finding is strongly related to the diffusion model considered in this work, which differs from that of most prior works (see Section 2).
Another important insight is that larger diffusion is generally attained when
the two-hop neighbors of a central node are also leveraged in order to have
it surrounded, and not only its direct neighbors. Last, we believe the main
contributions and findings of this work, such as the node-surround concept, can
be applied to other contexts.
The remainder of this paper is organized as follows. In Section 2 we discuss
the related work. Section 3 presents the diffusion model and the node cost
model. In Section 4, we describe the different seeding strategies, including the
node surround concept. Section 5 describes the different performance metrics,
the network datasets, and also presents our evaluation for various scenarios. We
then conclude the paper with a brief discussion in Section 6.

2. Related Work

The influence maximization (IM) problem has been broadly investigated since the seminal work of Kempe et al. [17]. This pioneering paper provides a framework for the general problem, proving its hardness (it is NP-hard) and providing an approximate, polynomial-time greedy algorithm with provable performance (within a constant factor of optimal). Their algorithm relies on the objective function being submodular, which is shown to be the case for some diffusion models. However, its high running time has led to a myriad of approaches to tackle the problem more efficiently [17, 2, 18, 8, 16, 1, 7, 42, 26, 4, 24, 25].
Indeed, various prior works have focused on designing heuristics to deter-
mine good seeds, exploring structural features of the network as well as features
associated with nodes (e.g., labels). For example, computationally-inexpensive
heuristics based on node degree [8], particle swarm optimization [11], and node
homophily [1] have all been considered. Heuristics based on k-core decomposi-
tion [35] have also been explored [18], showing a correlation between influential
spreaders and highly connected regions of the network. This idea has been ex-
plored by various subsequent works that also adapt and augment such metric
with node rankings [4], communities [42], disjoint paths [7], and local neighbor-
hoods [26].
The IM problem has also been investigated under diffusion models fundamentally different from the widely adopted Independent Cascades (IC) and Linear Threshold (LT). For instance, Ugander et al. [40] propose the structural diversity model, further investigated by Wenzheng et al. [43]. However, all these prior works implicitly assume that nodes have identical costs, since the constraint to start a propagation is simply the number of seeds.
There are also recent works that have investigated network seeding where
node costs are not fixed (over time) nor identical across the network. For ex-
ample, Leskovec et al. [22] propose strategies for placing sensors on a network
to more quickly detect a diffusion. Arthur et al. [2] propose strategies to price
products and provide cash-back (discount) to nodes in the network to induce
recommendations to their neighbors. Miyanchi et al. [27] formulate an opti-
mization problem wherein a fixed budget is allocated to a bipartite network of
marketing channels and customers with variable node costs (no diffusion con-
sidered). None of these works specifically addresses the BIM problem.
However, BIM has recently been formulated and investigated by Nguyen and Zheng [31]. The authors start from the framework introduced by Kempe et al. [17] and tackle the problem using the IC model. They establish a submodular, cost-normalized objective function, from which they derive a greedy algorithm (referred to as GR in this text) with constant-factor approximation guarantees.
Other works have also investigated the BIM problem [14, 33, 10]. Han et al. [14] tackle BIM with a heuristic combining two seeding strategies: one based on node influence, and the other on node cost. More recently, Nguyen et al. have formulated a more general problem, Cost-aware Targeted Viral Marketing (CTVM) [33], briefly described as follows. Besides an arbitrary selection cost, each node $v$ also provides an arbitrary benefit $b(v)$ for being activated. The goal is thus to maximize not the influence spread but the total benefit provided by the final active set. Besides CTVM, their algorithm, named BCT, also tackles both the classical IM and the BIM problems. The latter, which is the scope of this work, corresponds to the case where, given a network $G = (V, E)$ and a constant $C \in \mathbb{R}^*_+$, $b(v) = C$ for all $v \in V$. They show that, for IM, BCT significantly outperforms state-of-the-art algorithms such as TIM/TIM+ [38] and IMM [37] in terms of running time, with equal performance with respect to the spread of influence. Also, when considering arbitrary selection costs, they report that BCT outperforms all above-mentioned algorithms, including GR, in terms of producing a seed set that yields a final active set with larger overall benefit. For BIM, however, they report that GR performs better than BCT in terms of total influenced population. Last, Souza et al. [10] characterize the performance of simple and traditional seeding strategies to solve the BIM problem, motivating the need for more clever strategies.
Despite addressing the BIM problem, these prior works have the following
limitations. The theoretical result of Nguyen and Zheng [31] assumes that the
initial budget is larger than the cost of any node. Moreover, their numerical
evaluation uniformly assigns random costs to nodes, from a small range (less
than a factor of 10). Similarly, Han et al. [14] and Nguyen et al. [33] assume that the initial budget is larger than the cost of any node, and their numerical evaluation considers that cost and node centrality are linearly tied. These assumptions fall short of capturing more general pricing practices, such as those adopted by celebrities (nodes) for promoting viral marketing in online social networks [34, 20, 6]. In particular, marketing campaigns may not have sufficient budget to hire the most expensive individuals. In what follows, we propose a flexible node cost model that strictly depends on the network structure. It allows for an arbitrary range of values, without making assumptions on the available budget.

3. Models

We now describe the models for diffusion and node cost considered in this work. While the former corresponds to the classical LT model, the latter is first proposed here. Table 1 describes the main symbols and abbreviations used hereafter. We consider progressive influence spreading, i.e., every node, once activated (i.e., influenced), remains active until the diffusion ends. Also, time evolves discretely, with $t \in \{0, 1, 2, \ldots\}$. A set of nodes, the seed set, is assumed to be active at time zero and is denoted $A_0$. At each time step $t$, every node belongs to exactly one of two sets: $A_t$, the active nodes, or $I_t$, the inactive ones. Thus, $A_t \cup I_t = V$ and $A_t \cap I_t = \emptyset$. The diffusion then unfolds until some quiescence time $q$, corresponding to the first time $t$ such that $A_t = A_{t+i}$, $\forall i \in \mathbb{N}^*$. Thus, $A_q$ denotes the set of activated nodes when the propagation ends.

3.1. Linear Threshold (LT) model


To represent the influence propagation dynamics, we consider the linear threshold (LT) model with fixed and identical thresholds [12, 30], briefly defined as follows. Given a network, every node inactive at $t$ becomes active at $t + 1$ if it has at least $\theta$ active neighbors at $t$. This $\theta \in \mathbb{N}^*$ is the activation threshold, and its value is the same for all nodes. Formally,

\[
\forall v \in V, \quad v \in
\begin{cases}
A_t & \text{if } v \in A_{t-1} \lor |N(v) \cap A_{t-1}| \ge \theta, \\
I_t & \text{otherwise.}
\end{cases}
\tag{1}
\]

Table 1: Main symbols and abbreviations.

| Symbol | Description |
|---|---|
| $n$ | Number of nodes in the network, $n = \lvert V \rvert$. |
| $m$ | Number of edges in the network, $m = \lvert E \rvert$. |
| $N(v)$ | Set of neighbors of node $v \in V$. |
| $d(v)$ | Degree of node $v$, $d(v) = \lvert N(v) \rvert$. |
| $N(v)_i$ | The $i$-th neighbor of $v$ (in an arbitrary sequence). |
| $d$ | Network’s average degree; $d = 2m/n = (\sum_{v \in V} d(v))/n$. |
| $A_t$; $I_t$ | Sets of active and inactive nodes at time $t$, respectively. |
| $A_0$ | Seed set: nodes activated at time zero. |
| $\theta$ | Nodes’ activation threshold. |
| $q$ | Quiescence time: time step at which the diffusion ends. |
| $c(v)$ | The cost of adding $v$ to the seed set, such that $c(v) = d(v)^{\alpha}$. |
| $\alpha$ | Cost function’s single parameter, such that $\alpha \in \mathbb{R}_+$. |
| $b$ | Available budget. |
| $k$ | A fraction of $n$, such that $0 \le k \le 1$. |
| NS-D; NS-T | Seeding strategies. Node Surround policy over degree-based (D) and triangle-based (T) node rankings, respectively. |
| CH | Cheapest Nodes seeding strategy. |
| GR | BIM strategy proposed in [31]. |
| BCT | BIM strategy proposed in [33]. |
| SS | Surrounding Set. |
| ESS | Extended Surrounding Set. |
| $\Gamma(v)$ | Node $v$’s SS. |
| $\Gamma^+(v)$ | Node $v$’s ESS. |
| $\gamma(v)$ | The number of neighbors of $v \in I_0$ yet to be activated in order for $v$ to become itself active in time step 1. Thus, $\gamma(v) = \max(0, \theta - \lvert N(v) \cap A_0 \rvert)$. |
| $\Delta(A_0)$ | Seeds’ Diffusion Power (DP), such that $\Delta(A_0) = (\lvert A_q \rvert - \lvert A_0 \rvert)/n$. |

Note that the model evolves deterministically, controlled by a single parameter $\theta$. If $\theta = 1$, then every node in the same connected component as at least one seed will belong to $A_q$. Trivially, only those nodes from $I_0$ with degree at least $\theta$ can possibly meet the threshold condition and thus become active at some point.
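The update rule in Eq. 1 can be illustrated with a minimal Python sketch. It assumes an undirected network stored as a dictionary of neighbor sets; the names `lt_diffusion`, `diffusion_power`, and the toy graph are hypothetical and serve only as an illustration of the model, not as the implementation used in the experiments.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def lt_diffusion(graph: Graph, seeds: Iterable[Hashable], theta: int) -> Set[Hashable]:
    """Run the deterministic fixed-threshold process until quiescence; return A_q."""
    active = set(seeds)
    while True:
        # Synchronous update: activations at t+1 depend only on the active set at t.
        newly_active = {
            v for v, nbrs in graph.items()
            if v not in active and len(nbrs & active) >= theta
        }
        if not newly_active:  # nothing changes any more: quiescence time reached
            return active
        active |= newly_active


def diffusion_power(graph: Graph, seeds: Iterable[Hashable], theta: int) -> float:
    """Delta(A_0) = (|A_q| - |A_0|) / n, as listed in Table 1."""
    seed_set = set(seeds)
    return (len(lt_diffusion(graph, seed_set, theta)) - len(seed_set)) / len(graph)


# Hypothetical toy graph: a 4-cycle 0-1-2-3 plus a pendant node 4 attached to 3.
toy = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2, 4}, 4: {3}}
final_active = lt_diffusion(toy, seeds={0, 2}, theta=2)
```

In this toy graph, seeding nodes 0 and 2 with θ = 2 activates nodes 1 and 3, while the pendant node 4 has degree 1 < θ and therefore stays inactive, giving a diffusion power of (4 − 2)/5 = 0.4.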
Since first being proposed by Granovetter [12], threshold models have been widely adopted to represent the collective dynamics through which information spreads among individuals. This model and its generalizations are found in a wide range of scenarios, in part due to its simplicity (deterministic, with a single parameter) and intuitive appeal [13]. Within the context of IM, threshold models have been widely considered in studies of information diffusion in social networks [17, 9, 15], information cascades on online social networks [39], and diffusion of innovation [41], among many others. Last, although our developments here focus on the LT model, the techniques and algorithms we propose can be applied to other propagation models.

3.2. Cost model


It is reasonable to assume some correlation between the cost of a seed and its
position in the network, in the sense that structurally privileged nodes—such as
celebrities on Twitter or core routers in the Internet—are likely to have higher
costs. While actual costs may depend on various network centrality metrics
and also on network-independent features, we consider costs dependent on node
degree. Intuitively, node degree is a first-order approximation of influence on
a network. Indeed, comprehensive studies over real data indicate that larger
degree nodes in a network are likewise those more active in viral marketing
campaigns [16]. Thus, the cost $c : V \to \mathbb{R}_+$ of a node $v \in V$ is given by

\[
c(v) = d(v)^{\alpha}, \tag{2}
\]

where $d(v)$ is $v$’s degree and $\alpha \in \mathbb{R}_+$ controls how strongly nodes with larger centrality are valued. In particular, as the sole parameter of the model, $\alpha$ plays a fundamental role in the trade-off between node centrality (degree) and node cost. Decreasing $\alpha$ equalizes costs across the network, thus encouraging the direct seeding of the more influential nodes. Conversely, large values of $\alpha$ yield huge cost gaps between any pair of nodes that differ in centrality (even if by a single unit). Costs must then be carefully considered by the seeding strategy in such scenarios. Last, note that in our model unit-cost IM corresponds to the particular case $\alpha = 0$.
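To make the role of α concrete, the short Python sketch below evaluates Eq. 2 for a few degrees. The function name `node_cost` and the degree values are hypothetical and chosen only for illustration.

```python
def node_cost(degree: int, alpha: float) -> float:
    """Cost of seeding a node of the given degree, c(v) = d(v)^alpha (Eq. 2)."""
    return float(degree) ** alpha


# Hypothetical degrees: a peripheral node, a mid-range node, and a hub.
degrees = [2, 50, 10_000]
for alpha in (0.0, 0.5, 1.0, 2.0):
    costs = [node_cost(d, alpha) for d in degrees]
    ratio = costs[-1] / costs[0]  # how much more the hub costs than the peripheral node
    print(f"alpha={alpha}: costs={costs}, hub/peripheral ratio={ratio:.1f}")
```

With α = 0 every node costs 1, which recovers unit-cost IM. With α = 2 the hub of degree 10,000 costs 10^8 while the node of degree 2 costs 4, a gap of more than seven orders of magnitude.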
We also define the cost of a set of nodes, which is simply the sum of the costs of the nodes in the set. As a useful convention, we consider the cost of an empty set to be $\infty$ (infinity). Formally, given a set $U \subseteq V$,

\[
c(U) =
\begin{cases}
\infty & \text{if } U = \emptyset, \\
\sum_{v \in U} d(v)^{\alpha} & \text{otherwise.}
\end{cases}
\tag{3}
\]

3.3. Problem Statement
In order to measure the effectiveness of a seed set $A_0 \subseteq V$, we consider the Outward Influence (OI) metric proposed in [32], defined as $\sigma : 2^V \to \mathbb{R}_+$ such that

\[
\sigma(A_0) = \mathbb{E}[|A_q|] - |A_0|, \tag{4}
\]

where $q$ is the quiescence time. In our (deterministic) context, Eq. 4 can be rewritten as $\sigma(A_0) = |A_q| - |A_0|$. Note that this metric captures how effective the seeds are at being viral, by distinguishing them (paid influence) from the nodes made active during the diffusion (real benefit).
Thus, given a network G = (V, E), the cost function c(·) and a budget b,
the goal is to determine a seed set A0 ⊆ V that maximizes σ(A0 ) provided that
c(A0 ) ≤ b.
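Written compactly, the problem stated above reads as follows. This is only a restatement of the preceding sentence using Eqs. 2 to 4; the star superscript denoting an optimal seed set is notation introduced here for illustration.

```latex
\[
A_0^{\star} \in \underset{A_0 \subseteq V}{\arg\max}\; \sigma(A_0)
\quad \text{subject to} \quad
c(A_0) = \sum_{v \in A_0} d(v)^{\alpha} \le b .
\]
```

Since $c(\emptyset) = \infty$ by the convention of Eq. 3, the empty seed set is never feasible under a finite budget $b$.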
Note that this problem is at least as hard as unit-cost IM, which has been shown to be NP-hard [17]. As discussed in Section 2, a common approach for network seeding is to run polynomial-time greedy algorithms on submodular objective functions $\sigma(\cdot)$, for which hill-climbing algorithms provide approximation guarantees. Here, however, $\sigma(\cdot)$ is not submodular. We thus consider another approach, wherein the seed set is formed by sequentially including not the nodes with larger incremental influence over cost, but the cheapest ones neighboring the more influential nodes. Moreover, we are primarily interested in the diffusion power of the seed set (defined in Section 5.1), and not only in the fraction of activated nodes. Thus, we aim at designing computationally efficient seeding strategies based on heuristics tailored to the scenario where nodes have different costs and diffusion follows the LT model.

4. Seeding Strategies

A common approach when designing seeding strategies for IM is to rank nodes according to some criterion and then consider them sequentially for inclusion in the seed set. We follow this framework in two steps: (i) nodes are ranked and considered in sequence; (ii) each node considered is either placed in the seed set, ignored, or surrounded (as later detailed). Note that (i) determines a global sequence in which nodes will be traversed, taking into account their centrality, while (ii) concerns the cost, expressing the attempt of “buying more for less”: the importance of each node $v \in I_0$ is leveraged not by seeding $v$ directly, but by seeding its cheapest neighbors that ensure $v \in A_1$.
In the following, we first describe the two node rankings adopted. We then formalize the surrounding sets concept in Sections 4.3 and 4.4, wherein we define the surrounding set and the extended surrounding set, respectively. Next, we present in Section 4.5 both the seeding policies and the seeding strategies (i.e., “ranking-policy” combinations) to be evaluated in Section 5. Finally, in Section 4.6 we present a cost-aware node ranking based on nodes’ centrality/cost-to-surround ratios. Benefits and drawbacks of these cost-weighted rankings are shown in Section 5.

4.1. Degree centrality ranking


As discussed in Section 3, degree may not be the best feature to consider when estimating nodes’ importance in spreading processes [5]. Under the LT model with fixed thresholds this is perhaps easier to see, since the sole condition for a node $v$ to become active at time $t$ is its number of already-active neighbors at $t - 1$. The node degree therefore acts like a valve: any $v$ for which $d(v) < \theta$ holds will never become active unless $v$ is itself a seed. Nevertheless, as also discussed in Section 3, it is unclear which node features mainly yield large diffusion, thus making node degree a simple yet reasonable alternative. Furthermore, degree centrality as a measure of node importance presents three major benefits, as follows. First, a property common to a wide variety of real networks, irrespective of their nature, is a heavy-tailed degree distribution, meaning that nodes whose degree is orders of magnitude larger than the average occur with non-negligible probability. Although this condition may impose a constraint for seeding (since the cost of such nodes may be prohibitive), it also means that there frequently exists a set of a few nodes that jointly neighbor a large fraction of the network. The surrounding set approach thus becomes especially attractive, since important, cost-prohibitive nodes may still be activated if conveniently surrounded by seeds. Second, on real networks the neighbors of the largest-degree nodes are mostly nodes of small degree. This observation is captured by the notion of degree assortativity [29], i.e., the extent to which neighbors in a network are similar with respect to their degree. Finally, large degree nodes naturally tend to have numerous neighbors in common, meaning that part of the nodes used to surround a large degree node may also contribute to surrounding other large degree nodes, which can then be surrounded using fewer new seed nodes. Therefore, one of the node rankings adopted is degree centrality, wherein network nodes are sorted by degree, from the largest to the smallest.

4.2. Triangle centrality ranking


When considering the very condition that allows influence to propagate on a network under the LT model with fixed thresholds, it is immediate to notice that, at each time step $t$, at least one inactive neighbor of some active node must have at least $\theta - 1$ other active neighbors in order to become active at $t + 1$. This highlights the importance of triangles in the network, i.e., subsets of three nodes connected to each other or, more formally, cliques of size 3. Indeed, a node is more likely to adopt the behavior of its neighbors when these are also neighbors of each other [3], a remarkable, known phenomenon once referred to by Kleinberg as a “gravitational force” of a node’s neighbors [19]. We therefore propose a node ranking based on triangles, defined as follows. First we compute, for every node $v_i$, the number $\tau'_i$ of triangles that have $v_i$ as a vertex. Formally, let $N(v_i)$ be the set of neighbors of $v_i$ and $\mathbf{1}(\cdot)$ the indicator function. Then

\[
\tau'_i =
\begin{cases}
0 & \text{if } |N(v_i)| < 2, \\
\sum_{u_j, u_k \in N(v_i),\, j < k} \mathbf{1}\big(u_k \in N(u_j)\big) & \text{otherwise.}
\end{cases}
\tag{5}
\]

Next, we determine $\tau''_i$, which is the sum of $\tau'_j$ over every neighbor $u_j$ of $v_i$, plus $\tau'_i$:

\[
\tau''_i = \tau'_i + \sum_{u_j \in N(v_i)} \tau'_j. \tag{6}
\]

The triangle centrality of node $v_i$ is given by $\tau''_i$, and larger is more central. Note that the definition of $\tau''_i$ implies that every triangle counted in $\tau'_i$ is counted three times, while triangles formed by two of $v_i$’s neighbors but not $v_i$ itself are counted twice. Thus, $\tau''_i$ gives more weight to $v_i$’s own triangles, a desirable feature since at each time step only the inactive nodes directly neighboring the active ones can possibly become influenced.
The second node ranking adopted, therefore, is triangle centrality, wherein the assessed structural relevance of each node corresponds to its $\tau''$ index. Note that such a score provides different information from the node’s clustering coefficient (CC) [30], as the latter is a relative measure: nodes with the same CC may still differ hugely in their absolute number of triangles.
Hereafter, to distinguish the node rankings, we denote degree centrality by $V^d$ and triangle centrality by $V^t$.
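The two triangle counts of Eqs. 5 and 6 can be sketched in Python as follows. The graph representation (a dictionary of neighbor sets), the function names, and the label-based ordering of nodes with equal scores are assumptions of this sketch rather than details given in the text.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def own_triangles(graph: Graph, v: Hashable) -> int:
    """tau'_v: number of triangles that have v as a vertex (Eq. 5)."""
    # Each unordered pair of neighbors of v that are adjacent closes one triangle.
    return sum(1 for a, b in combinations(graph[v], 2) if b in graph[a])


def triangle_centrality(graph: Graph) -> Dict[Hashable, int]:
    """tau''_v: tau'_v plus the sum of tau'_u over the neighbors u of v (Eq. 6)."""
    tau1 = {v: own_triangles(graph, v) for v in graph}
    return {v: tau1[v] + sum(tau1[u] for u in graph[v]) for v in graph}


def triangle_ranking(graph: Graph) -> List[Hashable]:
    """Nodes by decreasing tau''; equal scores ordered by label (an assumption)."""
    tau2 = triangle_centrality(graph)
    return sorted(graph, key=lambda v: (-tau2[v], v))


# Hypothetical toy graph: two triangles, 0-1-2 and 2-3-4, sharing node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
scores = triangle_centrality(toy)
ranking = triangle_ranking(toy)
```

Here node 2 belongs to both triangles, so its own count is 2 and its triangle centrality is 2 + 4 = 6, while every other node scores 1 + 1 + 2 = 4.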

4.3. The Surrounding Set


Under different scenarios, cost-unaware strategies tend to exhaust the budget inefficiently. A fundamental question is thus how to target a relatively large set of nodes while still leveraging the expensive, well-ranked ones. To tackle this problem, we propose the surrounding sets, defined as follows. Let $\theta$ denote the node activation threshold and $\gamma : I \to \mathbb{N}$ denote the “distance” of a node $v \in I_0$ from $A_1$, i.e., the number of $v$’s neighbors yet to be seeded to ensure $v \in A_1$. Formally,

\[
\gamma(v) = \max(0, \theta - |N(v) \cap A_0|). \tag{7}
\]

A surrounding set $\Gamma(v)$ of a node $v$ is a $\gamma(v)$-size set of inactive nodes, given by the set function $\Gamma : I \to 2^I$, as follows:

\[
\Gamma(v) =
\begin{cases}
\emptyset & \text{if } |N(v)| < \theta \lor \gamma(v) = 0, \\
\underset{w \in N(v) \cap I_0}{\arg\min}\; c\Big(\bigcup_{i=1}^{\gamma(v)} \{w_i\}\Big) & \text{otherwise.}
\end{cases}
\tag{8}
\]

Note that $\Gamma(v) = \emptyset$ may mean either that $v$ already has at least $\theta$ active neighbors, or that $d(v) < \theta$ (and thus $v$ cannot be node-surrounded). Otherwise, $\Gamma(v) \neq \emptyset$ and $\Gamma(v)$ contains the cheapest $\gamma(v)$ inactive neighbors of $v$. Also, note that if all nodes in $\Gamma(v)$ are seeded, then $v$ becomes active at $t = 1$. Thus, for each $v \in I_0$, if $|N(v) \cap A_0| \ge \theta$, then $v$ is said to be surrounded, since certainly $v \in A_1$. Finally, Algorithm 1 describes the construction of the surrounding set.
The algorithm: Algorithm 1 first verifies whether the input node $v$ is eligible for being surrounded (line 2), returning an empty $\Gamma$ if this is not the case. If $v$ is eligible, we then add its $\gamma(v)$ cheapest inactive neighbors to $\Gamma$ (lines 5-11).
Computational complexity: Prior to executing any of the algorithms, we first need to sort each node’s neighbors in ascending order of their costs. Sorting all $d(v)$ neighbors of a node $v$ requires $O(d(v) \log d(v))$. The overall computation for all nodes is therefore $O(\sum_i d(v_i) \log d(v_i))$, which is also $O(m \log m)$, since $\sum_i d(v_i) = 2m$. Thus, pre-processing is $O(m \log m)$. Back to Alg. 1, its complexity is dominated by iterating over $\gamma(v)$ neighbors of $v$ (lines 5-11). Since in the worst case $\gamma(v) = \theta$, Alg. 1 is $O(\theta)$.
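A minimal Python sketch of the surrounding set construction (Eqs. 7 and 8) is given below. It is an illustration under stated assumptions, not the authors' implementation: the graph is a dictionary of neighbor sets, costs follow Eq. 2, neighbors are sorted on the fly instead of being pre-sorted, and nodes of equal cost are ordered by label rather than by the tiebreak rule of the paper. All function names are hypothetical.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def gamma(graph: Graph, seeds: Set[Hashable], v: Hashable, theta: int) -> int:
    """Number of additional seeded neighbors v needs to be active at t = 1 (Eq. 7)."""
    return max(0, theta - len(graph[v] & seeds))


def surrounding_set(graph: Graph, seeds: Set[Hashable], v: Hashable,
                    theta: int, alpha: float) -> Set[Hashable]:
    """Cheapest gamma(v) inactive neighbors of v, or the empty set if v is not eligible."""
    need = gamma(graph, seeds, v, theta)
    if v in seeds or len(graph[v]) < theta or need == 0:
        return set()
    inactive_nbrs = graph[v] - seeds
    # Ascending cost d(u)^alpha; equal costs fall back to the node label, which is
    # a simplification assumed here in place of the tiebreak rule.
    by_cost = sorted(inactive_nbrs, key=lambda u: (len(graph[u]) ** alpha, u))
    return set(by_cost[:need])


# Hypothetical toy graph: hub 0 with neighbors 1..4, plus the edge 1-2.
toy = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
ss_hub = surrounding_set(toy, seeds=set(), v=0, theta=2, alpha=1.0)
```

With θ = 2 and α = 1, the hub costs 4 to seed directly, whereas its surrounding set consists of the two leaves 3 and 4 at a total cost of 2.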

4.3.1. Tiebreakers
Tiebreaking is an important matter during the surrounding set formation, since nodes with identical costs often differ with respect to their neighborhoods. Whenever the minimum cost can be achieved by more than one node, the tie is broken as follows. For $V^d$ we choose the node with the largest sum of its neighbors’ degrees; for $V^t$, the largest $\tau'$ is preferred. Formally, let $\psi : I \times I \to I$ denote the function that receives a pair of inactive nodes eligible for composing $\Gamma(v)$ and returns the selected one. Then,

\[
\psi(u_1, u_2) =
\begin{cases}
u_1 & \text{if } c(u_1) < c(u_2), \\
u_2 & \text{if } c(u_2) < c(u_1), \\
\arg\max_{u \in \{u_1, u_2\}} \sum_{w \in N(u)} d(w) & \text{if } V^d, \\
\arg\max_{u \in \{u_1, u_2\}} \tau'_u & \text{if } V^t.
\end{cases}
\tag{9}
\]

Note that $\tau''$ as a tiebreaker for $V^t$ would be less appropriate than $\tau'$, as $\tau''$ carries information from a region, not a node. A high $\tau''$ score means that the subgraph induced by $v$, $v$’s neighbors, and $v$’s neighbors’ neighbors has a large number of triangles, which does not imply that $\tau'$ is also large (in particular, even $\tau' = 0$ is possible). The fairly intuitive idea here is that a surrounding node with a larger number of triangles is more likely to surround other nodes, which in turn saves budget.
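For the degree ranking $V^d$, the rule in Eq. 9 can be expressed as a sort key, as in the Python sketch below. The function names are hypothetical, and the final fallback on the node label (used only when cost and neighbor-degree sum are both equal) is an assumption added to keep the order deterministic.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def neighbor_degree_sum(graph: Graph, u: Hashable) -> int:
    """Sum of the degrees of u's neighbors, the V^d tiebreak score."""
    return sum(len(graph[w]) for w in graph[u])


def sorted_neighbors_vd(graph: Graph, v: Hashable, alpha: float) -> List[Hashable]:
    """Neighbors of v, cheapest first; equal costs prefer the larger neighbor-degree sum."""
    return sorted(
        graph[v],
        key=lambda u: (len(graph[u]) ** alpha, -neighbor_degree_sum(graph, u), u),
    )


# Hypothetical toy graph: nodes 1 and 2 both have degree 2 (equal cost), but
# node 2 is attached to node 4, which has degree 3, while node 1 touches a leaf.
toy = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2, 5, 6}, 5: {4}, 6: {4}}
order = sorted_neighbors_vd(toy, v=0, alpha=1.0)
```

Nodes 1 and 2 cost the same, but node 2 has a neighbor-degree sum of 5 against 3 for node 1, so node 2 is placed first.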

4.4. The Extended Surrounding Set (ESS)
We now describe the extended surrounding set concept. The goal here is to achieve a more effective budget usage by extending the surrounding set approach (described in Section 4.3) to the two-hop neighborhood of each node, as follows. Let $\Gamma(\cdot)$ be as described in Eq. 8. Now let $\rho : I \to 2^I$ denote the function that receives a node $w \in I_0$ and returns the cheaper set between $\{w\}$ and $w$’s surrounding set $\Gamma(w)$, as follows:

\[
\rho(w) =
\begin{cases}
\{w\} & \text{if } c(w) < c(\Gamma(w)), \\
\Gamma(w) & \text{otherwise.}
\end{cases}
\tag{10}
\]

Note that $c(\emptyset) = \infty$ (as defined in Eq. 3). Also, recall that $\gamma(v)$ is the number of nodes that still need to be seeded for $v$ to be surrounded. Then, the Extended Surrounding Set (ESS) $\Gamma^+(v)$ is

\[
\Gamma^+(v) =
\begin{cases}
\emptyset & \text{if } |N(v)| < \theta \lor \gamma(v) = 0, \\
\underset{w \in N(v) \cap I_0}{\arg\min}\; c\Big(\bigcup_{i=1}^{\gamma(v)} \rho(w_i)\Big) & \text{otherwise.}
\end{cases}
\tag{11}
\]

Note that if $\Gamma^+(v)$ is seeded, then $v$ is activated no later than $t = 2$. Also, note that $c(\Gamma^+(v)) \le c(\Gamma(v))$, $\forall v \in V$. Finally, Algorithm 2 describes the construction of the extended surrounding set (ESS).
The algorithm: Algorithm 2 verifies whether the input node v is eligible
for being surrounded (line 2), returning an empty Γ+ if this is not the case.
If eligible, however, we then create an array, named arrayΓ, of d(v) initially-
empty sets, one for each u ∈ N (v) (line 3). For each neighbor u, its respective
set remains empty if u is already active. Conversely, if u ∈ I0 then its set within
arrayΓ will be that returned by ρ(·) (Eq. 10): either Γ(u) or {u} itself (lines
4-6). The resulting arrayΓ is then sorted ascending by cost. Because of the
initial validation performed over v (line 2), we are guaranteed the first γ(v) sets
of arrayΓ are not empty. Their union generates Γ+ (lines 8-11).
Computational complexity: Two parts of Alg. 2 dominate its overall
complexity, as follows. To form an ESS we first need to determine the surrounding
set (SS) of each neighbor w of v (via ρ(·), line 5). Thus, the SS computation
is performed d(v) times, at O(θ) each. After these iterations, the d(v)-size array
containing an SS for each of v's neighbors is sorted ascending by cost (line
7), which is O(d(v) log d(v)). Therefore, Alg. 2 is O(d(v)(θ + log d(v))).
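
Algorithms 1 and 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the graph is a dictionary of neighbor sets, the cost of a set is the sum of its node costs d(v)^α with c(∅) = ∞, γ(v) is taken to be θ minus the number of already active neighbors of v, and cost ties are broken with the V^d rule of Eq. 9. All function names are hypothetical.

```python
import math


def cost(adj, nodes, alpha):
    """Cost of a node set: sum of d(v)**alpha; the empty set costs infinity."""
    nodes = set(nodes)
    if not nodes:
        return math.inf
    return sum(len(adj[v]) ** alpha for v in nodes)


def gamma(adj, v, active, theta):
    """Seeds still needed to surround v (assumed: theta minus active neighbors)."""
    return max(0, theta - len(adj[v] & active))


def surrounding_set(adj, v, active, theta, alpha):
    """Algorithm 1: the gamma(v) cheapest inactive neighbors of v."""
    need = gamma(adj, v, active, theta)
    if v in active or len(adj[v]) < theta or need == 0:
        return set()
    # Ascending by cost; ties broken by larger sum of neighbor degrees.
    ordered = sorted(
        adj[v] - active,
        key=lambda u: (len(adj[u]) ** alpha, -sum(len(adj[w]) for w in adj[u])),
    )
    return set(ordered[:need])


def rho(adj, w, active, theta, alpha):
    """Eq. 10: the cheaper of {w} and the surrounding set of w."""
    ss = surrounding_set(adj, w, active, theta, alpha)
    return {w} if cost(adj, {w}, alpha) < cost(adj, ss, alpha) else ss


def extended_surrounding_set(adj, v, active, theta, alpha):
    """Algorithm 2: union of the gamma(v) cheapest rho(u), u an inactive neighbor."""
    need = gamma(adj, v, active, theta)
    if v in active or len(adj[v]) < theta or need == 0:
        return set()
    options = [rho(adj, u, active, theta, alpha) for u in adj[v] - active]
    options.sort(key=lambda s: cost(adj, s, alpha))
    ess = set()
    for s in options[:need]:
        ess |= s
    return ess
```

On a small graph in which node 0 has two expensive neighbors that each own three leaves, with θ = 2 and α = 2, the surrounding set of node 0 consists of the two expensive neighbors, whereas the ESS consists of four leaves at a much lower total cost.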

4.5. Seeding Policies


In the following, we describe the two different seeding policies adopted. Each
seeding strategy further evaluated is a combination of a node ranking and a
seeding policy.

Algorithm 1 SurroundingSet.
Require: v //{Input. v = node to be surrounded}
Require: Γ //{Output. The surrounding set Γ}
1: Γ ← ∅
2: if v ∉ I0 or d(v) < θ or γ(v) = 0 then return Γ
3: total ← 0
4: i ← 1 //{Accesses the node N(v)_i from v's neighbors. N(v), here, is already sorted, ascending by cost and then descending by tiebreaks.}
5: repeat
6:   if N(v)_i ∈ I0 then
7:     Γ ← Γ ∪ {N(v)_i}
8:     total ← total + 1
9:   end if
10:  i ← i + 1
11: until total == γ(v)
12: return Γ

Algorithm 2 ExtendedSurroundingSet.
Require: v //{Input. v = the node to be surrounded}
Require: Γ+ //{Output. The set Γ+ of eligible nodes (no seeding performed)}
1: Γ+ ← ∅
2: if v ∉ I0 or d(v) < θ or γ(v) = 0 then return Γ+
3: initialize(arrayΓ, d(v), ∅) //{Creates an array of |N(v)| initially-empty sets Γ.}
4: for i = 1, · · · , d(v) do
5:   if N(v)_i ∈ I0 then arrayΓ[i] ← ρ(N(v)_i) //{ρ(·) is as defined in Eq. 10.}
6: end for
7: sortAscendingByCost(arrayΓ) //{Empty sets have the highest cost, as in Eq. 3.}
8: for i = 1, · · · , γ(v) do
9:   Γ+ ← Γ+ ∪ arrayΓ[i]
10: end for
11: return Γ+

• Node surround (NS): This policy consists in surrounding each visited node,
skipping those for which this is impossible. First, it determines
the ESS Γ+_{v1} of v1, the first node of a given node ranking, and seeds it
if c(Γ+_{v1}) ≤ b. It then evaluates v2 in the same way, trying to surround
it. If a node v cannot be surrounded (because either |N(v)| < θ or
c(Γ+_v) > b), it is skipped. If the ranking is completely traversed before
the budget is exhausted, it is traversed again, but this time NS
tries to seed each visited node directly. When the budget finally becomes
residual (i.e., smaller than the minimum node cost δ, as in Algorithm 4),
the process stops and the current set A0 of seeds is regarded as
complete. Two seeding strategies derive from the NS policy, one for each
node ranking: we denote by NS-D and NS-T the seeding strategies
that combine NS with V^d and V^t, respectively. Finally, Algorithm 4
describes both strategies.
Computational complexity: Before analyzing Alg. 4, we first need to
determine the complexity of seeding a given number of nodes, as described in
Alg. 3. The |C| iterations over the set of candidates (line 2) dominate the
overall complexity. Thus, Alg. 3 is O(|C|). Back to Alg. 4, its complexity
is dominated, in the worst case, by visiting all nodes in V (lines 4-11), in
order to have their ESS Γ+ determined (line 7) and seeded (line 8). Note
that the max-size Γ+ occurs when it is formed by θ surrounding sets of size
θ each. Therefore, Alg. 3 in the worst case will try to seed |Γ+| = θ^2 nodes.
Thus, the main complexity of Alg. 4 arises from n formations of an ESS (each of which is
O(d(v)(θ + log d(v)))) summed over |C| trials of seeding θ^2 nodes. Thus,
Alg. 4 is O(n(d(v)(θ + log d(v)) + θ^2)). Assuming θ ≤ log n, and observing
that d(v) < n, Alg. 4 is O(n^2 θ).
• Cheapest nodes (CH): This simple policy is combined only with the
degree ranking V^d, taken in ascending order, thus starting from the network's
smallest degree. For each node vi visited, CH seeds it directly. Clearly,
one single ranking traversal suffices to determine A0. Note that any seed
set formed here will have the largest size possible for a budget b. The
moment the budget b is no longer effective, the process stops and the current
set A0 of seeds is regarded as complete. Because of its simplicity, we have
omitted its related algorithm. Finally, since it is unique, we will also refer
to the seeding strategy formed by combining CH with V^d as simply CH.
Computational complexity: In the worst case, b is large enough for the
strategy to seed the entire network. Since CH tries to seed one node at a
time, Alg. 3 here is O(1), hence CH is O(n).
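
The two policies above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: `cost` is assumed to be a dictionary of node costs, `ess_fn(v, seeds)` is a hypothetical callable returning the ESS of v given the current seeds, and the check `v in inactive` in the second pass is an added guard so that an already seeded node is not paid for twice.

```python
def try_to_seed(candidates, cost, inactive, seeds, budget):
    """Algorithm 3: seed every candidate if the budget covers their total cost.

    Mutates `inactive` and `seeds` in place and returns the updated budget.
    An empty candidate set is never seeded (its cost is infinite in the text).
    """
    total = sum(cost[v] for v in candidates)
    if candidates and budget >= total:
        inactive -= candidates
        seeds |= candidates
        budget -= total
    return budget


def node_surround(ranking, cost, budget, ess_fn):
    """Algorithm 4 (NS): surround ranked nodes, then seed directly with what is left."""
    delta = min(cost.values())  # minimum node cost
    inactive, seeds = set(ranking), set()
    for v in ranking:  # first pass: try to surround each visited node
        if budget < delta:
            break
        if v in inactive:
            budget = try_to_seed(set(ess_fn(v, seeds)), cost, inactive, seeds, budget)
    for v in ranking:  # second pass: direct seeding
        if budget < delta:
            break
        if v in inactive:  # added guard, see the lead-in
            budget = try_to_seed({v}, cost, inactive, seeds, budget)
    return seeds, budget


def cheapest_first(ranking_ascending, cost, budget):
    """CH: seed nodes directly in ascending-degree order while the budget allows."""
    seeds = set()
    for v in ranking_ascending:
        if budget < cost[v]:
            break  # costs do not decrease along the ranking, so nothing else fits
        seeds.add(v)
        budget -= cost[v]
    return seeds, budget
```

The early `break` in `cheapest_first` relies on the cost being non-decreasing in the degree, which holds for c(v) = d(v)^α with α > 0.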

4.6. Cost-weighted rankings


The seeding policies presented in Section 4.5 evaluate nodes sequentially,
from most central to least central, as given by either V^d or V^t. This approach
ignores that a strategy, right after having surrounded a node v1, may visit
another node v2 that, although slightly less central, is much less costly

Algorithm 3 tryToSeed.
Require: C ⊆ I0 //{Input. The set C of candidates for being seeded.}
Require: I0, A0, b //{Output. Updated I0, A0 and b}
1: if b ≥ c(C) then
2:   for all v ∈ C do
3:     I0 ← I0 \ {v}
4:     A0 ← A0 ∪ {v}
5:     b ← b − c(v)
6:   end for
7: end if

Algorithm 4 NS (Node Surround).
Require: G = (V^r, E), δ, θ, b //{Input. δ = minimum cost; V^r is either V^d or V^t}
1: I0 ← V^r //{Set of inactive nodes}
2: A0 ← ∅ //{Set of active nodes}
3: i ← 1
4: while b ≥ δ and i ≤ n do
5:   v ← V^r_i //{Next node in the ranking}
6:   if v ∈ I0 then
7:     Γ+ ← extendedSurroundingSet(v)
8:     tryToSeed(Γ+, I0, A0, b) //{Algorithm 3}
9:   end if
10:  i ← i + 1
11: end while
12: i ← 1 //{Ranking is traversed again to exhaust the remaining effective budget}
13: while b ≥ δ and i ≤ n do
14:   v ← V^r_i
15:   tryToSeed({v}, I0, A0, b)
16:   i ← i + 1
17: end while
18: return A0

for being surrounded. Intuitively, if nodes with a better centrality/cost-to-surround
ratio were always surrounded first, the budget usage would be
more effective, in the sense of allowing further seeding, with potentially larger
diffusion.
We thus introduce cost-weighted rankings, defined as follows. For each node
v, its ranking score (either d(v) for V^d or τ″_v for V^t) is divided by the cost of
having v surrounded (i.e., d(v)/c(Γ+_v) or τ″_v/c(Γ+_v)). Such a division yields a
second score λ_v ∈ R+ for v, upon which V is sorted, yielding a new ranking V_Γ.
For every v such that Γ+_v = ∅ (nodes that cannot be surrounded), its original
score is divided by a constant M = θ · c(d_L), where d_L is the network's largest
degree and c(d_L) is, consequently, the highest cost. Note that this makes all
such nodes low ranked but still numerically comparable. Henceforward,
cost-weighted rankings will be referred to as V^d_Γ and V^t_Γ for degree-based
and triangle-based rankings, respectively.
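
A minimal sketch of the cost-weighted score λ_v follows. It assumes the centrality scores and the ESS costs have already been computed and are passed in as dictionaries, with an infinite cost marking nodes whose ESS is empty; the function and parameter names are hypothetical.

```python
import math


def cost_weighted_ranking(score, ess_cost, theta, max_cost):
    """Sort nodes by lambda_v = score / cost-to-surround, largest first.

    score:    node -> d(v) or tau''_v
    ess_cost: node -> c(ESS(v)); math.inf when the ESS is empty
    max_cost: c(d_L), the cost of the largest-degree node
    """
    big_m = theta * max_cost  # constant M used for nodes that cannot be surrounded
    lam = {}
    for v, s in score.items():
        c = ess_cost[v]
        lam[v] = s / (big_m if math.isinf(c) else c)
    return sorted(lam, key=lam.get, reverse=True), lam
```

For instance, with scores {a: 10, b: 8, c: 3}, ESS costs {a: 20, b: 4, c: ∞}, θ = 2 and c(d_L) = 100, node b is ranked first even though a has the larger centrality, and c is ranked last with a small but finite score.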

5. Evaluation
We start by presenting the performance metrics used. We then describe the
different networks and parameters used, followed by numerical evaluations and
main findings.

5.1. Performance metrics


We consider the following performance metrics:
• Fraction of activated nodes at quiescence time, namely |Aq |/n.
• Diffusion Power (DP): fraction of non-seeds influenced at the end of the
spreading, denoted by ∆ : A → R+ such that

\[
\Delta(A_0) = \frac{\sigma(A_0)}{n}, \tag{12}
\]

where σ(A0) = |Aq| − |A0| denotes A0's Outward Influence (OI) [32], as
defined in Eq. 4. Note that DP allows comparisons between networks of
different sizes.
• Average Diffusion Power (ADP) over thresholds. Formally, consider a
network G = (V, E) and let θm and θM denote, respectively, the minimum
and maximum values of θ from the range of thresholds applied on
G. Consider θm = 2 for all networks and θM being network-specific, as
indicated in Table 2. Now let A0,θ denote a strategy's seed set for a given
θ. Thus, the average diffusion power \bar{∆} is given by:

\[
\bar{\Delta} = \frac{\sum_{\theta=2}^{\theta_M} \Delta(A_{0,\theta})}{\theta_M - 1}. \tag{13}
\]
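
The metrics above can be sketched as follows. The sketch assumes a deterministic, progressive diffusion in which an inactive node becomes active once at least θ of its neighbors are active (our reading of the fixed-threshold LT model used in the paper), with the graph stored as a dictionary of neighbor sets; the function names are hypothetical.

```python
def spread(adj, seeds, theta):
    """Run the diffusion until quiescence and return the final active set A_q."""
    active = set(seeds)
    while True:
        newly = {v for v in adj
                 if v not in active and len(adj[v] & active) >= theta}
        if not newly:
            return active
        active |= newly


def activated_fraction(adj, seeds, theta):
    """Classical metric: |A_q| / n."""
    return len(spread(adj, seeds, theta)) / len(adj)


def diffusion_power(adj, seeds, theta):
    """Eq. 12: (|A_q| - |A_0|) / n."""
    return (len(spread(adj, seeds, theta)) - len(set(seeds))) / len(adj)


def average_diffusion_power(adj, seeds_by_theta, theta_max):
    """Eq. 13: mean DP over theta = 2, ..., theta_max."""
    total = sum(diffusion_power(adj, seeds_by_theta[t], t)
                for t in range(2, theta_max + 1))
    return total / (theta_max - 1)
```

On a five-node example in which two seeds end up activating two further nodes for θ = 2, the activated fraction is 0.8 while the DP is 0.4, which shows how the classical metric also counts the seeds themselves.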

5.2. Budget Model
As stated in Section 3.3, a fixed budget must be defined as part of the
problem's input. Thus, strictly for evaluation purposes, we propose a budget
model which ties the value of b to both the network structure and the cost
function. This favours the comparison between different networks: if we were
to give the same amount b to all networks instead, those with larger average
degree (and consequently higher average cost) would likely form smaller seed
sets. Our budget model therefore aims at balancing the relationship between
network connectedness and seeding capability.
The idea is for the budget to cover the cost of a given number of
virtual nodes whose degree matches the average. Thus, let d̄ = (Σ_{v∈V} d(v))/n
denote the network's average degree, and let 0 < k ≤ 1 denote a fixed fraction.
Then,

\[
b = kn \cdot c(\bar{d}) = kn\,\bar{d}^{\,\alpha}. \tag{14}
\]

The budget thus corresponds to the cost of kn virtual, average-degree nodes;
hence bigger networks, as well as networks with larger d̄, receive larger budgets. We
will adopt very small values for k (such as k = 10^−4) and hence give strategies
meager resources, thus reinforcing the need for good decisions. Clearly, generous
budgets would loosen the demand for smart choices of seeds.
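
As a quick illustration of Eq. 14, the budget can be computed either from a graph or directly from summary statistics. The function names below are hypothetical, and the graph is assumed to be a dictionary of neighbor sets.

```python
def budget_from_stats(n, avg_degree, k, alpha):
    """Eq. 14: b = k * n * (average degree)**alpha."""
    return k * n * avg_degree ** alpha


def budget(adj, k, alpha):
    """Same as above, with n and the average degree taken from the graph."""
    n = len(adj)
    avg_degree = sum(len(nbrs) for nbrs in adj.values()) / n
    return budget_from_stats(n, avg_degree, k, alpha)
```

For example, with n = 4039 and an average degree of 43.69 (the Fb values reported in Table 2), k = 0.0005 and α = 2.0 give a budget of roughly 3855 cost units, i.e. the cost of about two average-degree nodes.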

5.3. Network datasets and evaluation scenarios


We evaluate the strategies empirically upon real, undirected networks. Ta-
ble 2 lists the networks evaluated, including their basic statistics. More informa-
tion on these datasets may be found at [23]. For readability, original names have
been replaced by aliases as follows: Fb = “ego-Facebook”; hep = “ca-HepPh”;
astro = “ca-AstroPh”; cmat = “ca-CondMat”; enron = “email-Enron”; Bk =
“loc-Brightkite” and dblp = “com-DBLP”. Table 2 provides, for each network,
its number of nodes, number of edges, Degree Assortativity (DA) [29], percent-
age of nodes in the largest connected component (%LCC), average degree d,
maximum degree, and the maximum network threshold θM used during the
experiments. It is worth noting that the minimum degree for all networks is 1.
Note that the networks have quite different structures, not only with respect
to their sizes (orders of magnitude) but also their connectedness, expressed by
both their average degree d̄ and degree assortativity (DA). For instance, astro
has less than 10% of dblp's size, but a d̄ that is 3 times larger. The DA for enron reveals
that this network is slightly disassortative, meaning that neighbor nodes tend
to have considerably different degrees. Conversely, hep is quite assortative,
which in turn means that node degrees across its neighborhoods are more alike.
Considering all networks, note that the average degree is more than 20 times
smaller than the maximum degree, and sometimes more than 100 times smaller (Bk).
These networks, therefore, provide a structurally rich context to evaluate the
proposed strategies.

Table 2: Network datasets and some basic statistics (d̄: average degree; dM: maximum degree; θM: maximum threshold used).
Network   |V|      |E|       DA      %LCC    d̄       dM     θM
astro     18772    198050    0.21    95.4%   21.10   504    15
cmat      23133    93439     0.13    92.3%   8.08    281    10
Fb        4039     88234     0.064   100%    43.69   1045   13
Bk        58228    214078    0.011   97.4%   7.35    1134   20
dblp      317080   1049867   0.27    100%    6.62    343    11
enron     36692    183832    -0.11   91.8%   10.02   1383   35
hep       12008    118489    0.63    93.3%   19.73   491    15

5.4. Model parameters


Besides the different networks, we also consider different settings for the
parameters associated with the model, as follows.
• Node cost: For the node cost (c(v) = d(v)^α) we consider α ∈ {0.5, 1.0, 2.0},
which yields node costs with essentially different levels of dependence
on node degree (sub-linear, linear and quadratic, respectively). This
allows the proposed strategies to be evaluated under fundamentally different
regimes.
• Budget: For the initial budget (b = kn d̄^α) we consider k ∈ {0.0005, 0.0002}.
• Linear threshold: The range of thresholds adopted is θ ∈ {2, 3, 4, ..., θM},
where θM is network-specific, as described in Table 2.
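
To give a sense of the size of the resulting scenario space, the parameter combinations listed above can be enumerated as follows. The θM values are copied from Table 2; the names `THETA_MAX`, `ALPHAS`, `KS` and `scenarios` are hypothetical.

```python
from itertools import product

# Maximum threshold per network, as listed in Table 2.
THETA_MAX = {"astro": 15, "cmat": 10, "Fb": 13, "Bk": 20,
             "dblp": 11, "enron": 35, "hep": 15}
ALPHAS = (0.5, 1.0, 2.0)   # cost exponents
KS = (0.0005, 0.0002)      # budget fractions


def scenarios():
    """Yield every (network, alpha, k, theta) combination."""
    for (net, theta_max), alpha, k in product(THETA_MAX.items(), ALPHAS, KS):
        for theta in range(2, theta_max + 1):
            yield net, alpha, k, theta
```

Under these settings the generator yields 672 setups per strategy: 112 network-threshold pairs times 3 values of α times 2 values of k.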

Because of the model’s deterministic nature, a single run per setup is enough.
Nevertheless, the various possible parameter combinations lead to a broad range
of scenarios and regimes. Although we have extensively studied many different
setups, only a small fraction of our results is presented here, in order to better
illustrate and highlight our main findings.

5.5. Direct results


Figure 1 shows the fraction of activated nodes at quiescence time q for the
network Fb as a function of the diffusion threshold θ. Note that as θ increases the
fraction of activated nodes decreases for every strategy since diffusion becomes
more stringent. The different plots correspond to different approaches to
cost-awareness. The DS strategy shown in Figures 1(a) and 1(b) performs direct
seeding of each node according to the rank ordering, budget permitting, and skips
the node otherwise. The rankings associated with DS in Figure 1(a) are V^d and
V^t, as defined in Sections 4.1 and 4.2, thus forming the strategies DS-D and
DS-T, respectively. In Figure 1(b), rankings are such that the position of each
node v is given by its centrality/cost ratio, from largest to smallest, namely
d(v)/c(v) and τ″_v/c(v) for degree and triangle centrality, respectively. We will
thus denote the corresponding strategies by DS-Dc and DS-Tc, respectively.
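
For the degree-based case, the centrality/cost ranking can be sketched as below, assuming c(v) = d(v)^α so that the ratio simplifies to d(v)^(1−α). The function name is hypothetical, and degrees are assumed to be at least 1, which holds for all networks in Table 2.

```python
def ds_dc_ranking(degree, alpha):
    """Rank nodes by d(v)/c(v) with c(v) = d(v)**alpha, largest ratio first."""
    ratio = {v: d / d ** alpha for v, d in degree.items()}
    return sorted(ratio, key=ratio.get, reverse=True)
```

Since the ratio equals d(v)^(1−α), a quadratic cost (α = 2.0) makes this ranking visit low-degree nodes first, a sub-linear cost (α = 0.5) preserves the order of V^d, and a linear cost (α = 1.0) makes all ratios equal.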

(a) central nodes first (b) best centrality/cost first (c) central nodes surrounded

Figure 1: Fraction of activated nodes per threshold on Fb network from cost-unaware strategies
(left), cost-awareness through centrality/cost ratio (center), and cost-awareness by surround-
ing the central nodes (right). α = 2.0 and k = 0.0005.

The comparison of Figures 1(a) and 1(b) illustrates the importance of
cost-awareness for BIM. Nevertheless, the strategies' performance may still be
improved significantly: Figure 1(c) shows that broad diffusion is still possible
for considerably larger values of θ when adopting the NS strategy over either
V^d or V^t. Indeed, θ = 7 is the maximum threshold for which DS-Dc still
induces meaningful propagation, activating almost 40% of the network (Fig. 1(b)),
whereas NS-D manages to influence more than 40% even when θ = 9 (Fig. 1(c)).
As for DS-Tc, note that its spreading is hindered whenever θ > 6 (Fig. 1(b)),
whereas NS-T still manages to influence 20% of the network when θ = 10
(Fig. 1(c)).
The cost-weighted ranking V_Γ (Section 4.6) may substantially increase the
resilience of the different strategies to growing values of θ, i.e., it may increase
the maximum θ for which strategies still manage to trigger broad diffusion; it is
therefore adopted in many scenarios. Figure 2 shows the fraction of activated nodes
per threshold from diffusion simulated over the network cmat, when k = 0.0005.
Costs are based on α ∈ {0.5, 2.0}, and hence node degree impacts the cost
either slightly (α = 0.5) or enormously (α = 2.0). Note, by comparing Figures
2(a) and 2(b) (both with α = 2.0), that the use of V_Γ (Figure 2(b)) led both
NS-based strategies to induce propagation for greater values of θ.
Interestingly, our results also demonstrate that the use of VΓ when α ≤ 1
does not necessarily improve strategies’ performance, as shown in figures 2(c)
and 2(d). Note that VΓt diminishes NS-T’s resilience to θ, as this strategy
no longer manages to induce propagation from θ = 5 on. Figure 2 therefore
illustrates a behavior discussed in Section 3.2: when conceiving the seed set,
node costs may become a critical (resp. negligible) concern as the gaps between
costs enlarge (resp. shrink).
Figures 1 and 2 have shown the fraction of activated (influenced) nodes at

(a) α = 2.0; V d and V t (b) α = 2.0; VΓd and VΓt

(c) α = 0.5; V d and V t (d) α = 0.5; VΓd and VΓt

Figure 2: Comparison of the fraction of activated nodes per threshold on the cmat network when
strategies are given regular rankings (left) and cost-weighted rankings (right), under quadratic
(top) and sub-linear (bottom) cost regimes.

(a) Classical (b) DP

Figure 3: Fraction of activated nodes (left) and Diffusion Power (right) per Threshold for a
diffusion simulated on hep network (α = 2.0 and k = 0.0005).

quiescence time. When considering BIM, however, this metric may be far
from capturing the real effectiveness of a seeding strategy. Indeed, since the
initial budget can be arbitrarily large, a huge seed set may be formed and yet
trigger no propagation. From the perspective of viral marketing
campaigns, wherein the real benefit (profit) comes from persuasion, hiring
many and convincing none is the worst possible scenario. Thus, to assess
the seeding strategies' performance more properly, we capture their relative
Outward Influence, which yields what we call diffusion power (DP): a metric
that captures the fraction of non-seeds activated at quiescence time (Eq. 12),
hence denoting how diffusive the seeds are irrespective of the network size.
Figure 3 compares these two different measures of influence (classical and
DP), and illustrates how the understanding on each strategy’s performance
varies according to the metric. Note, for instance, that the classical approach
(Fig. 3(a)) indicates that for any θ > 10 the CH strategy performs better than
the others, and this perception is intensified when θ = 15, wherein CH seem-
ingly influences around 15% more nodes than any other strategy. A completely
different viewpoint is captured by DP (Fig. 3(b)), whereby it becomes evident
that the CH seeds are no longer diffusive as early as θ = 7.

5.6. Comparison with other approaches


Beyond the simple baseline method CH, we have also compared NS with
state-of-the-art algorithms discussed in Section 2, namely GR and BCT. In what
follows, we provide a brief explanation on both algorithms and also describe,
for each one, our method to adapt it to our model.

5.6.1. GR
Improved Greedy is the seminal strategy for determining seeds for BIM; a
cost-normalized greedy algorithm, proposed by Nguyen and Zheng in [31] and

referred to in this text as GR. It considers an influence function σ(·) such that,
given S ⊆ V , σ(S) = E[|Aq |] when A0 = S. Thus, σ(S ∪ {v}) − σ(S) expresses
the marginal gain of a node v ∈ V \S. The seed set is then built upon a two-stage
process, briefly described as follows.
First, at each round the node with the largest cost-normalized marginal gain,
i.e., the node v ∈ V\S with the largest (σ(S ∪ {v}) − σ(S))/c(v), is added to the
candidate seed set S if the remaining budget b covers its cost. This step is
repeated until b is exhausted. Then, the estimated total influence σ(S) of the
candidates is compared with the influence σ(u) of the most influential node u
in the network: if σ(S) ≥ σ(u), then A0 = S; otherwise, A0 = {u}.
Note that this algorithm assumes that the initial budget is always at least
as large as the cost of the most expensive node in the network. As discussed
in Section 2, we do not make such an assumption in this paper. Consequently,
in our experiments the second stage of GR is not guaranteed to take place in case
σ(u) > σ(A0).
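
The two-stage process described above can be sketched as follows. This is a simplified reading of the description in the text, not Nguyen and Zheng's implementation: `sigma` is a hypothetical oracle that returns the (estimated) influence of a frozen set of nodes, and the single best node is returned only when the budget covers it, which reflects the caveat about the second stage.

```python
def improved_greedy(nodes, cost, budget, sigma):
    """Cost-normalized greedy selection followed by a single-node comparison."""
    chosen, remaining = set(), budget
    candidates = set(nodes)
    while candidates:
        base = sigma(frozenset(chosen))
        # Node with the largest marginal gain per unit of cost.
        best = max(candidates,
                   key=lambda v: (sigma(frozenset(chosen | {v})) - base) / cost[v])
        candidates.discard(best)
        if cost[best] <= remaining:
            chosen.add(best)
            remaining -= cost[best]
    # Second stage: compare with the most influential single node.
    top = max(nodes, key=lambda v: sigma(frozenset({v})))
    if sigma(frozenset({top})) > sigma(frozenset(chosen)) and cost[top] <= budget:
        return {top}
    return chosen
```

With a toy coverage-style oracle, a small budget returns the greedily built set, whereas a budget large enough to afford the most influential node may return that node alone.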
GR was designed under the Independent Cascades (IC) model [17], which
allows the influence σ(·) of each node to be estimated independently. It thus
cannot be readily applied to the LT model. Therefore, in order to determine
the seed set for GR, we run it under the same scheme of [31], namely, IC
with identical activation probabilities p = 0.1 across all edges, and σ(·)
estimation via CELF [22] with 10000 runs per node. We have chosen CELF because
Nguyen and Zheng report it as the estimator upon which GR yields the best
solution quality among the techniques employed in their evaluation section. Once
determined, this seed set is then used under the LT model as considered in this
paper.

5.6.2. BCT
Albeit designed for the LT model, two adjustments are still made necessary
prior to employing BCT in our context. These relate to (i) the problem BCT
must tackle (BIM instead of CTVM), and (ii) the node thresholds we need BCT
to consider when estimating each node’s influence. Both items are straightfor-
wardly addressed as follows. Towards (i), we simply follow the lines discussed
in Section 2, thus equalizing the benefit b(·) of every node. More specifically,
∀v ∈ V we set b(v) = 1, hence making BCT aim at the BIM problem rather
than CTVM.
Regarding (ii), for a better description of our procedure to cope with the
different threshold values between BCT’s original model (node-specific) and ours
(single value θ across the entire network), we first briefly describe BCT’s original
LT model as follows. For each node v ∈ V, let λv denote v's (random) threshold,
uniformly drawn from the real interval [0, 1]. Also, let w : V × V → R+ give
the weight w(u, v) associated with every edge (u, v), ∀u ∈ N(v). Time evolves
discretely, such that t = {1, 2, 3, ...}. Then, v gets activated at t if
Σ_{u ∈ N(v) ∩ A_{t−1}} w(u, v) ≥ λv. As usual, the spreading is assumed to be progressive.
Note that such a model allows us to identically reproduce the deterministic
LT spreading behaviour. We simply need to define each edge weight conveniently.
Indeed, for each node v ∈ V, we assign the weight λv/θ to all of its
edges, thus accomplishing this goal. Note that Σ_{u∈N(v)} w(u, v) > 1 becomes
a possible condition in this case, and may hold for many nodes in fact. Yet,
Mossel and Roch proved in [28] that even such instances of the LT model yield a
submodular objective function σ(·). This in turn means that we are preserving
all of BCT’s original model fundamentals, while capturing the desired single-
threshold behaviour. For the remaining parameters, we have adopted the same
values set in [33]. We then determine BCT’s seed set by running the above
setup over the network of interest. Finally, we evaluate its seeds under the
deterministic LT model (which behaves exactly as the adaptation above).
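
The equivalence claimed above can be checked with a few lines of Python. The sketch compares the weighted activation rule (every edge into v carries weight λv/θ) with the fixed-threshold rule (at least θ active neighbors), using exact rational arithmetic to avoid floating-point rounding at the boundary. The function names are hypothetical, and λv is assumed to be strictly positive.

```python
from fractions import Fraction


def activates_weighted(active_neighbors, lam, theta):
    """Weighted LT rule: total incoming weight from active neighbors >= lam."""
    weight = Fraction(lam) / theta  # weight assigned to every edge into v
    return active_neighbors * weight >= Fraction(lam)


def activates_fixed(active_neighbors, theta):
    """Fixed-threshold rule: at least theta active neighbors."""
    return active_neighbors >= theta
```

For any λv > 0, k active neighbors contribute a total weight of kλv/θ, which reaches λv exactly when k ≥ θ, so both rules agree regardless of the value drawn for λv.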
Figure 4 compares the two NS-based strategies with both GR [31] and
BCT [33] on three networks under different node costs and budgets. Note that
when considering the fraction of activated nodes, CH and BCT are slightly su-
perior to NS strategies for small threshold values. This is related to the fact
that these strategies often generate larger seed sets. However, as θ increases, NS
strategies become superior in terms of inducing broader propagation. Indeed,
this is confirmed when considering the Diffusion Power (Figure 4(d-f)) which
shows that for almost all thresholds, NS seeds are at least as diffusive as all
others.
Figures 4(b) and 4(e) illustrate how the network structure may affect the
strategies' outcomes. Indeed, hep is considerably more assortative than any other
network studied. This means that expensive nodes' neighbors tend to be expensive
as well, which likely explains why NS performs equivalently to GR.
Results like those of Figure 4(d), where NS diffusion considerably
outperforms both GR and BCT for various threshold values, were also obtained
in many other scenarios, whereas the opposite was not observed.

5.7. Extended Surrounding Sets


Our last set of results concerns the Average Diffusion Power (Eq. 13) of
different seeding strategies, for α = {1.0, 2.0} and k = {0.0005, 0.0002}. Fig-
ure 5 illustrates how strategies’ performances tend to increase when considering
two-hop neighborhoods to surround nodes (NS2-D and NS2-T strategies). The
average was taken for each network for θ ∈ [2, θM ], with θM shown in Table 2.
Note that, in general, the NS2-D (resp. NS2-T) strategy performs at least as
well as its NS1-D (resp. NS1-T) counterpart. Interestingly, the gap between
their performances enlarges as both the overall cost increases (controlled by α;
figures 5(b) and 5(d)) and the initial budget diminishes (controlled by k; figures
5(c) and 5(d)), indicating the benefits of extended surrounding sets.

6. Conclusion

The understanding of how propagators (like ideas and information) spread
through a network lies at the core of many applications, such as viral marketing,
information diffusion and epidemic prevention. The reach of a diffusion,
however, strongly depends on where it starts, which establishes influence maximization
as a fundamental, largely-studied problem.

(a) astro; b = 4000; α = 2.0 (b) hep; b = 2000; α = 0.5 (c) Fb; b = 2000; α = 1.0

(d) astro; b = 4000; α = 2.0 (e) hep; b = 2000; α = 0.5 (f) Fb; b = 2000; α = 1.0

Figure 4: Fraction of Activated Nodes (top) and Diffusion Power (bottom) as a function of
threshold value (θ) for the different strategies, including the Improved Greedy (GR) [31] and
BCT [33].

(a) k = 0.0005; α = 1.0 (b) k = 0.0005; α = 2.0

(c) k = 0.0002; α = 1.0 (d) k = 0.0002; α = 2.0

Figure 5: Average Diffusion Power of the different strategies across all evaluated networks
under two different budgets and node cost parameters.

This work focuses on the BIM problem, and proposes an efficient strategy
to tackle it. We have considered progressive diffusion upon real networks under
the Linear Threshold model with fixed thresholds. The cost of a node is propor-
tional to how central (important) the node is, and degree centrality is used as a
proxy for node importance. We have proposed the NS seeding policy, evaluating
its efficiency under different sequences (according to some node ranking) and
comparing it with baseline and state-of-the-art methods. Each strategy was
given the same initial budget and a cost function to perform the seeding. We
have focused on scenarios where the initial budget is relatively scarce, with cost
gaps between nodes ranging up to orders of magnitude, as it better relates to
real cases.
We have proposed a novel approach to further leverage cost-effectiveness in
BIM problems, namely the surrounding sets. Our main results indicate that ESS
generally helps seeding strategies to yield broad influence spreading over
a wider range of network thresholds, when compared with state-of-the-art methods.
Finally, we have demonstrated that, as opposed to unit-cost IM, different
strategies in BIM may still yield seed sets of very different sizes upon
receiving the same initial budget. In such scenarios, we showed that the classical Fraction
of Activated Nodes may lead to wrong conclusions on which strategy is best.
Conversely, by considering only their diffusion power, we capture their real
benefit (activated non-seeds) upon an investment (budget). We showed how this
approach may completely change the understanding of a strategy's
effectiveness.

References

[1] Aral, S., Muchnik, L., Sundararajan, A., 2013. Engineering social conta-
gions: Optimal network seeding in the presence of homophily. Network
Science 1, 125–153.
[2] Arthur, D., Motwani, R., Sharma, A., Xu, Y., 2009. Pricing strategies for
viral marketing on social networks. In: Internet and Network Economics.
Vol. 5929 of Lecture Notes in Computer Science. pp. 101–112.

[3] Backstrom, L., Huttenlocher, D. P., Kleinberg, J. M., Lan, X., 2006. Group
formation in large social networks: membership, growth, and evolution. In:
ACM International Conference on Knowledge Discovery and Data Mining
(SIGKDD). pp. 44–54.
[4] Bae, J., Kim, S., Feb. 2014. Identifying and ranking influential spread-
ers in complex networks by neighborhood coreness. Physica A: Statistical
Mechanics and its Applications 395.
[5] Boguñá, M., Pastor-Satorras, R., Vespignani, A., Jan 2003. Absence of
epidemic threshold in scale-free networks with degree correlations. Phys.
Rev. Lett. 90, 028701.

[6] Brown, K., Jan. 2016. Here’s how much celebrities make in the instagram
product placement machine. Jezebel.

[7] Chen, D.-B., Xiao, R., Zeng, A., Zhang, Y.-C., Jan. 2014. Path diversity
improves the identification of influential spreaders. Europhysics letters 104.
[8] Chen, W., Wang, Y., Yang, S., 2009. Efficient influence maximization in so-
cial networks. In: ACM International Conference on Knowledge Discovery
and Data Mining (SIGKDD). pp. 199–208.
[9] Chen, W., Yuan, Y., Zhang, L., 2010. Scalable influence maximization in
social networks under the linear threshold model. In: IEEE International
Conference on Data Mining (ICDM). IEEE, pp. 88–97.
[10] de Souza, R. C., Figueiredo, D. R., de A. Rocha, A. A., Ziviani, A., 2014.
Evaluation of epidemic seeding strategies under variable node costs. In:
SBC Workshop em Desempenho de Sistemas Computacionais e de Comu-
nicação. WPerformance ’14.
[11] Gong, M., Yan, J., Shen, B., Ma, L., Cai, Q., 2016. Influence maximization
in social networks based on discrete particle swarm optimization. Inf. Sci.
367-368, 600–614.
[12] Granovetter, M., May 1978. Threshold models of collective behavior. Amer-
ican Journal of Sociology 83, 489–515.
[13] Guille, A., Hacid, H., Favre, C., Zighed, D. A., 2013. Information diffusion
in online social networks: A survey. ACM Sigmod Record 42 (2), 17–28.
[14] Han, S., Zhuang, F., He, Q., Shi, Z., 2014. Balanced seed selection for
budgeted influence maximization in social networks. In: Pacific-Asia Con-
ference on Knowledge Discovery and Data Mining. pp. 65–77.
[15] He, X., Song, G., Chen, W., Jiang, Q., 2012. Influence blocking maximiza-
tion in social networks under the competitive linear threshold model. In:
SIAM International Conference on Data Mining (SDM). pp. 463–474.
[16] Hinz, O., Skiera, B., Barrot, C., Becker, J. U., Nov. 2011. Seeding strategies
for viral marketing: An empirical comparison. Journal of Marketing 75, 55–
71.
[17] Kempe, D., Kleinberg, J., Tardos, E., 2003. Maximizing the spread of in-
fluence through a social network. In: ACM International Conference on
Knowledge Discovery and Data Mining (SIGKDD). pp. 137–146.
[18] Kitsak, M., Gallos, L. K., Havlin, S., Liljeros, F., Muchnik, L., Stanley,
H. E., Makse, H. A., 2010. Identification of influential spreaders in complex
networks. Nature Physics 6, 888–893.
[19] Kleinberg, J., Sep. 2007. Cascading behavior in networks: Algorithmic and
economic issues. In: Algorithmic Game Theory. pp. 613–632.
[20] Kornowski, L., May 2013. Celebrity sponsored tweets: What the stars get
paid for advertising in 140 characters. The Huffington Post.

[21] Leskovec, J., Adamic, L. A., Huberman, B. A., May 2007. The dynamics of
viral marketing. ACM Trans. Web 1 (1).
[22] Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J. M.,
Glance, N. S., 2007. Cost-effective outbreak detection in networks. In:
ACM International Conference on Knowledge Discovery and Data Mining
(SIGKDD). pp. 420–429.
[23] Leskovec, J., Krevl, A., 2014. SNAP Datasets: Stanford large network
dataset collection. http://snap.stanford.edu/data.
[24] Li, X., Smith, J. D., Dinh, T. N., Thai, M. T., 2019. Tiptop: (almost) exact
solutions for influence maximization in billion-scale networks. IEEE/ACM
Trans. Netw. 27 (2), 649–661.
[25] Li, Y., Fan, J., Wang, Y., Tan, K., 2018. Influence maximization on social
graphs: A survey. IEEE Trans. Knowl. Data Eng. 30 (10), 1852–1872.
[26] Liu, Y., Wei, B., Du, Y., Xiao, F., Deng, Y., May 2016. Identifying influential
spreaders by weight degree centrality in complex networks. Chaos, Solitons
and Fractals 86.
[27] Miyauchi, A., Iwamasa, Y., Fukunaga, T., Kakimura, N., 2015. Thresh-
old influence model for allocating advertising budgets. In: International
Conference on Machine Learning (ICML). pp. 1395–1404.
[28] Mossel, E., Roch, S., 2010. Submodularity of influence in social networks:
From local to global. SIAM J. Comput. 39 (6), 2176–2188.
[29] Newman, M. E. J., Oct 2002. Assortative mixing in networks. Phys. Rev.
Lett. 89, 208701.
[30] Newman, M. E. J., 2010. Networks: An Introduction.
[31] Nguyen, H., Zheng, R., 2013. On budgeted influence maximization in social
networks. IEEE Journal on Selected Areas in Communications 31 (6), 1084–
1094.
[32] Nguyen, H. T., Nguyen, T. P., Vu, T. N., Dinh, T. N., 2017. Outward
influence and cascade size estimation in billion-scale networks. In: SIG-
METRICS (Abstracts). ACM, p. 63.
[33] Nguyen, H. T., Thai, M. T., Dinh, T. N., 2017. A billion-scale approxi-
mation algorithm for maximizing benefit in viral marketing. IEEE/ACM
Trans. Netw. 25 (4), 2419–2429.
[34] Perlberg, S., Jun. 2016. Facebook signs deals with media companies,
celebrities for Facebook Live. The Wall Street Journal.
[35] Seidman, S. B., Sep. 1983. Network structure and minimum degree. Social
Networks 5 (3).

[36] Socievole, A., De Rango, F., Scoglio, C., Van Mieghem, P., 2016. Assessing
network robustness under SIS epidemics: The relationship between
epidemic threshold and viral conductance. Computer Networks 103, 196–206.
[37] Tang, Y., Shi, Y., Xiao, X., 2015. Influence maximization in near-linear
time: A martingale approach. In: SIGMOD Conference. ACM, pp. 1539–
1554.
[38] Tang, Y., Xiao, X., Shi, Y., 2014. Influence maximization: near-optimal
time complexity meets practical efficiency. In: SIGMOD Conference. ACM,
pp. 75–86.

[39] Taxidou, I., Fischer, P. M., 2014. Online analysis of information diffusion
in Twitter. In: International Conference on World Wide Web (WWW). pp.
1313–1318.
[40] Ugander, J., Backstrom, L., Marlow, C., Kleinberg, J., 2012. Structural
diversity in social contagion. Proceedings of the National Academy of Sci-
ences 109 (16), 5962–5966.
[41] Valente, T. W., 1996. Social network thresholds in the diffusion of
innovations. Social Networks 18 (1), 69–89.
[42] Wang, S., Wang, F., Chen, Y., Liu, C., Li, Z., Zhang, X., 2015. Exploit-
ing social circle broadness for influential spreaders identification in social
networks. World Wide Web 18 (3), 681–705.
[43] Xu, W., Liang, W., Lin, X., Yu, J. X., 2016. Finding top-k influential users
in social networks under the structural diversity model. Inf. Sci. 355-356,
110–126.

Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:

Ronald C. de Souza received an MSc degree in Computer Science from
the Universidade Federal Fluminense (UFF), Niterói, Brazil, in 2016, and is
currently pursuing his PhD degree in Computing & Systems Engineering at
the Universidade Federal do Rio de Janeiro (UFRJ), Brazil, advised by Prof.
Daniel R. Figueiredo. His research interests include Network Science and
Computer Networks.

Daniel R. Figueiredo received a PhD degree in Computer Science from
the University of Massachusetts Amherst (UMass) in 2005, after which he worked
as a post-doc researcher at the Swiss Federal Institute of Technology, Lausanne
(EPFL). In 2007, he joined the Department of Computer and Systems
Engineering (PESC/COPPE) at the Federal University of Rio de Janeiro (UFRJ),
Brazil, where he is currently an Associate Professor. His main interests are in
Network Science, in particular models for processes on dynamic networks.

Antonio A. de A. Rocha has been an Associate Professor in the Computer
Science Department of the Institute of Computing at the Fluminense Federal
University (UFF) since 2011. He received his bachelor's degree in Computer
Science from the University of Salvador (UNIFACS) in 2000, and MSc and PhD
degrees in Computer and Systems Engineering (PESC/COPPE) from the Federal
University of Rio de Janeiro (UFRJ), Brazil, in 2003 and 2010, respectively.
During his PhD, in 2008-2009, he was a visiting student in Computer Science at
the University of Massachusetts Amherst (UMass). In 2010, he worked as a
post-doc researcher at UFRJ, supported by INCT WebScience. Recently, he
returned to the University of Massachusetts Amherst for a sabbatical as a
visiting professor. He was awarded a Research Productivity Fellowship by CNPq
and named Young Scientist of Rio de Janeiro by FAPERJ. His areas of interest
include performance evaluation, traffic engineering, network measurement,
next generation Internet, network science, and security systems. Dr. Antonio
Rocha has published many papers in important journals and conferences. Some of
these works have received awards, such as Best Paper awards at ACM/CoNEXT,
SBC/SBRC, and SBC/WPerformance, and a nomination among the top-6 PhD theses
by the Brazilian Computer Society in 2012.

Artur Ziviani is a Senior Researcher at the National Laboratory for
Scientific Computing (LNCC), located in Petrópolis, Brazil. In 2003, he received
a Ph.D. in Computer Science at the LIP6 laboratory of the Université Pierre
et Marie Curie (Paris 6) - Sorbonne Universités, Paris, France, where he was
also a lecturer during the 2003-2004 academic year. He received a B.Sc.
degree in Electronics Engineering in 1998 and an M.Sc. degree in Electrical
Engineering (emphasis in Computer Networking) in 1999, both from the Federal
University of Rio de Janeiro (UFRJ), Brazil. From September 2008 to January
2009, he was a visiting researcher at INRIA in France. He is a member of the
Editorial Board of the journals Computer Networks (Elsevier) and IEEE
Communications Surveys & Tutorials. His current research interests include network
characterization, modeling, and analysis; network science; and interdisciplinary
research with a networking approach. He is a Member of SBC (the Brazilian
Computer Society), an Affiliated Member of the Brazilian Academy of Sciences,
and a Senior Member of both IEEE and ACM.
