Journal Pre-proof
PII: S0020-0255(19)31075-8
DOI: https://doi.org/10.1016/j.ins.2019.11.029
Reference: INS 15021
Please cite this article as: R.C. de Souza, D.R. Figueiredo, A.A. de A. Rocha, A. Ziviani, Efficient Net-
work Seeding under Variable Node Cost and Limited Budget for Social Networks, Information Sciences
(2019), doi: https://doi.org/10.1016/j.ins.2019.11.029
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
Abstract
The efficiency of information diffusion on networks depends heavily on both the
network structure and the set of early spreaders. Moreover, in various realistic
scenarios, seeding different nodes incurs different costs, as in the case of
viral marketing, where costs often correlate with local network structure. The
budgeted influence maximization (BIM) problem consists in determining a seed
set whose diffusion maximizes the total number of influenced nodes, provided
that the seeding cost is within a given budget. We investigate efficient seeding
strategies for the BIM problem under the deterministic fixed threshold diffusion
model. In particular, we introduce the concept of surrounding sets: relatively
cheap seeds neighboring expensive, structurally-privileged nodes, which then
become spreaders at lower costs. Numerical experiments with several real net-
works indicate our method outperforms strategies that seed nodes based on
their influence/cost ratios. A key insight from our evaluation is that larger dif-
fusion is generally attained from the surrounding sets that consider the two-hop
neighborhood of influential nodes, as opposed to their immediate neighbors only.
Keywords: network seeding, influence maximization, seeding strategies,
variable node cost, social network
1. Introduction
Spreading phenomena on real and online social networks, such as the diffusion
of content or diseases among individuals, receive ever-growing attention
from both academia and industry. Understanding how a diffusion can either last
long, reaching large fractions of network nodes, or die out quickly after a
negligible spread is a fundamental step in designing more effective diffusion
processes. These include more efficient marketing campaigns among users of online social
✩ This work was funded in part by research project grants from CNPq, FAPERJ, and
FAPESP.
Email addresses: rcsouza@cos.ufrj.br (R. C. de Souza), daniel@land.ufrj.br (D. R.
Figueiredo), arocha@ic.uff.br (A. A. de A. Rocha), ziviani@lncc.br (A. Ziviani)
receiving the same initial budget may still yield seed sets of very different
sizes, ranging from a few key nodes up to a large fraction of the network. By
considering solely their diffusion power (a metric defined in Section 5.1),
we capture the real benefit (activated non-seeds) of an investment (budget).
Diffusion power (DP) is a fundamental metric to properly assess
BIM strategies. It embeds the Outward Influence [32] concept, originally
proposed for IM, to tackle BIM. Indeed, we show that ignoring the seeds
(paid influencers) when measuring a strategy's performance eliminates
potentially large assessment distortions.
• We propose a flexible, single-parameter model for node cost, which corre-
lates cost and network centrality. To the best of our knowledge, this is the
first model to admit non-linear relations between cost and local structure.
It also captures the common real-world scenario wherein node-costs across
a network may differ from one another by many orders of magnitude, thus
being more relevant in practice.
We empirically evaluate different seeding strategies on 7 real networks, varying
the parameters for node cost, budget, and activation threshold. We show that
even when the cost of the most central nodes exceeds the budget, effective
propagation can still be induced. Moreover, we show that, as the network
threshold increases, cost-aware seeding strategies that select nodes by their
marginal-gain/cost ratios stop triggering effective diffusion at much lower
thresholds than strategies that surround central nodes. This finding is
strongly related to the diffusion model considered in this work, which
differs from that of most prior works (see Section 2).
Another important insight is that larger diffusion is generally attained when
the two-hop neighbors of a central node are also leveraged in order to have
it surrounded, and not only its direct neighbors. Last, we believe the main
contributions and findings of this work, such as the node-surround concept, can
be applied to other contexts.
The remainder of this paper is organized as follows. In Section 2 we discuss
the related work. Section 3 presents the diffusion model and the node cost
model. In Section 4, we describe the different seeding strategies, including the
node surround concept. Section 5 describes the different performance metrics,
the network datasets, and also presents our evaluation for various scenarios. We
then conclude the paper with a brief discussion in Section 6.
2. Related Work
models. However, its high running time has led to a myriad of approaches to
tackle the problem more efficiently [17, 2, 18, 8, 16, 1, 7, 42, 26, 4, 24, 25].
Indeed, various prior works have focused on designing heuristics to deter-
mine good seeds, exploring structural features of the network as well as features
associated with nodes (e.g., labels). For example, computationally-inexpensive
heuristics based on node degree [8], particle swarm optimization [11], and node
homophily [1] have all been considered. Heuristics based on k-core decomposition
[35] have also been explored [18], showing a correlation between influential
spreaders and highly connected regions of the network. This idea has been
explored by various subsequent works that also adapt and augment this metric
with node rankings [4], communities [42], disjoint paths [7], and local
neighborhoods [26].
The IM problem has also been investigated under diffusion models fundamentally
different from the widely adopted Independent Cascades (IC) and Linear
Threshold (LT) models. For instance, Ugander et al. [40] propose the structural
diversity model, further investigated by Wenzheng et al. [43]. However, all these prior
works implicitly assume that nodes have identical costs, since the only constraint
on starting a propagation is the number of seeds.
There are also recent works that have investigated network seeding where
node costs are neither fixed (over time) nor identical across the network. For
example, Leskovec et al. [22] propose strategies for placing sensors on a network
to detect a diffusion more quickly. Arthur et al. [2] propose strategies to price
products and provide cash-back (discount) to nodes in the network to induce
recommendations to their neighbors. Miyanchi et al. [27] formulate an opti-
mization problem wherein a fixed budget is allocated to a bipartite network of
marketing channels and customers with variable node costs (no diffusion con-
sidered). None of these works specifically addresses the BIM problem.
However, BIM has recently been formulated and investigated by Nguyen and
Zheng [31]. Starting from the framework introduced by Kempe et al. [17], the
authors tackle the problem using the IC model. They establish a submodular,
cost-normalized objective function, from which they derive a greedy algorithm,
referred to as GR in this text, with a constant-factor approximation
guarantee.
Other works have also investigated the BIM problem [14, 33, 10]. Han et al. [14]
tackle BIM with a heuristic combining two seeding strategies: one based on
node influence, and the other on node cost. More recently, Nguyen et al.
have formulated a more general problem, Cost-aware Targeted Viral Marketing
(CTVM) [33], briefly described as follows. Besides an arbitrary selection cost,
each node v also provides an arbitrary benefit b(v) for being activated. The goal
is thus to maximize not the influence spread but the total benefit provided by
the final active set. Besides CTVM, their algorithm, named BCT, also tackles
both the classical IM and the BIM problems. The latter, which is the scope
of this work, corresponds to the case where, given a network G = (V, E) and a
constant C ∈ R*+, b(v) = C, ∀v ∈ V. They show that, for IM, BCT significantly
outperforms state-of-the-art algorithms such as TIM/TIM+ [38] and IMM [37]
in terms of running time, with equal performance with regard to the spread
of influence. Also, when considering arbitrary selection costs, they report that BCT
outperforms all the above-mentioned algorithms, including GR, in terms of
producing a seed set that yields a final active set with larger overall benefit. For BIM,
however, they report that GR performs better than BCT in terms of total influenced
population. Last, Souza et al. [10] characterize the performance of simple and
traditional seeding strategies to solve the BIM problem, motivating the need for
more sophisticated strategies.
Despite addressing the BIM problem, these prior works have the following
limitations. The theoretical result of Nguyen and Zheng [31] assumes that the
initial budget is larger than the cost of any node. Moreover, their numerical
evaluation uniformly assigns random costs to nodes, from a small range (less
than a factor of 10). Similarly, Han et al. [14] and Nguyen et al. [33] assume
that the initial budget is larger than the cost of any node, and their numerical
evaluation considers that cost and node centrality are linearly tied. These
assumptions fall short of capturing more general pricing practices, such as those
adopted by celebrities (nodes) for promoting viral marketing in online social
networks [34, 20, 6]. In particular, marketing campaigns may not have sufficient
budget to hire the most expensive individuals. In what follows, we
propose a flexible node cost model that depends strictly on the network
structure. It allows for an arbitrary range of values, without making assumptions on
the available budget.
3. Models
We now describe the models for diffusion and node cost considered in this
work. While the former corresponds to the classical LT model, the latter is
proposed here for the first time. Table 1 describes the main symbols and
abbreviations used hereafter. We consider progressive influence spreading, i.e., every
node, once activated (i.e., influenced), remains active until the diffusion ends. Also,
time evolves discretely, with t ∈ {0, 1, 2, ...}. A set of nodes is assumed to
be active at time zero: the seed set, denoted A_0. At each time step, every node
belongs to exactly one of two sets: A_t, the active nodes, or I_t, the inactive
ones. Thus, A_t ∪ I_t = V and A_t ∩ I_t = ∅. The diffusion then unfolds until some
quiescence time q, corresponding to the first time t such that A_t = A_{t+i}, ∀i ∈ N*.
Thus, A_q denotes the set of activated nodes when the propagation ends.
Table 1: Main symbols and abbreviations.
Symbol Description
Note that the model evolves deterministically, controlled by a single parameter
θ. If θ = 1, then all nodes in the same connected component as at least one
seed will certainly belong to A_q. Trivially, only those nodes from I_0 with degree
of at least θ can possibly meet the threshold condition and thus become active
at some point.
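The dynamics described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the activation rule implied by the surrounding text (an inactive node becomes active once at least θ of its neighbors are active), an undirected graph stored as a dictionary of neighbor sets, and the hypothetical function name `diffuse`.

```python
def diffuse(adj, seeds, theta):
    """Run a deterministic fixed-threshold diffusion until quiescence.

    adj   : dict mapping each node to the set of its neighbors (assumed layout).
    seeds : iterable with the initially active nodes (the seed set A_0).
    theta : number of active neighbors an inactive node needs to activate
            (assumed rule, inferred from the text).
    Returns the final active set A_q.
    """
    active = set(seeds)
    while True:
        # Synchronous step: activations depend only on the current active set.
        newly = {v for v in adj
                 if v not in active and len(adj[v] & active) >= theta}
        if not newly:              # quiescence: the active set stopped growing
            return active
        active |= newly            # progressive model: nodes never deactivate
```

With θ = 1 the returned set is the union of the connected components that contain a seed, which matches the remark above.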
Since they were first proposed by Granovetter [12], threshold models have been
widely adopted to represent the collective dynamics through which information
spreads among individuals. This model and its generalizations are found in
a wide range of scenarios, in part due to its simplicity (deterministic, with
a single parameter) and common intuition [13]. Within the context of IM,
threshold models have been widely considered in studies of information
diffusion in social networks [17, 9, 15], information cascades on online social
networks [39], and diffusion of innovation [41], among many others. Last, although
our developments here focus on the LT model, the techniques and algorithms
we propose below can be applied to other propagation models.
\[
c(U) =
\begin{cases}
\infty & \text{if } U = \emptyset, \\
\sum_{v \in U} d(v)^{\alpha} & \text{otherwise.}
\end{cases}
\tag{3}
\]
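As a rough illustration of Eq. 3, the following Python sketch computes the cost of a node set from a degree table. The function name `set_cost` and the `degree` dictionary are assumptions made for this example only.

```python
import math

def set_cost(nodes, degree, alpha):
    """Cost c(U) of seeding the node set U, following Eq. 3.

    nodes  : iterable of nodes (the set U).
    degree : dict mapping node -> degree d(v) (assumed data layout).
    alpha  : the single cost-model parameter.
    """
    nodes = set(nodes)
    if not nodes:
        return math.inf            # c(empty set) is defined as infinity
    return sum(degree[v] ** alpha for v in nodes)
```

For instance, with α = 2 a node of degree 100 costs 10000 while a node of degree 1 costs 1, which shows how a single parameter can spread costs over several orders of magnitude.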
3.3. Problem Statement
In order to measure the effectiveness of a seed set A0 ⊆ V , we consider the
metric Outward Influence (OI), proposed in [32], defined as σ : V → R+ such
that
4. Seeding Strategies
“ranking-policy” combinations) to be evaluated in Section 5. Finally, in Sec-
tion 4.6 we present a cost-aware node ranking based on nodes’ centrality/cost-
to-surround ratios. Benefits and drawbacks of these cost-weighted rankings are
shown in Section 5.
therefore propose a node ranking based on triangles, defined as follows. First we
compute, for every node v_i, the number τ'_i of triangles that have v_i as a vertex.
Formally, let N(v_i) be the set of neighbors of v_i and 1(·) the indicator function.
Then
\[
\tau'_i =
\begin{cases}
0 & \text{if } |N(v_i)| < 2, \\
\sum_{u_j, u_k \in N(v_i),\; j<k} \mathbb{1}\left(u_k \in N(u_j)\right) & \text{otherwise.}
\end{cases}
\tag{5}
\]
Next, we determine τ''_i, which is the sum of τ'_j over every neighbor v_j of v_i,
plus τ'_i:
\[
\tau''_i = \tau'_i + \sum_{v_j \in N(v_i)} \tau'_j . \tag{6}
\]
The triangle centrality of node v_i is given by τ''_i, and larger is more central.
Note that the definition of τ''_i implies that every triangle of τ'_i is counted three
times, while triangles formed by two of v_i's neighbors but not v_i itself are counted
twice. Thus, τ''_i gives more weight to v_i's own triangles, a desirable
feature since at each time step only those inactive nodes directly neighboring
the active ones can possibly become influenced.
The second node ranking adopted, therefore, is triangle centrality, wherein
the assessed structural relevance of each node corresponds to its τ'' index. Note
that such a score provides different information from that of the node's
clustering coefficient (CC) [30], as the latter captures a relative measure: nodes
with the same CC may still differ hugely with respect to their absolute number of
triangles.
Hereafter, to distinguish the node rankings, we will denote degree centrality
by V^d and triangle centrality by V^t.
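The two triangle scores of Eqs. 5 and 6 can be computed directly from neighbor sets. The sketch below is illustrative only; the function names and the dictionary-of-sets graph layout are assumptions of this example.

```python
from itertools import combinations

def triangle_counts(adj):
    """First score (Eq. 5): number of triangles that have each node as a vertex."""
    # For nodes with fewer than two neighbors, combinations() yields nothing,
    # so the count is 0, as required by the first case of Eq. 5.
    return {v: sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
            for v in adj}

def triangle_centrality(adj):
    """Second score (Eq. 6): own triangle count plus the counts of all neighbors."""
    t1 = triangle_counts(adj)
    return {v: t1[v] + sum(t1[u] for u in adj[v]) for v in adj}
```

Sorting the nodes by the second score in descending order would then give the triangle-based ranking.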
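\[
\Gamma(v) =
\begin{cases}
\emptyset & \text{if } |N(v)| < \theta \,\vee\, \gamma(v) = 0, \\
\underset{w \in N(v) \cap I_0}{\arg\min}\; c\!\left(\bigcup_{i=1}^{\gamma(v)} \{w_i\}\right) & \text{otherwise.}
\end{cases}
\tag{8}
\]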
Note that Γ(v) = ∅ may mean either that v already has at least θ active
neighbors, or that d(v) < θ (and thus v cannot be node-surrounded). Otherwise,
Γ(v) ≠ ∅ and Γ(v) contains the cheapest γ(v) inactive neighbors of v. Also,
note that if all nodes in Γ(v) are seeded, then v becomes active at t = 1. Thus, for
each v ∈ I_0, if |N(v) ∩ A_0| ≥ θ, then v is said to be surrounded, since certainly
v ∈ A_1. Finally, Algorithm 1 describes the construction of the surrounding set.
The algorithm: Algorithm 1 first verifies whether the input node v is eligible
for being surrounded (line 2), returning an empty Γ if this is not the case.
If v is eligible, we then add its γ(v) cheapest inactive neighbors to Γ (lines
5-11).
Computational complexity: Prior to executing any of the algorithms, we first
need to sort each node's neighbors in ascending order of their costs. Sorting all
d(v) neighbors of a node v requires O(d(v) log d(v)). The overall computation
for all nodes is therefore O(Σ_i d(v_i) log d(v_i)), which is also O(m log m), since
Σ_i d(v_i) = 2m. Thus, pre-processing is O(m log m). Back to Alg. 1, its
complexity is dominated by iterating over γ(v) neighbors of v (lines 5-11). Since in
the worst case γ(v) = θ, Alg. 1 is O(θ).
4.3.1. Tiebreakers
Tie-breaking is an important matter during the surrounding set formation, since
nodes with identical costs often differ with respect to their neighborhoods.
Whenever the minimum cost can be achieved by more than one node, ties are
broken as follows. For V^d we choose the node with the largest sum over its
neighbors' degrees; for V^t, the largest τ' is preferred. Formally, let ψ : I × I → I
denote the function that receives a pair of inactive nodes eligible for composing
Γ(v) and returns the selected one. Then,
\[
\psi(u_1, u_2) =
\begin{cases}
u_1 & \text{if } c(u_1) < c(u_2), \\
u_2 & \text{if } c(u_2) < c(u_1), \\
\arg\max_{u \in \{u_1, u_2\}} \sum_{w \in N(u)} d(w) & \text{if } V^d, \\
\arg\max_{u \in \{u_1, u_2\}} \tau'_u & \text{if } V^t.
\end{cases}
\tag{9}
\]
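One way to realize the rule of Eq. 9 is as a sort key, so that neighbor lists can be pre-sorted by ascending cost and then by descending tiebreak score. The sketch below assumes the cost c(u) = d(u)^α from Eq. 3; the names `tiebreak_key` and `psi`, and the convention that passing no triangle table selects the degree ranking, are hypothetical choices for this example.

```python
def tiebreak_key(u, adj, alpha, tau1=None):
    """Sort key: cheaper nodes first; equal costs resolved by a larger score."""
    cost = len(adj[u]) ** alpha                      # c(u) = d(u)^alpha (Eq. 3)
    if tau1 is None:                                 # degree ranking
        score = sum(len(adj[w]) for w in adj[u])     # sum of neighbors' degrees
    else:                                            # triangle ranking
        score = tau1[u]                              # first triangle score of u
    return (cost, -score)

def psi(u1, u2, adj, alpha, tau1=None):
    """Select one of two candidate nodes following the cases of Eq. 9."""
    return min((u1, u2), key=lambda u: tiebreak_key(u, adj, alpha, tau1))
```

If both the cost and the tiebreak score coincide, this sketch simply returns its first argument; the text does not specify that case.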
4.4. The Extended Surrounding Set (ESS)
We now describe the extended surrounding set concept. The goal here is to
achieve a more effective budget usage by extending the surrounding set approach
(described in Section 4.3) to the two-hop neighborhood of each node, as follows.
Let Γ(·) be as described in Eq. 8. Now let ρ : I → I denote the function
that receives a node w ∈ I_0 and returns the cheaper set between {w} and w's
surrounding set Γ(w), as follows:
\[
\rho(w) =
\begin{cases}
\{w\} & \text{if } c(w) < c(\Gamma(w)), \\
\Gamma(w) & \text{otherwise.}
\end{cases}
\tag{10}
\]
Note that c(∅) = ∞ (as defined in Eq. 3). Also, recall that γ(v) is the number of
nodes that still need to be seeded for having v surrounded. Then, the Extended
Surrounding Set (ESS) Γ+(v) is
\[
\Gamma^{+}(v) =
\begin{cases}
\emptyset & \text{if } |N(v)| < \theta \,\vee\, \gamma(v) = 0, \\
\underset{w \in N(v) \cap I_0}{\arg\min}\; c\!\left(\bigcup_{i=1}^{\gamma(v)} \rho(w_i)\right) & \text{otherwise.}
\end{cases}
\tag{11}
\]
Note that if Γ+(v) is seeded then v is activated by t = 2 at the latest. Also, note
that c(Γ+(v)) ≤ c(Γ(v)), ∀v ∈ V. Finally, Algorithm 2 describes the construction
of the extended surrounding set (ESS).
The algorithm: Algorithm 2 verifies whether the input node v is eligible
for being surrounded (line 2), returning an empty Γ+ if this is not the case.
If v is eligible, we then create an array, named arrayΓ, of d(v) initially empty
sets, one for each u ∈ N(v) (line 3). For each neighbor u, its respective
set remains empty if u is already active. Conversely, if u ∈ I_0, then its set within
arrayΓ will be that returned by ρ(·) (Eq. 10): either Γ(u) or {u} itself (lines
4-6). The resulting arrayΓ is then sorted in ascending order of cost (line 7). Because
of the initial validation performed on v (line 2), we are guaranteed that the first γ(v)
sets of arrayΓ are not empty. Their union generates Γ+ (lines 8-11).
Computational complexity: Two steps of Alg. 2 dominate its overall
complexity, as follows. To form an ESS we first need to determine the surrounding
set (SS) of each neighbor w of v (via ρ(·), line 5). Thus, the SS computation
is performed d(v) times, at O(θ) each. After these iterations, the d(v)-size array
containing an SS for each of v's neighbors is sorted in ascending order of cost (line
7), which is O(d(v) log d(v)). Therefore, Alg. 2 is O(d(v)(θ + log d(v))).
Algorithm 1 SurroundingSet.
Require: v //{Input. v = node to be surrounded}
Require: Γ //{Output. The surrounding set Γ}
 1: Γ ← ∅
 2: if v ∉ I_0 or d(v) < θ or γ(v) = 0 then return Γ
 3: total ← 0
 4: i ← 1 //{Accesses the node N(v)_i from v's neighbors. N(v), here, is already sorted, ascending by cost and then descending by tiebreaks.}
 5: repeat
 6:     if N(v)_i ∈ I_0 then
 7:         Γ ← Γ ∪ {N(v)_i}
 8:         total ← total + 1
 9:     end if
10:     i ← i + 1
11: until total == γ(v)
12: return Γ
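A Python rendering of Algorithm 1 might look as follows. It is a sketch under stated assumptions: γ(v) is taken to be θ minus the number of already active neighbors of v (consistent with its description as the number of nodes still to be seeded), and `sorted_neighbors` is a hypothetical pre-computed table holding each node's neighbors sorted by ascending cost with ties already resolved.

```python
def surrounding_set(v, adj, active, theta, sorted_neighbors):
    """Return the surrounding set of v: its gamma(v) cheapest inactive neighbors.

    adj              : dict node -> set of neighbors (assumed layout).
    active           : set of currently seeded (active) nodes.
    sorted_neighbors : dict node -> list of neighbors, ascending by cost
                       (hypothetical pre-processing result).
    """
    if v in active or len(adj[v]) < theta:
        return set()                              # v is not eligible
    gamma = theta - len(adj[v] & active)          # assumed definition of gamma(v)
    if gamma <= 0:
        return set()                              # v is already surrounded
    chosen = set()
    for u in sorted_neighbors[v]:                 # cheapest neighbors come first
        if u not in active:
            chosen.add(u)
            if len(chosen) == gamma:
                break
    return chosen
```

Because an eligible node has at least θ neighbors, it always has at least γ(v) inactive ones, so the loop ends with a full set.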
Algorithm 2 ExtendedSurroundingSet.
Require: v //{Input. v = the node to be surrounded}
Require: Γ+ //{Output. The set Γ+ of eligible nodes (no seeding performed)}
 1: Γ+ ← ∅
 2: if v ∉ I_0 or d(v) < θ or γ(v) = 0 then return Γ+
 3: initialize(arrayΓ, d(v), ∅) //{Creates an array of |N(v)| initially-empty sets Γ.}
 4: for i = 1, ..., d(v) do
 5:     if N(v)_i ∈ I_0 then arrayΓ[i] ← ρ(N(v)_i) //{ρ(·) is as defined in Eq. 10.}
 6: end for
 7: sortAscendingByCost(arrayΓ) //{Empty sets have the highest cost, as in Eq. 3.}
 8: for i = 1, ..., γ(v) do
 9:     Γ+ ← Γ+ ∪ arrayΓ[i]
10: end for
11: return Γ+
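The following self-contained Python sketch mirrors Algorithm 2 together with Eqs. 3, 8 and 10. It is illustrative rather than definitive: the helper names are hypothetical, γ(v) is assumed to be θ minus the number of active neighbors, and ties between equal-cost nodes are resolved by node identifier instead of the tiebreak rule of Eq. 9 to keep the example short.

```python
import math

def cost(nodes, adj, alpha):
    """c(U) from Eq. 3: sum of d(v)^alpha, with infinite cost for the empty set."""
    return sum(len(adj[u]) ** alpha for u in nodes) if nodes else math.inf

def gamma(v, adj, active, theta):
    """Assumed gamma(v): seeds still missing to surround v."""
    return max(0, theta - len(adj[v] & active))

def surrounding_set(v, adj, active, theta, alpha):
    """Surrounding set (Eq. 8): the gamma(v) cheapest inactive neighbors of v."""
    g = gamma(v, adj, active, theta)
    if v in active or len(adj[v]) < theta or g == 0:
        return set()
    inactive = sorted((u for u in adj[v] if u not in active),
                      key=lambda u: (len(adj[u]) ** alpha, u))
    return set(inactive[:g])

def rho(w, adj, active, theta, alpha):
    """Eq. 10: the cheaper of seeding w itself or seeding w's surrounding set."""
    own, ss = {w}, surrounding_set(w, adj, active, theta, alpha)
    return own if cost(own, adj, alpha) < cost(ss, adj, alpha) else ss

def extended_surrounding_set(v, adj, active, theta, alpha):
    """Extended surrounding set (Eq. 11), following the steps of Algorithm 2."""
    g = gamma(v, adj, active, theta)
    if v in active or len(adj[v]) < theta or g == 0:
        return set()
    options = [rho(u, adj, active, theta, alpha)
               for u in sorted(adj[v]) if u not in active]
    options.sort(key=lambda s: cost(s, adj, alpha))   # cheapest options first
    result = set()
    for s in options[:g]:
        result |= s
    return result
```

On a hub whose neighbors each have several degree-one leaves, with θ = 2 and α = 2, the extended set consists of leaves only and is far cheaper than the direct surrounding set, which illustrates why the two-hop neighborhood can make expensive nodes reachable on a small budget.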
• Node surround (NS): This policy consists in surrounding each visited node,
skipping those for which such a task is impossible. First, it determines
the ESS Γ+(v_1) of v_1, the first node of a given node ranking, seeding it
if c(Γ+(v_1)) ≤ b. It then evaluates v_2 in the same way, trying to surround
it. If a node v cannot be surrounded (because either |N(v)| < θ or
c(Γ+(v)) > b), it is skipped. If the ranking is completely traversed before
the budget is exhausted, then it is traversed again, but this time NS
tries to seed each visited node directly. When the budget finally becomes
residual, the process stops and the current set A_0 of seeds is regarded as
complete. Two seeding strategies derive from the NS policy, one for each
node ranking: we denote by NS-D and NS-T the seeding strategies
that combine NS with V^d and V^t, respectively. Finally, Algorithm 4
describes both strategies.
Computational complexity: Before analyzing Alg. 4, we first need to
determine the complexity of seeding a given number of nodes, as described in
Alg. 3. The |C| iterations over the set of candidates (line 2) dominate the
overall complexity. Thus, Alg. 3 is O(|C|). Back to Alg. 4, its complexity
is dominated, in the worst case, by visiting all nodes in V (lines 4-11), in
order to have their ESS Γ+ determined (line 7) and seeded (line 8). Note
that the max-size Γ+ occurs when it is formed by θ surrounding sets of size
θ each. Therefore, Alg. 3 in the worst case will try to seed |Γ+| = θ^2 nodes.
Thus, the main complexity of Alg. 4 arises from n formations of an ESS (each
O(d(v)(θ + log d(v)))) summed over |C| trials of seeding θ^2 nodes. Thus,
Alg. 4 is O(n(d(v)(θ + log d(v)) + θ^2)). Assuming θ ≤ log n, and observing
that d(v) < n, Alg. 4 is O(n^2 θ).
• Cheapest nodes (CH): This simple policy is combined only with the
degree ranking V^d, but in ascending order, thus starting from the network's
smallest degree. Each visited node v_i is seeded directly. Clearly,
a single ranking traversal suffices to determine A_0. Note that any seed
set formed here will have the largest possible size for a budget b. The
moment the budget b is no longer effective, the process stops and the current
set A_0 of seeds is regarded as complete. Because of its simplicity, we have
omitted its algorithm. Finally, since this combination is unique, we will also
refer to the seeding strategy formed by combining CH with V^d simply as CH.
Computational complexity: In the worst case, b is large enough for the
strategy to seed the entire network. Since CH tries to seed one node at a
time, Alg. 3 here is O(1), hence CH is O(n).
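Since the algorithm for CH is omitted, the policy in the last bullet can be sketched as below. This is only an interpretation: it assumes the cost c(v) = d(v)^α from Eq. 3 with α ≥ 0, reads "the budget is no longer effective" as "the next node in the ranking is unaffordable", and breaks degree ties by node identifier; the function name is hypothetical.

```python
def cheapest_first(adj, budget, alpha):
    """Seed nodes in ascending degree order until the next one is unaffordable.

    Returns the seed set and the leftover budget.
    """
    seeds = set()
    for v in sorted(adj, key=lambda u: (len(adj[u]), u)):
        c = len(adj[v]) ** alpha           # assumed node cost d(v)^alpha
        if c > budget:
            break   # with alpha >= 0, every later node costs at least as much
        seeds.add(v)
        budget -= c
    return seeds, budget
```

Because costs are non-decreasing along this ranking, stopping at the first unaffordable node loses nothing: no later node could be afforded either.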
Algorithm 3 tryToSeed.
Require: C ⊆ I_0 //{Input. The set C of candidates for being seeded.}
Require: I_0, A_0, b //{Output. Updated I_0, A_0 and b}
1: if b ≥ c(C) then
2:     for all v ∈ C do
3:         I_0 ← I_0 \ {v}
4:         A_0 ← A_0 ∪ {v}
5:         b ← b − c(v)
6:     end for
7: end if
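In Python, Algorithm 3 could be sketched as follows. The signature is an assumption of this example: the sets are updated in place, the per-node cost is passed as a callable `cost_of`, and the updated budget is returned.

```python
def try_to_seed(candidates, inactive, active, budget, cost_of):
    """Seed all candidates at once if the budget covers their total cost.

    candidates : set of inactive nodes to be seeded together (the set C).
    inactive   : set of inactive nodes, updated in place.
    active     : set of seeded nodes, updated in place.
    cost_of    : callable returning the cost of a single node (assumed).
    Returns the remaining budget.
    """
    total = sum(cost_of(v) for v in candidates)
    # An empty candidate set is never seeded, mirroring its infinite cost (Eq. 3).
    if candidates and budget >= total:
        inactive -= candidates
        active |= candidates
        budget -= total
    return budget
```

Seeding is all-or-nothing: if the candidates do not fit in the budget together, none of them is seeded and the budget is left untouched.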
for being surrounded. Intuitively, if nodes with better centrality/cost-to-surround
ratios were always surrounded first, the budget usage would be
more effective, in the sense of allowing further seeding, with potentially larger
diffusion.
We thus introduce cost-weighted rankings, defined as follows. For each node
v, its ranking score (either d(v) for V^d or τ'' for V^t) is divided by the cost of
having v surrounded (i.e., d(v)/c(Γ+(v)) or τ''/c(Γ+(v))). Such a division yields a
second score λ_v ∈ R+ for v, upon which V is sorted, yielding a new ranking V_Γ.
For every v such that Γ+(v) = ∅ (nodes that cannot be surrounded), its original
score is divided by a constant M = θ · c(d_L), where d_L is the network's largest
degree and c(d_L) is, consequently, the highest cost. Note that this makes all
such nodes low-ranked but still numerically comparable. Henceforward,
cost-weighted rankings will be referred to as V_Γ^d and V_Γ^t for degree-based
and triangle-based rankings, respectively.
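The cost-weighted score can be illustrated with the short sketch below. It assumes that the centrality scores and the cost of each node's extended surrounding set have already been computed (with an infinite cost standing for an empty set, as in Eq. 3) and that c(d_L) equals d_L^α; the function and parameter names are hypothetical.

```python
import math

def cost_weighted_ranking(score, ess_cost, theta, max_degree, alpha):
    """Rank nodes by centrality divided by the cost of surrounding them.

    score      : dict node -> centrality score (degree or triangle centrality).
    ess_cost   : dict node -> cost of its extended surrounding set,
                 math.inf when the node cannot be surrounded.
    max_degree : largest degree in the network (d_L).
    Returns the ranking (best first) and the cost-weighted scores.
    """
    m = theta * max_degree ** alpha            # constant M = theta * c(d_L)
    lam = {}
    for v, s in score.items():
        c = ess_cost[v]
        lam[v] = s / (m if math.isinf(c) else c)
    ranking = sorted(lam, key=lam.get, reverse=True)
    return ranking, lam
```

Dividing by M instead of an infinite cost keeps the scores of nodes that cannot be surrounded finite, so they sink in the ranking without dropping out of it.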
5. Evaluation
We start by presenting the performance metrics used. We then describe the
different networks and parameters used, followed by numerical evaluations and
main findings.
\[
\Delta(A_0) = \frac{\sigma(A_0)}{n}, \tag{12}
\]
where σ(A_0) = |A_q| − |A_0| denotes A_0's Outward Influence (OI) [32], as
defined in Eq. 4. Note that DP allows comparisons between networks of
different sizes.
• Average Diffusion Power (ADP) over thresholds. Formally, consider a
network G = (V, E) and let θ_m and θ_M denote, respectively, the
minimum and maximum values of θ in the range of thresholds applied to
G. Consider θ_m = 2 for all networks, with θ_M being network-specific, as
indicated in Table 2. Now let A_{0,θ} denote a strategy's seed set for a given
θ. Thus, the average diffusion power Δ̄ is given by:
\[
\overline{\Delta} = \frac{\sum_{\theta=2}^{\theta_M} \Delta(A_{0,\theta})}{\theta_M - 1}. \tag{13}
\]
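Both metrics in the list above reduce to a few arithmetic operations, as the following sketch of Eqs. 12 and 13 shows. The function names and the `dp_by_theta` dictionary are assumptions of this example.

```python
def diffusion_power(final_active, seeds, n):
    """DP (Eq. 12): activated non-seeds as a fraction of the n network nodes."""
    return (len(final_active) - len(seeds)) / n

def average_diffusion_power(dp_by_theta, theta_max):
    """ADP (Eq. 13): mean DP over the thresholds 2, 3, ..., theta_max.

    dp_by_theta : dict mapping each threshold to the DP obtained with it
                  (assumed to hold one entry per threshold in the range).
    """
    total = sum(dp_by_theta[t] for t in range(2, theta_max + 1))
    return total / (theta_max - 1)
```

A strategy that seeds half of a network and activates nobody else has DP equal to 0, even though half of the nodes end up active; this is the distortion that DP is meant to remove.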
5.2. Budget Model
As stated in Section 3.3, a fixed budget must be defined as part of the
problem's input. Thus, strictly for evaluation purposes, we propose a budget
model that ties the value of b to both the network structure and the cost
function. This favours the comparison between different networks: if we were
to give the same amount b to all networks instead, those with larger average
degree, and consequently higher average cost, would likely form smaller seed
sets. Our budget model therefore aims at balancing the relationship between
network connectedness and seeding capability.
The idea for the budget is therefore to cover the cost of a given number of
virtual nodes whose degree matches the average. Thus, let d̄ = (Σ_{v∈V} d(v))/n
denote the network's average degree, and let 0 < k ≤ 1 denote a fixed fraction.
Then,
Table 2: Network datasets and some basic statistics.
Network   |V|      |E|       DA      %LCC    d      dM     θM
astro     18772    198050    0.21    95.4%   21.10  504    15
cmat      23133    93439     0.13    92.3%   8.08   281    10
Fb        4039     88234     0.064   100%    43.69  1045   13
Bk        58228    214078    0.011   97.4%   7.35   1134   20
dblp      317080   1049867   0.27    100%    6.62   343    11
enron     36692    183832    -0.11   91.8%   10.02  1383   35
hep       12008    118489    0.63    93.3%   19.73  491    15
Because of the model’s deterministic nature, a single run per setup is enough.
Nevertheless, the various possible parameter combinations lead to a broad range
of scenarios and regimes. Although we have extensively studied many different
setups, only a small fraction of our results is presented here, in order to better
illustrate and highlight our main findings.
Figure 1: Fraction of activated nodes per threshold on the Fb network for (a) cost-unaware
strategies (central nodes first), (b) cost-awareness through the centrality/cost ratio (best
centrality/cost first), and (c) cost-awareness by surrounding the central nodes (central nodes
surrounded). α = 2.0 and k = 0.0005.
The comparison of Figures 1(a) and 1(b) illustrates the importance of
cost-awareness for BIM. Nevertheless, the strategies' performance can still be
improved significantly: Figure 1(c) shows that broad diffusion is still possible
for considerably larger values of θ when adopting the NS strategy over either
V^d or V^t. Indeed, θ = 7 is the maximum threshold at which DS-Dc still
induces meaningful propagation, activating almost 40% of the network (Fig. 1(b)),
whereas NS-D manages to influence more than 40% even when θ = 9 (Fig. 1(c)).
As for DS-Tc, note that its spreading is hindered whenever θ > 6 (Fig. 1(b)),
whereas NS-T still manages to influence 20% of the network when θ = 10
(Fig. 1(c)).
The cost-weighted ranking V_Γ (Section 4.6) may substantially increase the
resilience of the different strategies to growing values of θ, i.e., it may increase
the maximum θ for which strategies still manage to trigger broad diffusion; it is
therefore adopted in many scenarios. Figure 2 shows the fraction of activated nodes
per threshold for diffusion simulated over the network cmat, when k = 0.0005.
Costs are based on α ∈ {0.5, 2.0}, and hence node degree affects the cost
either slightly (α = 0.5) or enormously (α = 2.0). Note, by comparing Figures
2(a) and 2(b) (both with α = 2.0), that the use of V_Γ (Figure 2(b)) led both
NS-based strategies to induce propagation for greater values of θ.
Interestingly, our results also demonstrate that the use of V_Γ when α ≤ 1
does not necessarily improve the strategies' performance, as shown in Figures 2(c)
and 2(d). Note that V_Γ^t diminishes NS-T's resilience to θ, as this strategy
no longer manages to induce propagation from θ = 5 on. Figure 2 therefore
illustrates a behavior discussed in Section 3.2: when building the seed set,
node costs may become a critical (resp. negligible) concern as the gaps between
costs widen (resp. shrink).
Figures 1 and 2 have shown the fraction of activated (influenced) nodes at
Figure 2: Comparison of fraction of activated nodes per threshold on the cmat network when
strategies are given regular rankings (left) and cost-weighted rankings (right), under quadratic
(top) and sub-linear (bottom) cost regimes. Panels: (a) α = 2.0; V^d and V^t. (b) α = 2.0;
V_Γ^d and V_Γ^t.
Figure 3: Fraction of activated nodes ((a) Classical) and Diffusion Power ((b) DP) per threshold
for a diffusion simulated on the hep network (α = 2.0 and k = 0.0005).
quiescence time. When considering BIM, however, this metric may be far
from capturing the real effectiveness of a seeding strategy. Indeed, since the
initial budget can be arbitrarily large, a huge seed set may be formed and yet
trigger no propagation. From the perspective of viral marketing
campaigns, wherein the real benefit (profit) comes from persuasion, hiring
many and convincing none is the worst possible scenario. Thus, to assess
the seeding strategies' performance more properly, we capture their relative
Outward Influence, which yields what we call diffusion power (DP): a metric
that captures the fraction of non-seeds activated at quiescence time (Eq. 12),
hence denoting how diffusive the seeds are irrespective of the network size.
Figure 3 compares these two measures of influence (classical and
DP), and illustrates how the understanding of each strategy's performance
varies according to the metric. Note, for instance, that the classical approach
(Fig. 3(a)) indicates that for any θ > 10 the CH strategy performs better than
the others, and this perception is intensified when θ = 15, wherein CH
seemingly influences around 15% more nodes than any other strategy. A completely
different viewpoint is captured by DP (Fig. 3(b)), whereby it becomes evident
that the CH seeds are no longer diffusive from as early as θ = 7.
5.6.1. GR
Improved Greedy is the seminal strategy for determining seeds for BIM; a
cost-normalized greedy algorithm, proposed by Nguyen and Zheng in [31] and
referred to in this text as GR. It considers an influence function σ(·) such that,
given S ⊆ V , σ(S) = E[|Aq |] when A0 = S. Thus, σ(S ∪ {v}) − σ(S) expresses
the marginal gain of a node v ∈ V \S. The seed set is then built upon a two-stage
process, briefly described as follows.
First, at each round the node with the largest cost-normalized marginal gain,
i.e., the node v ∈ V\S with the largest (σ(S ∪ {v}) − σ(S))/c(v), is added to the
candidate seed set S if the remaining budget b covers its cost. This step is
repeated until b is exhausted. Then, the estimated total influence σ(S) of the
candidates is compared with the influence σ(u) of the most influential node u
in the network: if σ(S) ≥ σ(u), then A_0 = S; otherwise, A_0 = {u}.
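The two-stage process just described can be sketched in Python as follows. This is a simplified reading of the description above, not the implementation of [31]: the influence function `sigma` is any callable supplied by the caller (the experiments in the paper estimate it under IC), a candidate that cannot be afforded is assumed to be discarded, and the second stage is assumed to apply only when the single best node fits in the budget. All names are hypothetical.

```python
def cost_normalized_greedy(nodes, sigma, cost, budget):
    """Two-stage cost-normalized greedy seed selection (illustrative sketch).

    nodes  : iterable of candidate nodes.
    sigma  : callable mapping a set of nodes to its (estimated) influence.
    cost   : callable mapping a node to its positive seeding cost.
    budget : total budget b.
    """
    nodes = sorted(nodes)                      # fixed order for reproducible ties
    remaining = budget
    chosen = set()
    candidates = list(nodes)
    # Stage 1: repeatedly pick the best marginal-gain/cost ratio.
    while candidates:
        base = sigma(chosen)
        best = max(candidates,
                   key=lambda v: (sigma(chosen | {v}) - base) / cost(v))
        candidates.remove(best)                # assumed: unaffordable picks are dropped
        if cost(best) <= remaining:
            chosen.add(best)
            remaining -= cost(best)
    # Stage 2: compare against the single most influential node.
    top = max(nodes, key=lambda v: sigma({v}))
    if sigma({top}) > sigma(chosen) and cost(top) <= budget:
        return {top}
    return chosen
```

With a toy coverage-style `sigma`, a small budget keeps the stage-one set even when a single expensive node would be more influential, because that node cannot be afforded.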
Note that this algorithm assumes that the initial budget is always at least
as large as the cost of the most expensive node in the network. As discussed
in Section 2, we do not make such an assumption in this paper. Consequently,
in our experiments the second stage of GR is not guaranteed to occur when
σ(u) > σ(A_0), since u may cost more than the budget.
GR was designed under the Independent Cascades (IC) model [17], which
allows the influence σ(·) of each node to be estimated independently. It thus
cannot be readily applied to the LT model. Therefore, in order to determine
the seed set for GR, we run it under the same scheme of [31], namely, IC
with identical activation probabilities p = 0.1 across all edges, and σ(·)
estimation via CELF [21] with 10000 runs per node. We have chosen CELF because
Nguyen and Zheng report it as the estimator with which GR yields the best
solution quality among the techniques employed in their evaluation section. Once
determined, this seed set is then used under the LT model as considered in this
paper.
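As an illustration of what estimating σ(·) under IC involves, the sketch below averages the final cascade size over repeated simulations with a uniform activation probability p. It is plain Monte Carlo averaging under assumed names (`simulate_ic`, `estimate_sigma`, an adjacency dictionary `adj`); it does not reproduce CELF's lazy evaluation or the code used in [31].

```python
import random
from typing import Dict, Hashable, Iterable, Optional, Set

Node = Hashable
Graph = Dict[Node, Iterable[Node]]  # node -> out-neighbors (assumed layout)


def simulate_ic(adj: Graph, seeds: Set[Node], p: float,
                rng: random.Random) -> int:
    """Run one Independent Cascades realization; return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets a single chance to activate
            # each inactive neighbor, succeeding with probability p.
            for v in adj.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_sigma(adj: Graph, seeds: Set[Node], p: float = 0.1,
                   runs: int = 10000, seed: Optional[int] = None) -> float:
    """Estimate sigma(seeds) as the mean cascade size over `runs` simulations."""
    rng = random.Random(seed)
    total = sum(simulate_ic(adj, seeds, p, rng) for _ in range(runs))
    return total / runs
```

The defaults p = 0.1 and 10000 runs mirror the values quoted above; the `seed` argument is only there to make a run reproducible.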
5.6.2. BCT
Although BCT was designed for the LT model, two adjustments are still necessary
before employing it in our context. These relate to (i) the problem BCT
must tackle (BIM instead of CTVM), and (ii) the node thresholds we need BCT
to consider when estimating each node’s influence. Both items are
straightforwardly addressed as follows. Towards (i), we simply follow the lines
discussed in Section 2, thus equalizing the benefit b(·) of every node. More
specifically, ∀v ∈ V we set b(v) = 1, hence making BCT aim at the BIM problem
rather than CTVM.
Regarding (ii), to explain how we cope with the different threshold values in
BCT’s original model (node-specific) and in ours (a single value θ across the
entire network), we first briefly describe BCT’s original LT model. For each
node v ∈ V , let λv denote v’s (random) threshold, uniformly drawn from the
real interval [0, 1]. Also, let w : V × V → R+ give the weight w(u, v)
associated with every edge (u, v), u ∈ N (v). Time evolves discretely, such
that t ∈ {1, 2, 3, ...}. Then, v gets activated at t if
$\sum_{u \in N(v) \cap A_{t-1}} w(u, v) \geq \lambda_v$.
As usual, the spreading is assumed to be progressive.
Note that such a model allows us to identically reproduce the deterministic
LT spreading behaviour. We simply need to define each edge weight conveniently.
Indeed, for each node v ∈ V , we assign the weight λv /θ to all of its
edges, thus accomplishing this goal. Note that
$\sum_{u \in N(v)} w(u, v) > 1$ becomes
a possible condition in this case, and may in fact hold for many nodes. Yet,
Mossel and Roch proved in [28] that even such instances of the LT model yield a
submodular objective function σ(·). This in turn means that we are preserving
all of BCT’s original model fundamentals, while capturing the desired single-
threshold behaviour. For the remaining parameters, we have adopted the same
values set in [33]. We then determine BCT’s seed set by running the above
setup over the network of interest. Finally, we evaluate its seeds under the
deterministic LT model (which behaves exactly as the adaptation above).
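The equivalence claimed above follows from a one-line argument: with weight λv/θ on every edge of v, k active neighbors contribute kλv/θ, which reaches λv exactly when k ≥ θ (for λv > 0). The sketch below checks this on a toy graph. The function names, the adjacency-set layout, and the use of exact fractions (to avoid floating-point rounding at the threshold) are assumptions made for illustration; this is not BCT itself.

```python
from fractions import Fraction
from typing import Dict, Hashable, Set

Node = Hashable
Graph = Dict[Node, Set[Node]]  # undirected graph as symmetric adjacency sets


def spread_fixed_threshold(adj: Graph, seeds: Set[Node], theta: int) -> Set[Node]:
    """Deterministic LT: v activates once at least theta neighbors are active."""
    active = set(seeds)
    while True:
        # Synchronous step: decisions at time t use the active set at t-1.
        new = {v for v in adj
               if v not in active and len(adj[v] & active) >= theta}
        if not new:
            return active
        active |= new  # progressive: active nodes never deactivate


def spread_weighted_lt(adj: Graph, seeds: Set[Node], theta: int,
                       lam: Dict[Node, Fraction]) -> Set[Node]:
    """Weighted LT where every edge into v carries weight lam[v] / theta."""
    active = set(seeds)
    while True:
        new = set()
        for v in adj:
            if v in active:
                continue
            w = lam[v] / theta  # same weight on all edges of v
            total = sum((w for _ in adj[v] & active), Fraction(0))
            if total >= lam[v]:  # assumes lam[v] > 0
                new.add(v)
        if not new:
            return active
        active |= new
```

For any positive thresholds `lam`, both functions return the same final active set, which is why the seeds selected under the weighted model can be evaluated under the fixed-threshold one.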
Figure 4 compares the two NS-based strategies with both GR [31] and
BCT [33] on three networks under different node costs and budgets. Note that
when considering the fraction of activated nodes, CH and BCT are slightly su-
perior to NS strategies for small threshold values. This is related to the fact
that these strategies often generate larger seed sets. However, as θ increases, NS
strategies become superior in terms of inducing broader propagation. Indeed,
this is confirmed when considering the Diffusion Power (Figure 4(d-f)) which
shows that for almost all thresholds, NS seeds are at least as diffusive as all
others.
Figures 4(b) and 4(e) illustrate how the network structure may be reflected in
the strategies’ outcomes. Indeed, hep is considerably more assortative than any
other network studied. This means that the neighbors of expensive nodes tend to
be expensive as well, which likely explains why NS performs equivalently to GR.
Results similar to those of Figure 4(d), where NS diffusion considerably
outperforms both GR and BCT for various threshold values, were also obtained
in many other scenarios, whereas the opposite was not observed.
Figure 4: Fraction of Activated Nodes (top, panels a-c) and Diffusion Power
(bottom, panels d-f) as a function of threshold value (θ) for the different
strategies, including the Improved Greedy (GR) [31] and BCT [33]. Panels:
(a, d) astro, b = 4000, α = 2.0; (b, e) hep, b = 2000, α = 0.5;
(c, f) Fb, b = 2000, α = 1.0.
Figure 5: Average Diffusion Power of the different strategies across all
evaluated networks under two different budgets and node cost parameters.
Panels: (a) k = 0.0005, α = 1.0; (b) k = 0.0005, α = 2.0.
6. Conclusion
This work focuses on the BIM problem, and proposes an efficient strategy
to tackle it. We have considered progressive diffusion upon real networks under
the Linear Threshold model with fixed thresholds. The cost of a node is propor-
tional to how central (important) the node is, and degree centrality is used as a
proxy for node importance. We have proposed the NS seeding policy, evaluating
its efficiency under different sequences (according to some node ranking) and
comparing it with baseline and state-of-the-art methods. Each strategy was
given the same initial budget and a cost function to perform the seeding. We
have focused on scenarios where the initial budget is relatively scarce, with cost
gaps between nodes ranging up to orders of magnitude, as it better relates to
real cases.
We have proposed a novel approach to further leverage cost-effectiveness in
BIM problems, namely the surrounding sets. Our main results indicate that ESS
generally helps seeding strategies yield broad influence spreading over
a wider range of network thresholds when compared with state-of-the-art
methods. Finally, we have demonstrated that, unlike in unit-cost IM, different
strategies in BIM may yield seed sets of very different sizes even when given
the same initial budget. In such scenarios, we showed that the classical
Fraction of Activated Nodes may lead to wrong conclusions about which strategy
is best. Conversely, by considering only their diffusion power, we capture
their real benefit (activated non-seeds) relative to an investment (budget). We
showed how this approach may completely change the understanding of a
strategy’s effectiveness.
References
[1] Aral, S., Muchnik, L., Sundararajan, A., 2013. Engineering social conta-
gions: Optimal network seeding in the presence of homophily. Network
Science 1, 125–153.
[2] Arthur, D., Motwani, R., Sharma, A., Xu, Y., 2009. Pricing strategies for
viral marketing on social networks. In: Internet and Network Economics.
Vol. 5929 of Lecture Notes in Computer Science. pp. 101–112.
[3] Backstrom, L., Huttenlocher, D. P., Kleinberg, J. M., Lan, X., 2006. Group
formation in large social networks: membership, growth, and evolution. In:
ACM International Conference on Knowledge Discovery and Data Mining
(SIGKDD). pp. 44–54.
[4] Bae, J., Kim, S., Feb. 2014. Identifying and ranking influential spread-
ers in complex networks by neighborhood coreness. Physica A: Statistical
Mechanics and its Applications 395.
[5] Boguñá, M., Pastor-Satorras, R., Vespignani, A., Jan 2003. Absence of
epidemic threshold in scale-free networks with degree correlations. Phys.
Rev. Lett. 90, 028701.
[6] Brown, K., Jan. 2016. Here’s how much celebrities make in the instagram
product placement machine. Jezebel.
[7] Chen, D.-B., Xiao, R., Zeng, A., Zhang, Y.-C., Jan. 2014. Path diversity
improves the identification of influential spreaders. Europhysics letters 104.
[8] Chen, W., Wang, Y., Yang, S., 2009. Efficient influence maximization in so-
cial networks. In: ACM International Conference on Knowledge Discovery
and Data Mining (SIGKDD). pp. 199–208.
[9] Chen, W., Yuan, Y., Zhang, L., 2010. Scalable influence maximization in
social networks under the linear threshold model. In: IEEE International
Conference on Data Mining (ICDM). IEEE, pp. 88–97.
[10] de Souza, R. C., Figueiredo, D. R., de A. Rocha, A. A., Ziviani, A., 2014.
Evaluation of epidemic seeding strategies under variable node costs. In:
SBC Workshop em Desempenho de Sistemas Computacionais e de Comu-
nicação. WPerformance ’14.
[11] Gong, M., Yan, J., Shen, B., Ma, L., Cai, Q., 2016. Influence maximization
in social networks based on discrete particle swarm optimization. Inf. Sci.
367-368, 600–614.
[12] Granovetter, M., May 1978. Threshold models of collective behavior. Amer-
ican Journal of Sociology 83, 489–515.
[13] Guille, A., Hacid, H., Favre, C., Zighed, D. A., 2013. Information diffusion
in online social networks: A survey. ACM Sigmod Record 42 (2), 17–28.
[14] Han, S., Zhuang, F., He, Q., Shi, Z., 2014. Balanced seed selection for
budgeted influence maximization in social networks. In: Pacific-Asia Con-
ference on Knowledge Discovery and Data Mining. pp. 65–77.
[15] He, X., Song, G., Chen, W., Jiang, Q., 2012. Influence blocking maximiza-
tion in social networks under the competitive linear threshold model. In:
SIAM International Conference on Data Mining (SDM). pp. 463–474.
[16] Hinz, O., Skiera, B., Barrot, C., Becker, J. U., Nov. 2011. Seeding strategies
for viral marketing: An empirical comparison. Journal of Marketing 75, 55–
71.
[17] Kempe, D., Kleinberg, J., Tardos, E., 2003. Maximizing the spread of in-
fluence through a social network. In: ACM International Conference on
Knowledge Discovery and Data Mining (SIGKDD). pp. 137–146.
[18] Kitsak, M., Gallos, L. K., Havlin, S., Liljeros, F., Muchnik, L., Stanley,
H. E., Makse, H. A., 2010. Identification of influential spreaders in complex
networks. Nature Physics 6, 888–893.
[19] Kleinberg, J., Sep. 2007. Cascading behavior in networks: Algorithmic and
economic issues. In: Algorithmic Game Theory. pp. 613–632.
[20] Kornowski, L., May 2013. Celebrity sponsored tweets: What the stars get
paid for advertising in 140 characters. The Huffington Post.
[21] Leskovec, J., Adamic, L. A., Huberman, B. A., May 2007. The dynamics of
viral marketing. ACM Trans. Web 1 (1).
[22] Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J. M.,
Glance, N. S., 2007. Cost-effective outbreak detection in networks. In:
ACM International Conference on Knowledge Discovery and Data Mining
(SIGKDD). pp. 420–429.
[23] Leskovec, J., Krevl, A., 2014. SNAP Datasets: Stanford large network
dataset collection. http://snap.stanford.edu/data.
[24] Li, X., Smith, J. D., Dinh, T. N., Thai, M. T., 2019. Tiptop: (almost) exact
solutions for influence maximization in billion-scale networks. IEEE/ACM
Trans. Netw. 27 (2), 649–661.
[25] Li, Y., Fan, J., Wang, Y., Tan, K., 2018. Influence maximization on social
graphs: A survey. IEEE Trans. Knowl. Data Eng. 30 (10), 1852–1872.
[26] Liu, Y., Wei, B., Du, Y., Xiao, F., Deng, Y., May 2016. Identifying
influential spreaders by weight degree centrality in complex networks. Chaos,
Solitons and Fractals 86.
[27] Miyauchi, A., Iwamasa, Y., Fukunaga, T., Kakimura, N., 2015. Thresh-
old influence model for allocating advertising budgets. In: International
Conference on Machine Learning (ICML). pp. 1395–1404.
[28] Mossel, E., Roch, S., 2010. Submodularity of influence in social networks:
From local to global. SIAM J. Comput. 39 (6), 2176–2188.
[29] Newman, M. E. J., Oct 2002. Assortative mixing in networks. Phys. Rev.
Lett. 89, 208701.
[30] Newman, M. E. J., 2010. Networks: An Introduction.
[31] Nguyen, H., Zheng, R., 2013. On budgeted influence maximization in social
networks. IEEE Journal on Selected Areas in Communications 31 (6), 1084–
1094.
[32] Nguyen, H. T., Nguyen, T. P., Vu, T. N., Dinh, T. N., 2017. Outward
influence and cascade size estimation in billion-scale networks. In: SIG-
METRICS (Abstracts). ACM, p. 63.
[33] Nguyen, H. T., Thai, M. T., Dinh, T. N., 2017. A billion-scale approxi-
mation algorithm for maximizing benefit in viral marketing. IEEE/ACM
Trans. Netw. 25 (4), 2419–2429.
[34] Perlberg, S., Jun. 2016. Facebook signs deals with media companies,
celebrities for facebook live. The Wall Street Journal.
[35] Seidman, S. B., Sep. 1983. Network structure and minimum degree. Social
Networks 5 (3).
[36] Socievole, A., Rango, F. D., Scoglio, C., Mieghem, P. V., 2016. Assessing
network robustness under SIS epidemics: The relationship between epidemic
threshold and viral conductance. Computer Networks 103, 196–206.
[37] Tang, Y., Shi, Y., Xiao, X., 2015. Influence maximization in near-linear
time: A martingale approach. In: SIGMOD Conference. ACM, pp. 1539–
1554.
[38] Tang, Y., Xiao, X., Shi, Y., 2014. Influence maximization: near-optimal
time complexity meets practical efficiency. In: SIGMOD Conference. ACM,
pp. 75–86.
[39] Taxidou, I., Fischer, P. M., 2014. Online analysis of information diffusion
in twitter. In: International Conference on World Wide Web (WWW). pp.
1313–1318.
[40] Ugander, J., Backstrom, L., Marlow, C., Kleinberg, J., 2012. Structural
diversity in social contagion. Proceedings of the National Academy of Sci-
ences 109 (16), 5962–5966.
[41] Valente, T. W., 1996. Social network thresholds in the diffusion of innova-
tions. Social networks 18 (1), 69–89.
[42] Wang, S., Wang, F., Chen, Y., Liu, C., Li, Z., Zhang, X., 2015. Exploit-
ing social circle broadness for influential spreaders identification in social
networks. World Wide Web 18 (3), 681–705.
[43] Xu, W., Liang, W., Lin, X., Yu, J. X., 2016. Finding top-k influential users
in social networks under the structural diversity model. Inf. Sci. 355-356,
110–126.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
Ronald C. de Souza received an MSc degree in Computer Science from
the Universidade Federal Fluminense (UFF), Niterói, Brazil, in 2016, and is
currently pursuing his PhD degree in Computing & Systems Engineering at
the Universidade Federal do Rio de Janeiro (UFRJ), Brazil, advised by Prof.
Daniel R. Figueiredo. His research interests include Network Science and
Computer Networks.
Systems Engineering (PESC/COPPE) from the Federal University of Rio de
Janeiro (UFRJ), Brazil, in 2003 and 2010, respectively. During his PhD, in
2008-2009, he was a visiting student in Computer Science at the University of
Massachusetts-Amherst (UMass). In 2010, he worked as a post-doc researcher
at UFRJ, supported by INCT WebScience. Recently, he returned for a
sabbatical as a visiting professor at the University of Massachusetts-Amherst.
He has been awarded the Research Productivity Fellowship granted by CNPq and
the Young Scientist of Rio de Janeiro by FAPERJ. His areas of interest include
performance evaluation, traffic engineering, network measurement, next
generation Internet, network science and security systems. Dr. Antonio Rocha
has published many papers in important journals and conferences, and some of
those works received awards, such as Best Papers in ACM/CoNEXT, SBC/SBRC
and SBC/WPerformance; his thesis was nominated among the top-6 PhD theses by
the Brazilian Computer Society in 2012.