Abstract-Social influence plays an essential role in spreading information within online social networks, and it can be modeled or measured by analyzing various social networking data, such as published content, users' attributes, or interactions among users. Because of the massive volume of social data, researchers often fail to quantify user influence accurately and efficiently. Big data techniques can be adopted to alleviate this problem. In this paper, we introduce a classical individual influence algorithm and implement two parallel versions of it on different big data processing frameworks. Experimental results on a large-scale real dataset demonstrate that big data processing frameworks can significantly improve the computational efficiency of the influence algorithm on massive data sets.

Keywords-influence measurement; performance analysis; big data technique; distributed computing; social networks

I. INTRODUCTION

Social network platforms, such as Sina Weibo¹ and Facebook², have gradually become mainstream network applications and changed the ways people live and communicate. Users can publish, retweet, or share messages, and interact with others in social networks, thereby affecting the spread of information. In general, how far a message published in a social network spreads depends heavily on the interactions between users, who have sophisticated social relationships. Social influence is the intrinsic incentive hidden under users' interactions, and interactions among users are the external manifestation of social influence; social influence therefore has a direct impact on the dissemination of information in social networks.

Social influence can be conducive to public opinion guidance and social operation [1], and has a wide range of applications such as social recommendation [2], expertise location [3], influence maximization [4], and viral marketing [5]. The main objective is to quantitatively compute the influence of users and discover the influentials in social networks, who may be referred to as opinion leaders [6] or domain experts [3] in various fields. The influentials play a vital role in the adoption of innovation, the formation of network groups, and the dissemination and guidance of information. Due to the limitations of theoretical models and experimental methods, early work could only analyze user influence qualitatively on small sample datasets and verify the existence of influence in the social system. In recent years, however, online social networks have provided a wealth of experimental data, making it possible for researchers to model and quantify user influence.

The existing literature mainly quantifies social influence in terms of network structure, user behavior, and interaction [7]-[9]. In practice, the graph structure built from social networking data is rather complicated: it usually contains billions of users, hundreds of billions of edges formed by the relationships between users, and the huge amount of information they generate. As of June 2017, Sina Weibo had reached 365 million active users³. This poses new challenges for the calculation of user influence, making it difficult to measure the influence of users efficiently at such an ultra-large scale. However, the advent of different categories of big data processing frameworks makes it possible to analyze these massive amounts of data efficiently. First, the user influence measurement model is analyzed and verified on a small sample data set or subgraph structure. Then, the model is implemented in parallel on a specific big data processing framework. Finally, the parallel individual influence algorithms are deployed in a cluster environment to efficiently compute the influence of

¹ http://weibo.com/
² https://www.facebook.com/
³ http://www.questmobile.com.cn/blog/blog_98.html
the I/O of disk and network to increase the efficiency of big data processing. The Resilient Distributed Dataset (RDD) is the core abstraction, representing partitioned, immutable data sets that can be manipulated in parallel. An RDD is an abstraction of both computation and data, and provides two types of operators: transformation operators and action operators. A transformation operator converts one or more RDDs into a new RDD, while an action operator produces the final result of the computation from the generated RDDs.

We parallelize (1) with the Spark parallel computing framework and run the code in YARN mode. The pseudocode is shown in Algorithm 2. Like Algorithm 1, Algorithm 2 is iterative, and each iteration performs the same operations: the social network graph structure and other data are transformed into RDDs; the flatMap() operator spreads the influence of users; the reduceByKey(add) operator sums all the influence components; and the map() operator updates the current user influence value according to (1).

ALGORITHM 2. SPARK-BASED USER INFLUENCE ALGORITHM
Input: A weighted social network graph G = {V, E, W}.
Output: User influence vector P.
Calculate transition probability matrix M = {m_uv}.
Initialize N = |V|, P = 1_N, regularization factor α.
RDD(V, E, M) = SparkContext(G, M).SparkOperator.
repeat:
    foreach v ∈ V do
        foreach (v, u) ∈ E do
            Calculate influence propagation component RDD(v, P_{u→v}):
                RDD(V, E, M).flatMap(lambda: P_{u→v} = m_uv × P(u)).
            Update user influence RDD(v, P_v):
                RDD(v, P_{u→v}).reduceByKey(add).mapValues(lambda: P(v) = α × P(v) + (1 − α)/N).
        end
    end
until convergence
foreach v ∈ V do
    Output v's influence P'(v).
end

IV. EXPERIMENTS

To analyze the impact of big data processing frameworks on the performance of the user influence algorithm in social networks, we implemented the parallel algorithms above and ran them on a real large-scale dataset.

A. Experimental Data and Preprocessing

The experimental dataset was crawled from Sina Weibo with the help of Eefung⁶ and consists of microblogs that users posted from November 2, 2016 to June 26, 2017. Each microblog is a text record with five fields: timestamp, user id, id, forwarding user id, and forwarding microblog id. When a user publishes an original microblog, the forwarding user id and forwarding microblog id are both null. Table I lists the statistics of the collected dataset.

TABLE I. STATISTICS OF THE DATASET

Descriptions                                     Sina Weibo
microblogs                                       4,586,584,659
users                                            116,147,966
original microblogs                              1,079,801,756
forwarded original microblogs                    97,351,945
forwarding microblogs                            3,506,782,903
average forwardings of each original microblog   3.25

As we can see, fewer than 10% of the original microblogs are forwarded, and only around 0.2% of them are forwarded more than 500 times, indicating that a small fraction of users generate and control a large share of the information spread in social networks. Fig. 1 shows the forwarding distribution of original microblogs in the dataset, which follows a power-law distribution with an exponent of 2.13.

Figure 1. Forwardings distribution of original microblogs.

According to (1), measuring user influence requires a specific network structure. Here, we construct a weighted network by extracting the forwarding relationships between users in the dataset. To protect privacy, user ids are anonymized. We finally obtain a forwarding relationship dataset R of about 59 GB, consisting of triplets <user id, forwarding user id, frequency>. A tuple <u, v, $f_{u \to v}$> in R indicates that user u forwarded v's microblogs $f_{u \to v}$ times in total in the collected dataset. R contains 3,504,379,868 triplets involving 115,205,577 users and is stored in HDFS with a block size of 128 MB. A weighted social network graph G = {V, E, W} can then be constructed directly from R, where |V| = 115,205,577 and |E| = 3,504,379,868. If a tuple <u, v, $f_{u \to v}$> exists, then (v, u) ∈ E and $w_{v,u} = f_{u \to v}$ is the edge weight.
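The construction of G from the forwarding triplets in R can be illustrated with a minimal sketch. It is written in plain Python for readability (the experiments themselves use Java), and the function name `build_weighted_graph`, the sample triplets, and the choice to sum duplicate triplets are assumptions made only for this illustration.

```python
from collections import defaultdict


def build_weighted_graph(triplets):
    """Build (V, W) from forwarding triplets (u, v, f).

    A triplet (u, v, f) means that user u forwarded v's microblogs f times,
    which yields the edge (v, u) with weight w[(v, u)] = f.
    """
    vertices = set()
    weights = defaultdict(int)
    for u, v, freq in triplets:
        vertices.update((u, v))
        # Assumption: if the same (u, v) pair appears twice, frequencies add up.
        weights[(v, u)] += freq
    return vertices, dict(weights)


# Hypothetical toy data, not taken from the Sina Weibo dataset.
sample_R = [("a", "b", 3), ("a", "c", 1), ("b", "c", 2)]
V, W = build_weighted_graph(sample_R)
E = set(W)  # the edge set is the set of keys of the weight map
```

Here `W[("b", "a")]` holds the number of times user a forwarded b's microblogs, matching the convention that the edge points from the author to the forwarder.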
⁶ http://www.eefung.com/
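One iteration of Algorithm 2 can be emulated without a cluster to make the three operator steps concrete. The sketch below is plain Python, not Spark code, and it rests on two assumptions that this section does not confirm: the update takes the PageRank-style form P(v) = α Σ m_uv P(u) + (1 − α)/N, and every user starts with influence 1. The names `transition_probabilities` and `influence_iteration` are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(weights):
    """Normalize weights per (2): m[(u, v)] = w[(v, u)] / sum over v' of w[(v', u)]."""
    forwarded_total = defaultdict(float)
    for (_v, u), w in weights.items():
        forwarded_total[u] += w
    return {(u, v): w / forwarded_total[u] for (v, u), w in weights.items()}


def influence_iteration(vertices, m, p, alpha):
    """One iteration, mirroring the flatMap / reduceByKey / mapValues steps."""
    n = len(vertices)
    # flatMap step: each edge emits one propagation component (v, m_uv * P(u)).
    components = [(v, m_uv * p[u]) for (u, v), m_uv in m.items()]
    # reduceByKey(add) step: sum the components received by each user v.
    summed = defaultdict(float)
    for v, component in components:
        summed[v] += component
    # mapValues step: assumed PageRank-style update (see the lead-in).
    return {v: alpha * summed[v] + (1.0 - alpha) / n for v in vertices}


# Hypothetical toy graph: weights[(v, u)] = times u forwarded v's microblogs.
weights = {("b", "a"): 3, ("c", "a"): 1, ("c", "b"): 2}
vertices = {"a", "b", "c"}
m = transition_probabilities(weights)
p = {v: 1.0 for v in vertices}  # assumed initial influence of 1 per user
p = influence_iteration(vertices, m, p, alpha=0.85)
```

In a real Spark job the list and dictionary would be RDDs and the loop body would run in parallel across partitions; the data flow between the three steps is the same.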
Suppose that user influence spreads along the forwarding relationships as information propagates within the social network. The transition probability between users in M can then be calculated as:

$$m_{uv} = \frac{w_{v,u}}{\sum_{v' \in V,\ (v',u) \in E} w_{v',u}} \qquad (2)$$

B. Experimental Environment

Experiments are carried out on Tencent Cloud M2 servers, each with the following configuration: 8-core CPU, 64 GB memory, 500 GB hard disk, 1 Mbps bandwidth, and pre-installed Ubuntu Server 14.04.1 LTS 64-bit. We implemented all the proposed algorithms and preprocessed the data in Java 1.8. The two parallel algorithms run on Hadoop distributed clusters built from a pool of more than 128 independent M2 servers. The versions of Hadoop and Spark are 2.7.4 and 1.6.2, respectively. To further analyze the performance of the algorithms, we build clusters of different scales: six Hadoop clusters with 4, 8, 16, 32, 64, and 128 servers, respectively.

C. Evaluation Metrics

Similar to previous work, we consider widely used metrics, such as accuracy and running time, to evaluate the performance of these algorithms.

Because there are no standard test datasets for user influence measurement algorithms in social networks, we analyze convergence instead of accuracy. Specifically, given the user influence $P_n$ after the n-th iteration, the procedure terminates and the algorithm is considered converged when (3) holds:

$$\frac{\lVert P_{n+1} - P_n \rVert_1}{N} < \delta \qquad (3)$$

where N is the number of users and the margin of error is $\delta = 10^{-8}$. Intuitively, when the average change in influence per user does not exceed $10^{-8}$, the result is considered stable and convergent.

As for computation time, we record the wall-clock time under various parameter settings and calculate the speedup of each algorithm on different data sets and clusters. In addition, we consider the regularization factor α, which is usually set to 0.85 in web ranking. However, user forwarding behavior in social networks is quite different from web surfing, so we run experiments with several values: 0.5, 0.7, 0.85, and 0.95.

To explore how well the big data processing frameworks accelerate computation on data sets of different sizes, R is divided into several subsets D1, D2, D3, D4, and D5, as shown in Table II. The number of users involved in these data sets increases from one hundred thousand to more than one hundred million, and the number of forwarding relationships between users increases correspondingly, reaching about 3.5 billion. Density describes how densely the nodes of a social network are connected and refers to the proportion of non-zero elements in the adjacency matrix of graph G = {V, E, W}. Two social network data sets with the same number of users may have different network densities. Hence, we randomly sample data sets of various densities based on D2 to study the effect of density on the performance of the algorithms.

TABLE II. DESCRIPTION OF EXPERIMENTAL DATA SETS

Data sets   Users         Forwardings     Density
D1          100,000       4,185,326       4.19×10^-8
D2          1,000,000     94,091,124      9.41×10^-5
D2_A        1,000,000     940,634,687     9.41×10^-4
D2_B        1,000,000     9,411,885       9.41×10^-6
D2_C        1,000,000     941,482         9.41×10^-7
D3          10,000,000    662,538,337     6.63×10^-6
D4          50,000,000    1,368,256,085   5.47×10^-7
D5          115,205,577   3,504,379,868   1.19×10^-7

D. Results and Analysis

As mentioned earlier, there are no standard datasets for testing the performance of user influence measurement. We therefore run the proposed algorithms on real data sets and analyze the results in terms of convergence and efficiency.

1) Convergence

Because both are based on (1), the two parallel algorithms show the same convergence behavior under the same settings. We therefore take Algorithm 2 as an example. Likewise, for a given value of α, the convergence behavior on the same data set is identical across clusters, because the cluster size only changes the available computing resources, not the fundamental mechanism of the algorithm. Notably, when Algorithm 2 first satisfies (3) on D1, D2, D3, D4, and D5, the numbers of iterations are 83, 84, 84, 84, and 85, respectively.

Fig. 2 shows the convergence trends of Algorithm 2 on a 16-server cluster when α = 0.85. The convergence error drops sharply at first and then declines gently. Under criterion (3), convergence is essentially independent of the number of users. The value of α also affects convergence, as illustrated in Fig. 3, obtained on D4 with the 64-server cluster. Overall, the larger the value of α, the more slowly the algorithm converges. When α is 0.5, 0.7, and 0.85, the numbers of iterations required under (3) are 25, 44, and 84, respectively. When α = 0.95, however, the convergence error at the 212th iteration is $2.2 \times 10^{-8}$, and the convergence condition is still not satisfied. In practice, a reasonable value of the regularization factor α should be chosen according to the user behavior characteristics of the specific social network for influence measurement.
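The stopping rule in (3) compares the average absolute change in influence per user against δ. A minimal plain-Python sketch of that check follows; the function name `has_converged` and the dictionary representation of the influence vectors are assumptions for illustration.

```python
def has_converged(p_prev, p_next, delta=1e-8):
    """Return True when ||P_{n+1} - P_n||_1 / N < delta, as in (3)."""
    n = len(p_next)
    l1_change = sum(abs(p_next[v] - p_prev[v]) for v in p_next)
    return l1_change / n < delta


# Hypothetical influence vectors for two users.
before = {"a": 1.0, "b": 2.0}
small_step = {"a": 1.0 + 1e-8, "b": 2.0}   # average change of about 5e-9
large_step = {"a": 1.001, "b": 2.0}        # average change of about 5e-4
```

Because the L1 norm is divided by N, the threshold applies to the average per-user change, which is why the number of iterations reported above barely depends on the size of the data set.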
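The density column of Table II can be related to the other two columns with a short calculation. The sketch below assumes that density is the number of forwarding relationships divided by the number of entries in the |V| × |V| adjacency matrix; this reading is an assumption, although it reproduces the values listed for D2, D3, and D4. The function name `network_density` is hypothetical.

```python
def network_density(num_users, num_edges):
    """Share of non-zero entries in a num_users x num_users adjacency matrix."""
    if num_users <= 0:
        raise ValueError("num_users must be positive")
    return num_edges / (num_users ** 2)


# User and forwarding counts taken from Table II.
density_d2 = network_density(1_000_000, 94_091_124)
density_d3 = network_density(10_000_000, 662_538_337)
density_d4 = network_density(50_000_000, 1_368_256_085)
```

Formatting these results to three significant digits gives 9.41e-05, 6.63e-06, and 5.47e-07, in line with the table.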
2) Efficiency

Big data processing frameworks improve the ability to process large-scale data. Because running the experiments to convergence takes a long time on small clusters, we change the termination condition for the efficiency analysis: the algorithms terminate after a fixed number of iterations.

Speedup is adopted to evaluate the performance of the two parallel algorithms. Fig. 4 and Fig. 5 display the speedup on different clusters; the numbers of iterations on data sets D1, D2, D3, D4, and D5 are 50, 40, 30, 20, and 20, respectively. The speedup obtained by Algorithm 2 is markedly higher than that of Algorithm 1. The reason is that Spark is a memory-based framework, so iterative computation benefits more from it. Adding servers to a cluster generally boosts the acceleration of the parallel algorithms, especially on large-scale data sets, because more computing resources are available. Meanwhile, with larger data sets the computing resources are more fully utilized, resulting in more pronounced acceleration. Each curve in Fig. 5 has a noticeable turning point, at 8, 32, 32, 64, and 64 servers for D1 to D5, respectively, demonstrating that beyond a certain cluster size the performance on a given data set no longer increases, due to the parallel model and the particular big data processing framework. Consequently, from Fig. 4, the performance peaks of Algorithm 1 on D3, D4, and D5 would appear only if more servers were added in the experiment. On D1, however, the speedups of Algorithms 1 and 2 are both less than 1, because on such a small data set more time is spent launching the parallel tasks and distributing data in the distributed environment than is saved by parallel computation.

Figure 5. Speedups of algorithm 2 in different clusters with different data sets (α = 0.85).

Figure 6. Running time of algorithms per iteration with different α.
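The speedup values discussed for Fig. 4 and Fig. 5 can be read with the usual definition in mind: the running time of a baseline run divided by the running time on the cluster. The section does not spell out the baseline, so the definition below is an assumption, and the timings are hypothetical numbers chosen only to show a speedup above and below 1.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Ratio of baseline running time to parallel running time (assumed definition)."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("running times must be positive")
    return baseline_seconds / parallel_seconds


# Hypothetical wall-clock times in seconds, not measurements from the experiments.
large_data_speedup = speedup(baseline_seconds=120.0, parallel_seconds=30.0)
small_data_speedup = speedup(baseline_seconds=10.0, parallel_seconds=25.0)
```

A value below 1, as in the second call, corresponds to the situation described for D1, where task launch and data distribution cost more than the parallel computation saves.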
Fig. 6 shows the average running time per iteration of the algorithms over 20 iterations on D4 in a 64-server cluster for different values of α. Algorithm 1 takes more time than Algorithm 2 to complete one iteration regardless of the value of α. Interestingly, one iteration takes the most time when α = 0.95 and the least time when α = 0.85. The choice of α therefore has a direct impact on the efficiency of user influence algorithms based on (1) in social networks.

We also compare the performance of the algorithms on data sets with different network densities. Fig. 7 shows the average running time and its variance on D2, D2_A, D2_B, and D2_C in the 16-server cluster. As the density increases, the algorithms need more time to finish one iteration, and the variance at each point becomes larger as well. In conclusion, not only the number of users but also the density of the network graph built from the relationships between users affects the computational efficiency of the algorithms.

Figure 7. Running time of algorithms per iteration on data sets with different network densities (α = 0.85, 40 iterations).

V. CONCLUSIONS

This paper takes a classic social influence algorithm, implements it on two big data processing frameworks, and compares their performance on a real large-scale Sina Weibo dataset. The experimental results show that big data processing frameworks can significantly improve the efficiency of the user influence algorithm in social networks. Because MapReduce and Spark differ in their inherent parallelism, the performance of the resulting algorithms also differs. In practice, the parameter configuration and the properties of the datasets both have a direct impact on the convergence and computational efficiency of the corresponding algorithm.

In the experiments, the parameters of the big data processing frameworks were left at their default values. Future work can therefore improve the performance of the parallel influence algorithms by tuning these parameters.

ACKNOWLEDGMENT

The work is supported by the National Key Research and Development Program of China (No. 2017YFB0803303, No. 2016QY03D0601), the National Natural Science Foundation of China (No. 61502517), and the National Defense Science and Technology Project Funds (No. 3101283).

REFERENCES

[1] CIALDINI R B. Influence: science and practice[M]. Boston: Allyn and Bacon, 2003.
[2] TING I H, CHANG P S, WANG S L. Understanding microblog users for social recommendation based on social networks analysis[J]. Journal of Universal Computer Science, 2012, 18(4):554-576.
[3] LI N, GILLET D. Identifying influential scholars in academic social media platforms[A]. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining[C]. Ontario, Canada, 2013. 608-614.
[4] VEGA-OLIVEROS D A, BERTON L, LOPES A D A, et al. Influence maximization based on the least influential spreaders[A]. Proceedings of the 1st International Conference on Social Influence Analysis[C]. Buenos Aires, Argentina, 2015. 3-8.
[5] DINH T N, ZHANG H, NGUYEN D T, et al. Cost-effective viral marketing for time-critical campaigns in large-scale social networks[J]. IEEE/ACM Transactions on Networking, 2014, 22(6):2001-2011.
[6] KATZ E, LAZARSFELD P. Personal influence: the part played by people in the flow of mass communications[M]. New Jersey: Transaction Publishers, 1966.
[7] CHA M, HADDADI H, BENEVENUTO F, et al. Measuring user influence in twitter: the million follower fallacy[A]. International Conference on Weblogs and Social Media[C]. Washington, DC, USA, 2010. 10-17.
[8] DING Z, JIA Y, ZHOU B, et al. Mining topical influencers based on the multi-relational network in micro-blogging sites[J]. China Communications, 2013, 10(1):93-104.
[9] WENG J, LIM E P, JIANG J, et al. TwitterRank: finding topic-sensitive influential twitterers[A]. Proceedings of the third ACM International Conference on Web Search and Data Mining[C]. New York, USA, 2010. 261-270.
[10] TANG J, SUN J, WANG C, et al. Social influence analysis in large-scale networks[A]. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining[C]. Paris, France, 2009. 807-816.
[11] LIU X, LI M, LI S, et al. IMGPU: GPU-accelerated influence maximization in large-scale social networks[J]. IEEE Transactions on Parallel and Distributed Systems, 2014, 25(1):136-145.
[12] PING Y, XIANG Y, ZHANG B, et al. Implementation of parallel PageRank algorithm based on MapReduce[J]. Computer Engineering, 2014, 40(2):31-34.
[13] PAGE L, BRIN S, MOTWANI R, et al. The pagerank citation ranking: bringing order to the web[J]. Stanford Digital Libraries Working Paper, 1998, 9(1):1-14.
[14] TUNKELANG D. A twitter analog to pagerank[EB/OL]. http://tinyurl.com/9byt4z, 2009.
[15] SONG X, CHI Y, HINO K, et al. Identifying opinion leaders in the blogosphere[A]. Proceedings of the 6th ACM Conference on Information and Knowledge Management[C]. Lisbon, Portugal, 2007. 971-974.