
2018 the 3rd IEEE International Conference on Cloud Computing and Big Data Analysis

Performance Analysis of User Influence Algorithm under Big Data Processing Framework in Social Networks

Yong Quan
College of Computer
National University of Defense Technology
Changsha, China
e-mail: qy8801@nudt.edu.cn

Liang Zhang
College of Computer
National University of Defense Technology
Changsha, China
e-mail: gfkdzliang@163.com

Yan Jia
College of Computer
National University of Defense Technology
Changsha, China
e-mail: jiayanjy@vip.sina.com

Bin Zhou
College of Computer
National University of Defense Technology
Changsha, China
e-mail: binzhou@nudt.edu.cn

Abstract—Social influence plays an essential role in spreading information within online social networks, and it can be modeled or measured by analyzing various social networking data, such as published content, users' attributes, or the interactions among users. Because of the massive volume of social data, researchers often fail to quantify user influence accurately and efficiently. Big data techniques can be adopted to alleviate this problem. In this paper, we introduce a classical individual influence algorithm and implement two parallel versions of it on different big data processing frameworks. Experimental results on a large-scale real dataset demonstrate that big data processing frameworks can significantly improve the computational efficiency of the influence algorithm on massive data sets.

Keywords-influence measurement; performance analysis; big data technique; distributed computing; social networks

I. INTRODUCTION

Social network platforms, such as Sina Weibo¹ and Facebook², have gradually become mainstream network applications and have changed the ways people live and communicate. Users can publish, retweet, or share messages, and can interact with others in social networks, thereby affecting the spread of information. In general, how far a message published in a social network spreads depends heavily on the interactions between users, who have sophisticated social relationships. Social influence is the intrinsic incentive hidden beneath users' interactions, and these interactions are the external manifestation of social influence; social influence therefore has a direct impact on the dissemination of information in social networks.

Social influence can be conducive to public opinion guidance and social operation [1], and has a wide range of applications such as social recommendation [2], expertise location [3], influence maximization [4], and viral marketing [5]. The main objective is to quantitatively compute the influence of users and discover the influentials in social networks, who can be regarded as opinion leaders [6] or domain experts [3] in various fields. The influentials play a vital role in the adoption of innovation, the formation of network groups, and the dissemination and guidance of information. Owing to the limitations of theoretical models and experimental methods, early work could only analyze user influence qualitatively on small sample datasets and verify the existence of influence in the social system. In recent years, however, online social networks have provided a wealth of experimental data, making it possible for researchers to model and quantify user influence.

The existing literature mainly quantifies social influence from the aspects of network structure, user behavior, and interaction [7]-[9]. In fact, the graph model constructed from social networking data is rather complicated: it can contain billions of users, hundreds of billions of edges formed by the relationships between users, and the huge amount of information they generate. As of June 2017, Sina Weibo had reached 365 million active users³. This poses new challenges for the calculation of user influence, making it difficult to measure the influence of users efficiently at such an ultra-large scale. However, the advent of different categories of big data processing frameworks makes it possible to analyze these massive amounts of data efficiently. First, the user influence measurement model is analyzed and verified on a small sample data set or subgraph structure. Then, the model is implemented in parallel on a specific big data processing framework. Finally, the parallelized individual influence algorithms can be deployed in a cluster environment to efficiently compute the influence of
¹ http://weibo.com/
² https://www.facebook.com/
³ http://www.questmobile.com.cn/blog/blog_98.html

978-1-5386-4301-3/18/$31.00 ©2018 IEEE


users [10]-[12]. Currently, Hadoop⁴, an open-source distributed framework, is widely used. Based on the Hadoop platform, we select two parallel computing models to illustrate the impact of big data processing frameworks on the performance of the user influence algorithm, and we compare the performance gains of the different frameworks on a real large-scale dataset.

II. USER INFLUENCE MEASUREMENT

The topology of a social network can be represented by a graph model G = {V, E}, where V is the set of all users and E is the set of edges formed by the relationships between users. G can be a weighted graph, in which ω_uv denotes the weight of the edge between users u and v. Early methods quantify user influence with concepts from complex networks, such as the indegree and outdegree of a node, degree centrality, closeness, betweenness, and K-shell. However, these network-based methods have their own limitations: they do not consider user behavior or the interactions between users, which results in inaccurate measurement of user influence.

To measure the social influence of users in social networks accurately, scholars have drawn on the classic web ranking model, the PageRank algorithm [13]. The algorithm uses a Markov-based random walk to simulate the behavior of users browsing the web, and considers the importance of a web page to be determined by the importance of all web pages that link to it. Suppose G = {V, E} is a graph formed by web pages and their links, P is the vector of web page scores, and M is the transition probability matrix; the algorithm can then be expressed as a matrix product:

$$P = \alpha M^{T} P + \Delta \tag{1}$$

where α is the regularization factor and Δ is the correction term. Equation (1) is evaluated iteratively, and its time complexity is O(|E|^2). In practice, we can set Δ = βe, where e is the all-ones vector and β is an adjustment factor.

The process of user influence propagation is also a random walk. In [14], M is constructed from the follow relationships between Twitter users, and influence is calculated by letting Δ = (1 − α)e/N. To measure user influence at a finer granularity, scholars have proposed methods that combine user attributes and interactions with (1); specifically, M and Δ are reconstructed from the existing social networking data. For example, researchers compute user influence on different topics by constructing a topic-related influence matrix M [10], or by also considering topic similarity on the basis of (1) [9]. InfluenceRank combines (1) with the published information content [15]. An influence model based on the transition matrix M in a multi-relational network has also been proposed [8]. PageRank is thus the basic algorithm for measuring user influence, so we select (1) as the basic algorithm in this paper, and the transition probability matrix M is constructed from the forwarding relationships between users in order to calculate user influence on a large-scale real data set.

III. ALGORITHM PARALLELIZATION IMPLEMENTATION

Big data processing frameworks provide technical support for processing and analyzing large-scale social networking data. Traditional serial algorithms running on a single machine cannot meet the computational needs because of limited hardware resources, such as memory, CPU, and I/O. Using the following big data parallel computing frameworks, we implement (1) in a straightforward manner in order to analyze its efficiency.

A. MapReduce-Based Parallel Algorithm

MapReduce is a parallel computing framework proposed by Google for large-scale data processing. Computation is divided into a map stage and a reduce stage, and the input and output of each stage are key-value pairs whose data types can be customized. Based on this framework, we parallelize (1), mainly by rewriting the map function and the reduce function; the pseudocode is shown in Algorithm 1. The parallel algorithm is iterative, and every iteration performs the same operations until the termination condition is satisfied: the map operation spreads the influence of each user to the related users according to the edge weights, and the reduce operation collects the influence components and updates the current user's influence value according to (1).

ALGORITHM 1. MAPREDUCE-BASED USER INFLUENCE ALGORITHM
Input: A weighted social network graph G = {V, E, W}.
Output: User influence vector P.
Calculate transition probability matrix M = {m_uv}.
Initialize N = |V|, P = 1/N, regularization factor α.
repeat:
    map:
        foreach v ∈ V do
            foreach (v, u) ∈ E do
                Calculate influence propagation component P_{u→v} = m_uv × P(u).
            end
        end
    reduce:
        foreach v ∈ V do
            P′(v) = 0.
            foreach (v, u) ∈ E do
                Accumulate weighted influence components P′(v) = P′(v) + α × P_{u→v}.
            end
            Update P(v) = P′(v) + (1 − α)/N.
        end
until convergence
foreach v ∈ V do
    Output v's influence P(v).
end

B. Spark-Based Parallel Algorithm

Spark⁵ is a memory-based parallel computing framework developed by the AMPLab, whose main idea is to reduce
⁴ http://hadoop.apache.org/
⁵ http://spark.apache.org

the I/O of disk and network in order to increase the efficiency of big data processing. The Resilient Distributed Dataset (RDD) is its core abstraction for representing partitioned, immutable data sets that can be manipulated in parallel. An RDD abstracts both computation and data, and provides two types of operators: transformations and actions. A transformation converts one or more RDDs into a new RDD, while an action produces the final result from the generated RDDs.

We parallelize (1) with the Spark parallel computing framework and run the code in YARN mode. The pseudocode is shown in Algorithm 2. Like Algorithm 1, Algorithm 2 is iterative, and every iteration performs the same operations. The social network graph structure and the other data are first transformed into RDDs. The flatMap() operator spreads the influence of users, the reduceByKey(add) operator sums all the influence components, and the mapValues() operator updates the current user's influence value according to (1).

ALGORITHM 2. SPARK-BASED USER INFLUENCE ALGORITHM
Input: A weighted social network graph G = {V, E, W}.
Output: User influence vector P.
Calculate transition probability matrix M = {m_uv}.
Initialize N = |V|, P = 1/N, regularization factor α.
RDD(V, E, M) = SparkContext(G, M).SparkOperator.
repeat:
    foreach v ∈ V do
        foreach (v, u) ∈ E do
            Calculate influence propagation component RDD(v, P_{u→v}):
                RDD(V, E, M).flatMap(lambda: P_{u→v} = m_uv × P(u)).
            Update user influence RDD(v, P_v):
                RDD(v, P_{u→v}).reduceByKey(add).mapValues(lambda: P(v) = α × P(v) + (1 − α)/N).
        end
    end
until convergence
foreach v ∈ V do
    Output v's influence P(v).
end

IV. EXPERIMENTS

To analyze the impact of big data processing frameworks on the performance of user influence computation in social networks, we implemented the parallel algorithms above and ran them on a real large-scale dataset.

A. Experimental Data and Preprocessing

The experimental dataset was crawled from Sina Weibo with the help of Eefung⁶ and consists of microblogs that users posted from November 2, 2016 to June 26, 2017. Each microblog is a text record with five fields: timestamp, user id, microblog id, forwarding user id, and forwarding microblog id. When a user publishes an original microblog, the forwarding user id and forwarding microblog id are both null. Table I lists the statistics of the collected dataset.

TABLE I. STATISTICS OF THE DATASET
Descriptions                                       Sina Weibo
microblogs                                         4,586,584,659
users                                              116,147,966
original microblogs                                1,079,801,756
forwarded original microblogs                      97,351,945
forwarding microblogs                              3,506,782,903
average forwardings of each original microblog     3.25

As we can see, less than 10% of the original microblogs are forwarded, and only around 0.2% of them are forwarded more than 500 times, indicating that a small fraction of users generate and control a large share of the information spread in social networks. Fig. 1 shows the distribution of forwardings of original microblogs in the dataset, which conforms to a power-law distribution with an exponent of 2.13.

Figure 1. Forwardings distribution of original microblogs.

According to (1), measuring user influence requires a specific network structure. Here, we construct the weighted network by extracting the forwarding relationships between users in the dataset. To protect privacy, user ids are anonymized, and we finally obtain a forwarding relationship dataset R of about 59 GB, consisting of triplet tuples <user id, forwarding user id, frequency>. A tuple <u, v, f_{u→v}> in R indicates that user u has forwarded v's microblogs a total of f_{u→v} times in the collected dataset. The dataset R contains 3,504,379,868 triplet tuples involving 115,205,577 users and is stored in HDFS with a block size of 128 MB. A weighted social network graph G = {V, E, W} can then be constructed directly from R, where |V| = 115,205,577 and |E| = 3,504,379,868. If a tuple <u, v, f_{u→v}> exists, then (v, u) ∈ E and w_{v,u} = f_{u→v} is the edge weight.
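As a minimal sketch of the graph construction described in this subsection, the following Python snippet turns forwarding tuples <u, v, f> into directed edges (v, u) with weight w_{v,u} = f. The function name build_forwarding_graph and the toy tuples are illustrative assumptions; they are not taken from the authors' implementation, which was written in Java, or from their dataset.

```python
from collections import defaultdict


def build_forwarding_graph(tuples):
    """Build (V, W) from tuples (u, v, f): user u forwarded v's microblogs f times.

    Each tuple contributes the directed edge (v, u) with weight W[(v, u)] = f,
    following the convention stated in the text.
    """
    users = set()
    weights = defaultdict(int)
    for u, v, f in tuples:
        users.update((u, v))
        # Summing repeated (u, v) pairs is a defensive assumption; the paper's
        # dataset R is described as already aggregated per user pair.
        weights[(v, u)] += f
    return users, dict(weights)


# Hypothetical toy data, not drawn from the Sina Weibo dataset.
R = [("a", "b", 3), ("a", "c", 1), ("b", "c", 2), ("d", "c", 5)]
V, W = build_forwarding_graph(R)
print(len(V), len(W))    # |V| and |E| of the toy graph
print(W[("b", "a")])     # weight of edge (b, a): a forwarded b's microblogs
```

Keying the weights by (v, u) rather than (u, v) keeps the code aligned with the paper's edge direction, in which an edge points from the author of a microblog to the user who forwarded it.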

⁶ http://www.eefung.com/

Suppose that user influence spreads along the forwarding relationships, carried by the information within social networks. The transition probability between users in M can then be calculated as:

$$m_{uv} = \frac{w_{v,u}}{\sum_{v' \in V,\ (v', u) \in E} w_{v',u}} \tag{2}$$

B. Experimental Environment

Experiments are carried out on Tencent Cloud M2 servers with the following hardware environment: 8-core CPU, 64 GB memory, 500 GB hard disk, 1 Mbps bandwidth, and pre-installed Ubuntu Server 14.04.1 LTS 64-bit. We implemented all the proposed algorithms and preprocessed the data in Java 1.8. The two parallel algorithms based on big data processing frameworks are run on various Hadoop distributed clusters built from more than 128 independent M2 servers. The versions of Hadoop and Spark are 2.7.4 and 1.6.2, respectively. To further analyze the performance of the algorithms, we build clusters of different scales. Specifically, we set up six Hadoop clusters with 4, 8, 16, 32, 64 and 128 servers, respectively.

C. Evaluation Metrics

Similar to previous work, we use widely adopted metrics, such as accuracy and running time, to evaluate the performance of these algorithms.

Because there are no standard test datasets for user influence measurement algorithms in social networks, we analyze convergence instead of accuracy. Specifically, given the user influence P_n after the n-th iteration, the procedure terminates and the algorithm is considered converged when (3) holds:

$$\frac{\lVert P_{n+1} - P_{n} \rVert_{1}}{N} < \delta \tag{3}$$

where N is the number of users and the margin of error δ = 10^-8. The intuition is that when the average change in each user's influence does not exceed 10^-8, the result is stable and has converged.

As for computation time, under various parameter settings we record the wall-clock time and calculate the speedup of each algorithm on different data sets and clusters. In addition, we consider the factor α, which is usually 0.85 in web ranking. However, user forwarding behavior in social networks is quite different from web surfing, so we carry out experiments with several values: 0.5, 0.7, 0.85 and 0.95.

To explore the acceleration achieved by big data processing frameworks on data sets of different sizes, R is divided into several subsets D1, D2, D3, D4 and D5, as shown in Table II. The number of online social network users involved in these data sets increases from one hundred thousand to about one hundred million, and the number of forwarding relationships between users increases correspondingly, reaching a maximum of about 3.5 billion. Density describes how closely the nodes of a social network are connected, and is defined as the proportion of non-zero elements in the adjacency matrix of graph G = {V, E, W}. Two social network data sets with the same number of users may have different network densities. Hence, we randomly sample data sets with various densities based on D2 to study the effect of density on the performance of the algorithms.

TABLE II. DESCRIPTION OF EXPERIMENTAL DATA SETS
Data sets   Users         Forwardings     Density
D1          100,000       4,185,326       4.19×10^-8
D2          1,000,000     94,091,124      9.41×10^-5
D2_A        1,000,000     940,634,687     9.41×10^-4
D2_B        1,000,000     9,411,885       9.41×10^-6
D2_C        1,000,000     941,482         9.41×10^-7
D3          10,000,000    662,538,337     6.63×10^-6
D4          50,000,000    1,368,256,085   5.47×10^-7
D5          115,205,577   3,504,379,868   1.19×10^-7

D. Results and Analysis

As mentioned earlier, there are no standard datasets for testing the performance of user influence measurements. We therefore run the proposed algorithms on real data sets and analyze the results in terms of convergence and efficiency.

1) Convergence
Because both are based on (1), the two parallel algorithms exhibit the same convergence behavior under the same settings. We therefore take algorithm 2 as an example. Given the value of α, the convergence behavior on the same data set is also identical across clusters, because the cluster size only changes the available computing resources and does not change the fundamental mechanism of the algorithm. Notably, when algorithm 2 first satisfies (3) on D1, D2, D3, D4 and D5, the numbers of iterations are 83, 84, 84, 84 and 85, respectively.

Fig. 2 shows the convergence trends of algorithm 2 in a 16-server cluster when α = 0.85. At the beginning the convergence errors drop sharply, and afterwards they decrease gently. Under criterion (3), convergence is evidently almost independent of the number of users. The value of α also affects the convergence of the algorithms, as illustrated in Fig. 3, which is drawn from D4 on the 64-server cluster. Overall, the larger the value of α, the slower the algorithm converges. When α is 0.5, 0.7 and 0.85, the numbers of iterations required under (3) are 25, 44 and 84, respectively. When α = 0.95, the convergence error at the 212th iteration is 2.2×10^-8, and the convergence condition is still not satisfied. In practice, the regularization factor α should be set to a reasonable value that reflects the user behavior characteristics of the specific social network being measured.
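The interplay of (1), (2) and (3) can be sketched serially in a few lines of Python. This is a single-machine illustration under stated assumptions, not the authors' parallel Java implementation: the function name user_influence, the max_iter safeguard and the toy tuples are hypothetical, Δ is taken as (1 − α)e/N as in Algorithm 1, and users who never forwarded anything simply propagate no influence.

```python
def user_influence(tuples, alpha=0.85, delta=1e-8, max_iter=1000):
    """Iterate P = alpha * M^T P + (1 - alpha)/N until criterion (3) holds.

    tuples: (u, v, f), meaning user u forwarded v's microblogs f times.
    Returns (P, number_of_iterations).
    """
    users = sorted({x for u, v, _ in tuples for x in (u, v)})
    n = len(users)

    # Eq. (2): m_uv = f_{u->v} divided by the total forwardings made by u.
    forwarded = {}
    for u, _, f in tuples:
        forwarded[u] = forwarded.get(u, 0) + f
    m = [(u, v, f / forwarded[u]) for u, v, f in tuples]

    p = {x: 1.0 / n for x in users}
    for iteration in range(1, max_iter + 1):
        new_p = {x: (1.0 - alpha) / n for x in users}
        for u, v, m_uv in m:
            new_p[v] += alpha * m_uv * p[u]   # influence flows from u to v
        # Criterion (3): average absolute change per user.
        error = sum(abs(new_p[x] - p[x]) for x in users) / n
        p = new_p
        if error < delta:
            break
    return p, iteration


# Hypothetical toy data, not drawn from the Sina Weibo dataset.
R = [("a", "b", 3), ("a", "c", 1), ("b", "c", 2), ("c", "a", 1), ("d", "c", 5)]
for alpha in (0.5, 0.7, 0.85):
    scores, iterations = user_influence(R, alpha=alpha)
    print(alpha, iterations, max(scores, key=scores.get))
```

On this toy graph the iteration count grows with α, which is the same direction as the 25, 44 and 84 iterations reported for D4; the absolute counts are not comparable because the graphs differ.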

Figure 2. Convergent tendencies of algorithm 2 on different data sets.

Figure 3. Convergent tendencies of algorithm 2 with different α.

2) Efficiency
Big data processing frameworks improve the ability to process large-scale data. Because running our experiments to convergence takes a long time on small clusters, we change the termination condition for the efficiency analysis: the algorithms terminate after a fixed number of iterations.

Speedup is adopted to evaluate the performance of the two parallel algorithms. Fig. 4 and Fig. 5 display the speedups in different clusters; the numbers of iterations on data sets D1, D2, D3, D4 and D5 are 50, 40, 30, 20 and 20, respectively. Strikingly, the speedup obtained by algorithm 2 is higher than that of algorithm 1. The reason is that Spark is a memory-based framework, so iterative computation benefits more from it. In general, adding servers to a cluster boosts the acceleration of the parallel algorithms, especially on large-scale data sets, because more computing resources are available. Meanwhile, with larger data sets the computing resources are more fully utilized, resulting in more pronounced acceleration.

Figure 4. Speedups of algorithm 1 in different clusters with different data sets (α = 0.85).

Figure 5. Speedups of algorithm 2 in different clusters with different data sets (α = 0.85).

There are noticeable turning points in all curves in Fig. 5, at 8, 32, 32, 64 and 64 servers for D1 to D5 respectively, demonstrating that beyond a certain cluster size the performance on a given data set stops increasing, owing to the parallel model and the particular big data processing framework. Accordingly, from Fig. 4, the performance peaks of algorithm 1 on D3, D4 and D5 would be expected to appear if more servers were added in the experiment. On D1, however, the speedups of algorithms 1 and 2 are both less than 1, because for such a small data set more time is spent launching the parallel tasks and distributing data in the distributed environment than is saved by parallel execution.

Figure 6. Running time of algorithms per iteration with different α.
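To make the per-iteration structure behind this comparison concrete, one iteration of Algorithm 1 can be emulated in plain Python as a map step that emits key-value pairs and a reduce step that sums them per user. This is only a single-process sketch: the names map_phase and reduce_phase, the dictionary layout of M and the toy values are assumptions, and no Hadoop or Spark API is involved. In Hadoop MapReduce each such iteration typically runs as a separate job whose output is written back to HDFS, whereas Spark can keep the intermediate RDDs in memory, which is consistent with the higher speedups reported for algorithm 2.

```python
from collections import defaultdict


def map_phase(m, p):
    """Emit (v, m_uv * P(u)) for every forwarding edge, as in Algorithm 1's map."""
    for (u, v), m_uv in m.items():
        yield v, m_uv * p[u]


def reduce_phase(pairs, users, alpha):
    """Sum the components per user and apply the update of Eq. (1)."""
    acc = defaultdict(float)
    for v, component in pairs:      # stands in for the shuffle and the reduce
        acc[v] += alpha * component
    n = len(users)
    return {v: acc[v] + (1.0 - alpha) / n for v in users}


# Hypothetical transition probabilities m[(u, v)]; each user's row sums to 1.
M = {("a", "b"): 0.75, ("a", "c"): 0.25, ("b", "c"): 1.0,
     ("c", "a"): 1.0, ("d", "c"): 1.0}
users = ["a", "b", "c", "d"]
P = {u: 1.0 / len(users) for u in users}

# One iteration; a driver would repeat this until criterion (3) holds.
P = reduce_phase(map_phase(M, P), users, alpha=0.85)
print(P)
```

Passing the generator from map_phase straight into reduce_phase keeps the sketch short; in a real cluster the pairs would be partitioned by key and shipped across the network between the two stages.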

For the 64-server cluster, Fig. 6 shows the average per-iteration running time of the algorithms over 20 iterations on D4 when α takes different values. Algorithm 1 takes more time than algorithm 2 to complete one iteration regardless of the value of α. Interestingly, one iteration takes the most time when α = 0.95 and the least time when α = 0.85. The choice of α therefore has a direct impact on the efficiency of user influence algorithms based on (1) in social networks.

We also compare the performance of the algorithms on data sets with different network densities. Fig. 7 shows the average running time and its variance on D2, D2_A, D2_B and D2_C in the 16-server cluster. As the density increases, the algorithms need more time to finish one iteration, and the variance at each point becomes larger as well. In conclusion, not only the number of users but also the density of the network graph constructed from the relationships between users affects the computational efficiency of the algorithms.

Figure 7. Running time of algorithms per iteration on data sets with different network densities (α = 0.85, 40 iterations).

V. CONCLUSIONS

This paper takes a classic social influence algorithm and combines it with two big data processing frameworks to compare their performance on a real large-scale Sina Weibo dataset. The experimental results show that big data processing frameworks can significantly improve the efficiency of the user influence algorithm in social networks. Because MapReduce and Spark differ in their inherent parallelism, the performance of the corresponding algorithms differs as well. In practice, both the configuration of parameters and the properties of the datasets have a direct impact on the convergence and computational efficiency of the algorithms.

In the experiments, the parameters of the big data processing frameworks were left at their default values. Future work can therefore improve the performance of the parallel influence algorithms by optimizing the parameters of the big data processing frameworks.

ACKNOWLEDGMENT
The work is supported by the National Key Research and Development Program of China (No. 2017YFB0803303, No. 2016QY03D0601), the National Natural Science Foundation of China (No. 61502517), and the National Defense Science and Technology Project Funds (No. 3101283).

REFERENCES
[1] CIALDINI R B. Influence: science and practice[M]. Boston: Allyn and Bacon, 2003.
[2] TING I H, CHANG P S, WANG S L. Understanding microblog users for social recommendation based on social networks analysis[J]. Journal of Universal Computer Science, 2012, 18(4):554-576.
[3] LI N, GILLET D. Identifying influential scholars in academic social media platforms[A]. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining[C]. Ontario, Canada, 2013. 608-614.
[4] VEGA-OLIVEROS D A, BERTON L, LOPES A D A, et al. Influence maximization based on the least influential spreaders[A]. Proceedings of the 1st International Conference on Social Influence Analysis[C]. Buenos Aires, Argentina, 2015. 3-8.
[5] DINH T N, ZHANG H, NGUYEN D T, et al. Cost-effective viral marketing for time-critical campaigns in large-scale social networks[J]. IEEE/ACM Transactions on Networking, 2014, 22(6):2001-2011.
[6] KATZ E, LAZARSFELD P. Personal influence: the part played by people in the flow of mass communications[M]. New Jersey: Transaction Publishers, 1966.
[7] CHA M, HADDADI H, BENEVENUTO F, et al. Measuring user influence in twitter: the million follower fallacy[A]. International Conference on Weblogs and Social Media[C]. Washington, DC, USA, 2010. 10-17.
[8] DING Z, JIA Y, ZHOU B, et al. Mining topical influencers based on the multi-relational network in micro-blogging sites[J]. China Communications, 2013, 10(1):93-104.
[9] WENG J, LIM E P, JIANG J, et al. TwitterRank: finding topic-sensitive influential twitterers[A]. Proceedings of the third ACM International Conference on Web Search and Data Mining[C]. New York, USA, 2010. 261-270.
[10] TANG J, SUN J, WANG C, et al. Social influence analysis in large-scale networks[A]. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining[C]. Paris, France, 2009. 807-816.
[11] LIU X, LI M, LI S, et al. IMGPU: GPU-accelerated influence maximization in large-scale social networks[J]. IEEE Transactions on Parallel and Distributed Systems, 2014, 25(1):136-145.
[12] PING Y, XIANG Y, ZHANG B, et al. Implementation of parallel pageRank algorithm based on MapReduce[J]. Computer Engineering, 2014, 40(2):31-34.
[13] PAGE L, BRIN S, MOTWANI R, et al. The pagerank citation ranking: bringing order to the web[J]. Stanford Digital Libraries Working Paper, 1998, 9(1):1-14.
[14] TUNKELANG D. A twitter analog to pagerank[EB/OL]. http://tinyurl.com/9byt4z, 2009.
[15] SONG X, CHI Y, HINO K, et al. Identifying opinion leaders in the blogosphere[A]. Proceedings of the 6th ACM Conference on Information and Knowledge Management[C]. Lisbon, Portugal, 2007. 971-974.

