
2015 Seventh International Symposium on Parallel Architectures, Algorithms and Programming

Spark: A Big Data Processing Platform Based on Memory Computing

Zhijie Han
Institute of Data and Knowledge Engineering
Nanjing University of Posts and Telecommunications
Nanjing, Jiangsu, China

Yujie Zhang
Institute of Data and Knowledge Engineering
Henan University
Kaifeng, Henan, China

Abstract—Spark is a memory-based computing framework with strong computing power and fault tolerance that supports batch, interactive, iterative, and streaming computation. In this paper, we analyze Spark's primary framework and core technologies, and point out the advantages and disadvantages of Spark. Finally, we discuss future trends of Spark technologies.

Keywords—Spark; Memory Computing; Spark SQL; MLlib; GraphX; Spark Streaming

I. INTRODUCTION

With the rapid development of Big Data over the past few years, more and more applications need to be extended to large clusters [1]. The programmable cluster environment has brought several challenges. First, many applications need to be rewritten in a parallel manner, and programmable clusters need to process more types of data computation. Second, the fault tolerance of clusters is more important and more difficult. Third, clusters dynamically allocate computing resources among shared users, which increases interference between applications. With the rapid increase of applications, cluster computing requires a working solution that suits different kinds of computation. Spark is a concurrent framework for general cluster computing and data analysis developed by the UC Berkeley AMP Lab. Spark offers a very good platform to integrate MapReduce [2], streaming [3], SQL, Machine Learning, graph processing, and other applications. Spark also provides a consistent API and unified deployment, which offers a complete solution for data computing on Big Data.

This section describes the issues and advantages of Spark. Section II describes what Spark is and its core technology. Section III describes the four sub-frameworks of Spark. Section IV discusses trends of Spark technologies. Section V summarizes the work in this paper and introduces further research work.

II. SPARK AND ITS CORE TECHNOLOGY INTRODUCTION

Spark is a general distributed computing framework based on the Hadoop MapReduce model [4]. It absorbs the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate and output results of Spark jobs can be stored in memory, which is called Memory Computing. Memory Computing improves the efficiency of data computation, so Spark is better suited for iterative applications such as Data Mining and Machine Learning [5].

Spark revolves around the concept of the RDD (Resilient Distributed Dataset) [6]: a fault-tolerant collection of elements that can be operated on in parallel. Spark provides a new fault-tolerant way of computing data, which effectively reduces disk and network I/O overhead. An RDD is a fault-tolerant, parallel data structure that allows users to explicitly store data on disk or in memory and to control data partitioning. An RDD is a read-only data collection, which naturally supports fault tolerance: because an RDD is immutable, the graph of operations that produced it can be remembered and replayed. At the same time, the RDD provides a rich set of operations to manipulate the data [6]. Operations such as map, flatMap, and filter implement the monad pattern and fit naturally into Scala. In addition, the RDD provides more convenient operations such as join, groupBy, and reduceByKey.

Currently, Spark has developed into a data computing platform with many components. The ecosystem of Spark is called BDAS (the Berkeley Data Analytics Stack) [7]. The core framework of BDAS is Spark, which supports Spark SQL (a query and analysis engine), MLlib (an underlying distributed Machine Learning library), GraphX (a parallel graph computation framework), Spark Streaming (a stream computation framework), distributed in-memory file systems, and resource management frameworks. This ecosystem is shown in Fig. 1.

Fig.1. The ecosystem of Spark

In general, the core technology and foundational architecture of Spark is the RDD, and Spark SQL, MLlib, GraphX, and Spark Streaming are the core members of the Spark ecosystem.
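The RDD operations named above follow familiar functional semantics. As a minimal illustration of those semantics (plain Python over lists, not Spark's actual distributed API), map, flatMap, filter, and reduceByKey behave roughly as follows:

```python
from itertools import chain

def rdd_map(data, f):
    # map: apply f to every element
    return [f(x) for x in data]

def rdd_flat_map(data, f):
    # flatMap: apply f (which returns a sequence) and flatten the results
    return list(chain.from_iterable(f(x) for x in data))

def rdd_filter(data, pred):
    # filter: keep only elements satisfying the predicate
    return [x for x in data if pred(x)]

def rdd_reduce_by_key(pairs, op):
    # reduceByKey: combine all values that share the same key
    acc = {}
    for k, v in pairs:
        acc[k] = op(acc[k], v) if k in acc else v
    return sorted(acc.items())

# A tiny word count, the canonical RDD example:
lines = ["spark is fast", "spark is general"]
words = rdd_flat_map(lines, str.split)          # split lines into words
pairs = rdd_map(words, lambda w: (w, 1))        # pair each word with 1
counts = rdd_reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # [('fast', 1), ('general', 1), ('is', 2), ('spark', 2)]
```

In Spark itself these operations would run partition-by-partition across a cluster; the sketch only shows the per-element semantics.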

2168-3042/15 $31.00 © 2015 IEEE

DOI 10.1109/PAAP.2015.41
III. THE CORE MEMBERS OF THE SPARK ECOSYSTEM

A. Spark SQL

As a component of Spark, Spark SQL [8] supports SQL execution in Spark. Spark SQL evolved from Shark [9] and reduces the dependence on Hive. Spark SQL absorbs Shark's In-Memory Columnar Storage and its compatibility with Hive, and has developed greatly in terms of data compatibility, performance optimization, and component extension. Generally, Spark SQL consists of four modules: Core, Catalyst, Hive, and Hive-Thriftserver.

• The Core module processes input and output data, reading data from different sources (RDD, Parquet, JSON, etc.) and emitting query results as a SchemaRDD [10].

• The Catalyst module handles the query statement throughout the entire process, including parsing, binding, optimization, and physical planning.

• The Hive module provides CLI and JDBC/ODBC interfaces for Hive data processing. Among these modules, Catalyst is the core part, and its performance affects the overall performance.

Generally speaking, Spark SQL has three major advantages:

• Data compatibility: Spark SQL is not only compatible with Hive, but can also obtain data from RDDs, Parquet, and JSON files.

• Performance optimization: in addition to In-Memory Columnar Storage, byte-code generation, and other optimization techniques, Spark SQL introduces a cost model to dynamically evaluate queries and obtain the best physical plan.

• Component extension: the SQL parser, analyzer, and optimizer can all be redefined and extended.

Spark SQL is a core project of Spark, so its optimization is significant; the main goals are full utilization of the hardware resources and of the distributed parallel computing performance of the system.

B. MLlib

MLlib [11] (Machine Learning library) is Spark's scalable Machine Learning library; it includes the relevant tests and data generators. Its Machine Learning algorithms can run up to 100 times faster than MapReduce [2]. MLlib supports the main Machine Learning algorithms, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, and supports sparse matrices [12]. So MLlib has brought great convenience to users. Among these algorithms, classification is the most basic task of Machine Learning; a typical application scenario is identifying the profit sources of Internet companies. Most Machine Learning algorithms include two parts, training and prediction [13], which Spark follows in its Machine Learning support. MLlib provides APIs in three programming languages: Python, Java, and Scala. In addition, MLlib also provides sample code to help users with different backgrounds.

Machine Learning is divided into classification and regression analysis based on the results [14]. MLlib supports many classification algorithms, such as SVM (Support Vector Machine), LR (Logistic Regression), DT (Decision Tree), and NB (Naive Bayes); MLlib supports regression analysis algorithms such as Linear Least Squares, Lasso, and Ridge Regression.

Many standard Machine Learning methods can be formulated as a convex optimization problem min_{w ∈ ℝ^d} f(w), where the objective is given in equation (1):

    f(w) := λR(w) + (1/n) Σ_{i=1}^{n} L(w; x_i, y_i)    (1)

In this equation, w is the variable vector (called "weights" in the code), R(w) is the regularizer, and L is the loss. The vectors x_i ∈ ℝ^d are the training examples, for 1 ≤ i ≤ n, and y_i ∈ ℝ is the corresponding label we want to predict. We call the method linear if L(w; x, y) can be expressed as a function of w^T x and y. Several of MLlib's classification and regression algorithms are of this linear type.

MLlib can be used to make recommendations. The process of movie recommendation is as follows:

• Get the data set: load a dataset from HDFS which contains 1 million ratings from 5000 users on 3000 movies.

• Collaborative filtering: MLlib supports model-based collaborative filtering, in which a small set of latent factors is used to predict missing entries. In this process, we use the alternating least squares (ALS) algorithm to make movie recommendations.

• Setup: we use a standalone project template for this exercise. First we configure the path and file, then we create a SparkConf object and use it to create a Spark context object.

• Running the program: compile the MovieLensALS class and create a JAR file; the output can be seen on the screen.

• Rating elicitation: count the ratings received by each movie and sort the movies by rating counts. Then take the 50 highest-rated movies and sample a small subset for rating elicitation.

• Splitting training data: MLlib trains different models on the training set and selects the best model on the validation set based on RMSE (Root Mean Squared Error). Finally, add the user's ratings to the training set and make recommendations.

• Training using ALS: the movie recommender uses ALS.train to train a group of models, then selects and evaluates the best one.

• Movie recommendation: according to the trained model, we can list the top 50 movies on the screen.

In general, MLlib stands out among general Machine Learning algorithm libraries and tools. Its evaluation indexes include Accuracy, Recall, F-measure, the ROC (Receiver Operating Characteristic) curve [15], the Precision-Recall curve, and AUC (Area Under the Curve) [16]; AUC is used to compare model accuracy, while Recall, F-measure, and ROC are used to determine the threshold.
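As a concrete instance of equation (1), the following sketch (plain Python, not MLlib itself; the toy data and λ value are invented for the example) evaluates f(w) = λR(w) + (1/n) Σ L(w; x_i, y_i) with the L2 regularizer R(w) = ½‖w‖² and the squared loss, i.e. linear least squares:

```python
# Regularized objective from equation (1), with:
#   R(w)       = 0.5 * ||w||^2           (L2 regularizer)
#   L(w; x, y) = 0.5 * (w.x - y)^2       (squared loss -> linear least squares)
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, X, y, lam):
    n = len(X)
    reg = 0.5 * dot(w, w)                       # R(w)
    loss = sum(0.5 * (dot(w, xi) - yi) ** 2     # L(w; x_i, y_i)
               for xi, yi in zip(X, y)) / n     # averaged over n examples
    return lam * reg + loss

# Toy data: y = 2*x exactly, so w = [2.0] gives zero loss.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]

print(objective([2.0], X, y, lam=0.0))  # 0.0: perfect fit, no regularization
print(objective([0.0], X, y, lam=0.0))  # mean of 0.5*y_i^2 = (2 + 8 + 18) / 3
```

Note that this loss is also a function of w^T x and y only, which is exactly the "linear method" condition stated above; MLlib minimizes such objectives with distributed gradient-based solvers rather than by direct evaluation.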
the Microblogging applications to express the mutual
C. GraphX

GraphX [17] is a parallel computation API for graph processing in Spark. GraphX was developed from Bagel, with great improvements in performance and in reducing memory overhead. GraphX extends Spark's RDD by introducing the Resilient Distributed Property Graph, a multigraph in which attributes can be attached to each vertex and edge [18]. To support graph computation, GraphX exposes a basic set of operators, such as subgraph, joinVertices, mapReduceTriplets, and an optimized variant of the Pregel API. Moreover, GraphX contains a package of graph algorithms and builders to simplify graph analysis tasks.

As a graph processing framework of Spark, GraphX supports the following algorithms: the PageRank algorithm, the Connected Components algorithm, and the Triangle Counting algorithm.

• The PageRank algorithm is Google's proprietary algorithm [19], used to measure the relative importance of a specific page against the other pages in a search engine, as shown in equation (2). The importance of a page is determined by "votes", and the votes of a page are determined by all the pages that link to it. P1, P2, ..., Pn are the pages under study, M(Pi) is the set of pages that link in to page Pi, L(Pj) is the number of links out of page Pj, and N is the total number of pages:

    PageRank(P_i) = (1-q)/N + q Σ_{P_j ∈ M(P_i)} PageRank(P_j)/L(P_j)    (2)

• The Connected Components algorithm is used to find the related users for a theme [20]. In GraphX terms: in an undirected graph, if there is a path from vertex vi to vertex vj, then vi and vj are said to be connected. If every pair of vertices in the graph is connected, the graph is called a Connected Graph [21]; otherwise it is an Unconnected Graph. A maximal connected subgraph is called a Connected Component. A Connected Graph has only one Connected Component, itself; an Unconnected Graph has multiple Connected Components, as shown in Fig. 2. Finding connected components is a core application of graph computation, for example identifying clusters by keywords.

• The Triangle Counting algorithm [22] counts the number of triangle subgraphs in a large graph. In GraphX, it can be used for community discovery. For example, in microblogging applications it captures mutual-follow relationships: the relationships between mutually followed users form many triangles, and each triangle indicates a stable and intimate relationship between the objects involved.

D. Spark Streaming

The motivation for real-time processing of data streams is to obtain real-time information and extract the most valuable data at arrival time [23]. Spark Streaming is a stream computing framework based on Spark [24]; it provides a rich API and integrates streaming, batch, and interactive query applications. Spark Streaming divides stream computing tasks into a series of short DStreams (Discretized Streams) [25], where each piece of data is converted into an RDD. A transformation on the DStream becomes transformations on the underlying RDDs, whose intermediate results can be stored in memory. Fig. 3 shows the entire process of Spark Streaming.

Fig.3. Spark Streaming processing flow

Spark Streaming has become a hot topic in real-time streaming data computing due to several advantages:

• Fault tolerance: fault tolerance is very important in stream computing. Spark Streaming uses the fault tolerance of the RDD: each RDD is connected by lineage [7], so if the input data is fault-tolerant, erroneous or unavailable partitions can be recomputed by re-running the transformations on the original input data. In addition, Spark Streaming can replicate input data streams when they are loaded from disk or network. As a result, the fault tolerance of Spark Streaming is more efficient than that of continuous-model systems such as Storm.

• Scalability and throughput: Spark Streaming can linearly scale to 100 nodes (4 cores each) on EC2 (Elastic Compute Cloud) [26], and it can process data at 60M records/s with a few seconds of delay. In the grep (Globally search a Regular Expression and Print) test [24], the throughput of Spark Streaming is 670k records/s per node, far greater than Storm.
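The discretization idea behind DStreams can be sketched in a few lines (plain Python, not Spark Streaming's API; the timestamps and batch interval are invented for the example): incoming records are grouped into short time-sliced batches, and each batch is then processed as an ordinary RDD-like collection:

```python
def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive time batches,
    mimicking how a DStream slices a stream into per-interval RDDs."""
    batches = {}
    for t, v in records:
        batches.setdefault(int(t // batch_interval), []).append(v)
    return [batches[k] for k in sorted(batches)]

# A toy stream of (arrival_time_seconds, value) pairs:
stream = [(0.2, 1), (0.7, 2), (1.1, 3), (1.9, 4), (2.5, 5)]

# 1-second batch interval -> three micro-batches: [1, 2], [3, 4], [5]
batches = discretize(stream, batch_interval=1.0)
print(batches)                       # [[1, 2], [3, 4], [5]]

# A per-batch transformation, e.g. summing each micro-batch:
print([sum(b) for b in batches])     # [3, 7, 5]
```

Because each micro-batch is an immutable RDD with a recorded lineage, a lost batch result can be recomputed from its input slice, which is exactly the fault-tolerance property described above.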

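Returning to the GraphX algorithms of Section III-C, the PageRank update of equation (2) can likewise be sketched as a simple fixed-point iteration (plain Python, not GraphX; the three-page link graph is invented for the example):

```python
def pagerank(links, q=0.85, iters=50):
    """Iterate equation (2): PR(Pi) = (1-q)/N + q * sum over Pj in M(Pi) of PR(Pj)/L(Pj).
    `links` maps each page to the list of pages it links out to."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}            # start from a uniform rank
    for _ in range(iters):
        nxt = {p: (1 - q) / N for p in pages}   # the (1-q)/N term
        for pj, outs in links.items():
            share = q * pr[pj] / len(outs)      # PR(Pj)/L(Pj), damped by q
            for pi in outs:                     # distribute votes along out-links
                nxt[pi] += share
        pr = nxt
    return pr

# Toy 3-page web: A -> B, A -> C, B -> C, C -> A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The ranks always sum to 1, and C outranks A, which outranks B: C receives votes from both A and B, while B only receives half of A's vote. GraphX runs the same update in parallel over the distributed property graph via its Pregel-style API.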
Fig.2. Unconnected Graph G and its two connected components H1 and H2

IV. DISCUSSION
Spark specializes in a variety of business scenarios, such as Machine Learning, graph computation, and stream processing. The applications of Spark can be divided into the following three categories: the first is Machine Learning and complex graph computation, such as relation analysis, similarity measurement, community discovery, and data mining; the second is the migration of complex, interactive OLAP (On-Line Analytical Processing) [27]; the third is stream processing.

Spark provides rich programming models and many data processing applications, with high performance and convenience. Thus, Spark has a great advantage in becoming the next core platform for Big Data analysis.

Meanwhile, Spark still needs further improvement in robustness, availability, stability, and performance. It can be improved in the following aspects:

• Memory management. The core advantage of Spark is memory computing, so memory management is very important for Spark. For example, we can optimize the code to free up initial memory space, and we can formulate a hierarchical storage strategy to rationally allocate data among memory and disk. By these methods we can make full use of the cache, gain more space to analyze the data, reduce GC (Garbage Collection) [28] overhead, and improve efficiency.

• Improving I/O throughput. For example, storing data locally can reduce network bandwidth consumption and improve network utilization. Moreover, some sophisticated Machine Learning algorithms rely on algebraic operations such as matrix computation; if we take advantage of instruction-level optimization, their performance can be greatly improved.

• CPU optimization. Optimizing across the whole software stack can improve the efficiency of data computation, so improving and optimizing CPU performance plays an important role in the balance of the system.

In general, Spark will play an important role in the data computing ecosystem. If we can use memory, network, disk, and CPU efficiently, Spark will be even better.

V. CONCLUSIONS

In this paper, we provide a general introduction to the core technologies of Spark. Based on the analysis of these core technologies, we propose future trends of Spark technologies and prepare for the next step of our research on memory optimization of Spark.

ACKNOWLEDGMENT

This work is sponsored by the National Natural Science Foundation of P. R. China (No. 61572260, No. 61373017, No. 61572261, No. 61300215, No. 61170243), the Scientific & Technological Support Project of Jiangsu Province (No. BE2015702), the Postdoctoral Research Funding Plan of Jiangsu Province (No. 1302084B), and the China Postdoctoral Science Foundation (No. 2014M560439).

REFERENCES

[1] Chongchitnan S, Silk J. A study of high-order non-Gaussianity with applications to massive clusters and large voids[J]. Astrophysical Journal, 2010, 724(1): 285-295.
[2] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Operating Systems Design & Implementation, 2004, 51(1): 147-152.
[3] Lee Y, Song S. Distributed indexing methods for moving objects based on Spark stream[J]. International Journal of Contents, 2015, 11(1): 69-72.
[4] Solovyev A, Mikheev M, Zhou L, et al. SPARK: a framework for multi-scale agent-based biomedical modeling[J]. International Journal of Agent Technologies & Systems, 2010, 2(3): 1-7.
[5] Zaharia M, Chowdhury M, Franklin M J, et al. Spark: cluster computing with working sets[C]//Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, 2010: 10-10.
[6] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012: 2-2.
[7] Franklin M J. Making sense of big data with the Berkeley Data Analytics Stack[C]//SSDBM, 2013: 1.
[8] Armbrust M, Xin R S, Lian C, et al. Spark SQL: relational data processing in Spark[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1383-1394.
[9] Armbrust M, Xin R S, Lian C, et al. Spark SQL: relational data processing in Spark[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1383-1394.
[10] Conese A. Inferring latent user attributes in streams on multimodal social data using Spark[J]. 2015.
[11] Franklin M. MLlib: a distributed machine learning library[J]. NIPS Machine Learning Open Source Software, 2013.
[12] Biswas S, Poddar S, Dasgupta S, et al. Subspace based low rank and joint sparse matrix recovery[J]. arXiv preprint arXiv:1412.2700, 2014.
[13] Alpaydin E. Introduction to Machine Learning[M]. MIT Press, 2014.
[14] Gronlund S D, Wixted J T, Mickes L. Evaluating eyewitness identification procedures using receiver operating characteristic analysis[J]. Current Directions in Psychological Science, 2014, 23(1): 3-10.
[15] Demler O V, Pencina M J, D'Agostino R B. Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality[J]. Statistics in Medicine, 2011, 30(12): 1410-1418.
[16] Pereira F, Mitchell T, Botvinick M. Machine learning classifiers and fMRI: a tutorial overview[J]. NeuroImage, 2009, 45(1): S199-S209.
[17] Xin R S, Gonzalez J E, Franklin M J, et al. GraphX: a resilient distributed graph system on Spark[C]//First International Workshop on Graph Data Management Experiences and Systems. ACM, 2013: 2.
[18] Langville A N, Meyer C D. Google's PageRank and Beyond: The Science of Search Engine Rankings[M]. Princeton University Press, 2011.
[19] Blei D M. Probabilistic topic models[J]. Communications of the ACM, 2012, 55(4): 77-84.
[20] Gonzalez J E, Xin R S, Dave A, et al. GraphX: graph processing in a distributed dataflow framework[C]//Proceedings of OSDI, 2014: 599-613.
[21] Chakrabarti A. CS85: Data Stream Algorithms, Lecture Notes, Fall 2009.
[22] Sakaki T, Okazaki M, Matsuo Y. Earthquake shakes Twitter users: real-time event detection by social sensors[C]//Proceedings of the 19th International Conference on World Wide Web. ACM, 2010: 851-860.
[23] Zaharia M, Das T, Li H, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters[C]//Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, 2012: 10-10.

[24] Possanzini C, Van Liere P, Roeven H, et al. Scalability and channel independency of the digital broadband dStream architecture[C]//Proc Intl Soc Magn Reson Med, 2011: 1863.
[25] Zhao H, Canny J F. High Performance Machine Learning through Codesign and Rooflining[D]. PhD thesis, EECS Department, University of California, Berkeley, 2014.
[26] Biggs J, Ebmeier S K, Aspinall W P, et al. Global link between deformation and volcanic eruption quantified by satellite imagery[J]. Nature Communications, 2014, 5.
[27] Hu M H, Tu S T, Xuan F Z, et al. On-line structural damage feature extraction based on autoregressive statistical pattern of time series[C]//ASME 2014 Pressure Vessels and Piping Conference. American Society of Mechanical Engineers, 2014: V005T10A009.
[28] Boyle W B, Fallone R M. System and method for optimizing garbage collection in data storage: U.S. Patent 8,706,985[P]. 2014-4-22.
