
2016 IEEE International Conference on Cluster Computing

Smart-MLlib: A High-Performance
Machine-Learning Library
David Siegal, Jia Guo, Gagan Agrawal
Computer Science and Engineering, Ohio State University, Columbus, OH 43210
siegal.8@osu.edu, guo.980@osu.edu, agrawal@cse.ohio-state.edu

Abstract: As the popularity of big data analytics has continued to grow, so has the need for accessible and scalable machine-learning implementations. In recent years, Apache Spark's machine-learning library, MLlib, has been used to fulfill this need. Though Spark outperforms Hadoop, it is not clear that it is the best-performing underlying middleware to support machine-learning implementations. Building on a C++ and MPI based middleware system, in-Situ MApReduce liTe (Smart), we present a machine-learning library prototype (Smart-MLlib). Like MLlib, Smart-MLlib allows machine-learning implementations to be invoked from a Scala program, and with a very similar API. To test our library's performance, we built four machine-learning applications that are also provided in Spark's MLlib: k-means clustering, linear regression, Gaussian mixture models, and support vector machines. On average, we outperformed Spark's MLlib by over 800%. Our library also scaled better than Spark's MLlib for every application tested. Thus, the new machine-learning library enables higher performance than Spark's MLlib without sacrificing the easy-to-use API.

I. INTRODUCTION

Big-data analytics, and the frameworks that support it, are becoming continuously more important in today's world. Corporations, government agencies, and individuals are increasingly turning to big-data analytics to guide their decision-making processes [5], [22]. The trend of data-driven decision making has created a new multi-billion dollar big-data technology and services industry [8]. At the heart of this industry are the numerous distributed-computing technologies that make computation feasible on a massive scale.

Since it was introduced in 2004, one of the most popular technologies used in distributed computing has been the family of implementations of Google's MapReduce paradigm [2]. The programming model supported under this paradigm offers users an easy-to-use API that can be utilized to solve a wide variety of problems. Due to its flexible nature and simple application, implementations of MapReduce are still popular in both industry and academia [12].

Although MapReduce has been very successful, the programming model is focused around linear dataflows. As the demands of big-data analytics grew to include iterative machine-learning applications, a new type of distributed-computing framework was needed. Apache Spark [1] was proposed in 2010 to address this need. Spark not only boasts an easy-to-use Scala interface, but also has built-in support for cyclic dataflows [25]. The notable benefits of Spark have resulted in a surge in its popularity over the past few years. In 2015 alone, Spark received a $300 million investment from IBM and became the most active project, in terms of number of contributors, in the Apache Software Foundation [6].

In part to demonstrate the support for cyclic dataflow in Spark, its makers implemented a full machine-learning library, coined MLlib, on top of the framework. This library not only provides concrete examples of iterative applications in Spark, but also provides an extremely user-friendly API for production-quality machine-learning algorithms. For many of the algorithms, users need less than 20 lines of code to build complex statistical models from stored semi-structured data [20].

Although Spark was designed with iterative algorithms in mind [25], it is unclear whether Spark is the most efficient system for a machine-learning-specific library. For example, a system that offers a comparable API to Spark, the in-Situ MApReduce liTe (Smart) system, has been shown to outperform Spark significantly [23]. Smart outperforms Spark because of several factors, including its use of a higher-performance language (C++ instead of Scala, Java, or Python), its handling of communication internally through MPI, and its use of an API that avoids storage (and sorting/shuffling) of key-value pairs, among others.

Thus, building a machine-learning library that mimics Spark's MLlib on top of Smart could lead to increased performance without compromising the easy-to-use interface. However, there are several challenges in invoking an MPI-based middleware from Scala programs. In this paper, we present a machine-learning library prototype, comparable to Spark's MLlib, built on top of our Smart system. As a demonstration of feasibility, this Smart-MLlib currently consists of four machine-learning algorithms: k-means clustering, linear regression, Gaussian mixture models, and support vector machines. Each algorithm has been implemented in Smart and has a Spark MLlib-inspired Scala interface that is used to run the Smart application.

In addition to presenting the specific Smart-MLlib applications in this paper, we also describe the underlying architecture of our system. The main focus of our architectural discussion revolves around launching Smart's native jobs from within Scala's Java virtual machine (JVM) environment.

We then explain why utilizing the scala.sys.process package and a single intermediate file remains the best way to communicate between Scala and Smart.

Beyond introducing Smart-MLlib and its architecture, we also detail results from testing our system against Spark's MLlib. Through experimentation with inputs ranging from 1GB to 16GB and clusters ranging from 4 nodes to 32 nodes, we show that Smart-MLlib implementations outperform Spark's MLlib implementations for every tested configuration. Specifically, Smart-MLlib outperforms Spark's MLlib by an average of over 800% across all experiments. In addition to these performance results, we also show that Smart-MLlib scales better than Spark's MLlib in almost all cases. On average, Smart-MLlib scales from 4 nodes to 32 nodes between 90% and 220% better than Spark's MLlib for every algorithm tested.
II. BACKGROUND

In this section, we first cover the basic ideas of the MapReduce programming model. Next, we discuss the motivation and an overview of the design of Spark. Third, we give an overview of Smart, our distributed-computing framework. Finally, we highlight the benefits of Spark's machine-learning library, MLlib.

A. MapReduce

In an effort to both improve the accessibility of distributed computing and to simplify distributed-application code, Google released the MapReduce programming model in 2004 [2]. The model provides a very simple API containing only two core functions: map() and reduce(). By implementing these functions, users are able to write applications which are automatically capable of operating on massively distributed systems. This means that the complex details inherent to distributed applications are hidden from the programmer. The simple dataflow of MapReduce, coupled with its straightforward, functional-style API, has made the programming model very popular for a variety of applications [3], [4].
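
To make the two-function contract concrete, the short Scala sketch below shows the user-supplied pieces of a word-count job. It is an illustrative, single-machine rendering of the model; the object and helper names are ours, not part of any particular MapReduce implementation.

    object WordCountSketch {
      // Plays the role of map(): emit (word, 1) for every word in an input line.
      def mapFn(line: String): Seq[(String, Int)] =
        line.split("\\s+").toSeq.filter(_.nonEmpty).map(w => (w, 1))

      // Plays the role of reduce(): sum the counts collected for one word.
      def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
        (word, counts.sum)

      // Single-machine simulation of the map -> shuffle -> reduce dataflow;
      // a real framework distributes these steps across the cluster.
      def wordCount(lines: Seq[String]): Map[String, Int] =
        lines.flatMap(mapFn)
          .groupBy(_._1)
          .map { case (w, pairs) => reduceFn(w, pairs.map(_._2)) }
    }
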
Despite its success, the MapReduce implementations have had performance issues for certain types of algorithms. In particular, iterative algorithms do not perform well within MapReduce's linear programming model [25].

B. Spark

Spark was introduced in 2010 to fulfill the need for a general-purpose, parallel-processing framework with built-in support for nonlinear dataflows [25]. Sticking to the MapReduce style, Spark used the Scala [17] programming language to provide users with a friendly, functional programming feel. Spark specifically focused on two use cases: iterative jobs and interactive analytics [25]. In order to be performant when handling both of these tasks, Spark needed to be capable of holding a working set of data in memory within a distributed environment.

More specifically, Spark used a new dataset abstraction called resilient distributed datasets (RDDs). As defined in its original paper, an RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost [25]. These RDDs gave Spark a distributed and fault-tolerant way to pull massive amounts of data into memory. In addition to defining these new distributed datasets, Spark defined a series of operations on RDDs that supported parallel computation.

RDD operations can be loosely grouped into two categories: transformations and actions [21]. Transformations take an RDD of one type, A, and transform it into an RDD of another type, B, using a user-defined function. Examples of transformations include map(), flatMap(), and filter(). Actions, on the other hand, require an actual computation to be performed. Actions process a particular RDD and produce some type of result. Examples of actions include reduce() and collect(). Both transformations and actions are performed in parallel by Spark. Figure 1 shows this dataflow in Spark. First, data is loaded from the file system into an RDD. After being loaded, a series of transformations are performed on the RDD. Finally, an action is performed and the program is terminated.

Fig. 1. A simplified example dataflow in Spark
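
As a minimal Scala illustration of this dataflow (assuming an already-constructed SparkContext named sc; the input path and threshold are placeholders):

    // Load data from the file system into an RDD (the path is a placeholder).
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(_.toDouble)     // transformation: parse each line
      .filter(_ > 0.0)     // transformation: keep positive values
    // Action: trigger the actual computation and return a result to the driver.
    val total = points.reduce(_ + _)
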

C. Smart

Smart [23] is a parallel-computing framework from Ohio State that has evolved from FREERIDE (FRamework for Rapid Implementation of Data mining Engines) [10] and MATE (Map-reduce with an AlternaTE API) [9]. All of these frameworks expose APIs that are similar to MapReduce's API. While the API is similar to MapReduce's API in many ways, the Smart system processes data in a substantially different way. Instead of map and reduce phases of computation, Smart uses reduction and combination phases. Both of these phases are supported by two map data structures: a combination map and a reduction map. These data structures are composed of user-defined reduction objects that store the accumulated information of relevant input.

More specifically, Smart processes data in the following way. First, the reduction phase takes place. Smart receives a chunk of data and maps it to a specific key. With this key, Smart locates a reduction object in the runtime's reduction map. The chunk of data is then accumulated (or reduced) into the reduction object specified by the key. After all the data has been reduced, the combination phase begins. First, all reduction maps on a node are merged together locally into a single combination map. After the local merge, all of the combination maps are further merged into a final combination map on the master node.

Smart supports iterative algorithms by distributing the final combination map from the master node to every Smart instance between each pass through the data. The information contained in the map is then available to be referenced throughout every data-processing stage. Supporting this flow required the addition of a post-combination procedure that allows users to make any final updates to the master combination map before it is distributed. Furthermore, Smart provides functions to initialize the combination map before the first iteration and to convert the combination map to an output result after the last iteration.

To support all of the functionality described above, Smart provides an API, whose core functions can be seen in Table I. Functions gen_key() and gen_keys() map data chunks to specific reduction objects within the local reduction map. The accumulate() function reduces each data chunk into the reduction objects specified by gen_key/s(). After every data chunk has been reduced, merge() is used to combine all of the reduction maps into a single combination map. Before the combination map is distributed to each Smart instance, post_combine() can be used to perform any extra processing. Figure 2 shows this dataflow graphically. Two functions not shown are process_extra_data(), which initializes the combination map, and convert(), which is used to convert the combination map to an output result.

TABLE I
SMART'S MAIN API FUNCTIONS

virtual int gen_key(const Chunk& chunk, const In* data, const map<int, unique_ptr<RedObj>>& com_map) const
    Generates a single key given the unit chunk (and combination map if necessary).
virtual void gen_keys(const Chunk& chunk, const In* data, vector<int>& keys, const map<int, unique_ptr<RedObj>>& com_map) const
    Generates multiple keys given the unit chunk (and combination map if necessary).
virtual void accumulate(const Chunk& chunk, const In* data, unique_ptr<RedObj>& red_obj) = 0
    Accumulates the unit chunk on a reduction object.
virtual void merge(const RedObj& red_obj, unique_ptr<RedObj>& com_obj) = 0
    Merges the first reduction object into the second reduction object, i.e., a combination object.
virtual void process_extra_data(const void* extra_data, map<int, unique_ptr<RedObj>>& com_map)
    Processes the extra input data to help initialize the combination map if necessary.
virtual void post_combine(map<int, unique_ptr<RedObj>>& com_map)
    Performs post-combination processing and updates the combination map if necessary.
virtual void convert(const RedObj& red_obj, Out* out) const
    Converts a reduction object to an output result if necessary.

Fig. 2. An iterative dataflow in Smart
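
The per-iteration control flow described above can also be summarized by the following Scala-style sketch. It is a conceptual, single-machine rendering of the reduction/combination dataflow rather than Smart's actual C++ API (the real entry points are the virtual functions in Table I), and all names in it are illustrative.

    object SmartFlowSketch {
      // One pass over the data: reduction on each "node", then combination,
      // then post-combination. Generic in chunk type C and reduction-object type R.
      def iteration[C, R](nodes: Seq[Seq[C]],                   // chunks, grouped per node
                          comMap: Map[Int, R],                  // combination map from the last pass
                          genKey: (C, Map[Int, R]) => Int,      // cf. gen_key()
                          accumulate: (C, Option[R]) => R,      // cf. accumulate()
                          merge: (R, R) => R,                   // cf. merge()
                          postCombine: Map[Int, R] => Map[Int, R]): Map[Int, R] = {
        // Reduction phase: each node accumulates its chunks into a local reduction map.
        val localMaps = nodes.map { chunks =>
          chunks.foldLeft(Map.empty[Int, R]) { (m, c) =>
            val k = genKey(c, comMap)
            m.updated(k, accumulate(c, m.get(k)))
          }
        }
        // Combination phase: merge all reduction maps into one combination map
        // (Smart performs the cross-node part of this merge through MPI).
        val combined = localMaps.foldLeft(Map.empty[Int, R]) { (acc, m) =>
          m.foldLeft(acc) { case (a, (k, v)) =>
            a.updated(k, a.get(k).map(merge(_, v)).getOrElse(v))
          }
        }
        // Post-combination: final update before the map is broadcast for the next pass.
        postCombine(combined)
      }
    }
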

D. MLlib

To relieve users of the need to create their own commonly used machine-learning implementations, MLlib was developed in 2013 [13]. MLlib is a production-ready distributed machine-learning library developed on top of Spark. Since Spark's 0.8 release, MLlib has come packaged with Spark and has dramatically simplified machine learning for Spark users. Through utilization of MLlib, users can easily load data into Spark, build models of the data with optimized machine-learning algorithms, and then query those models to extract meaning from the data. Amazingly, all three of these steps can commonly be implemented in less than 30 lines of Scala code.

As of the 1.6.1 release, MLlib contains algorithms for many common machine-learning tasks including classification, regression, collaborative filtering, clustering, and dimensionality reduction [20]. Beyond these conventional machine-learning tasks, MLlib also contains modules for basic statistical calculations. These statistics modules perform tasks ranging from a simple mean calculation to hypothesis testing and random data generation. MLlib has successfully emphasized Spark's ability to gracefully handle iterative algorithms while providing tremendous utility to the big-data community.

III. SYSTEM ARCHITECTURE

Smart-MLlib is a Scala-based API that is used to execute machine-learning algorithms on Smart. Scala is a flexible, high-level language that, most importantly for our discussion, is executed within the Java Virtual Machine (JVM). Smart, on the other hand, is written in C++ with parallelization handled by a combination of OpenMP and MPI. The critical design decisions made while developing our system revolved around connecting a Scala API, which is called from within the JVM, to the natively run Smart system.
Connecting a JVM-based language, like Scala, with a native language, like C++, is not a new problem. In fact, there is a common interface called the Java Native Interface (JNI) [11] that is regularly used for this purpose. By using JNI, native code can be imported directly into JVM-based languages. This allows a C++ function to be seamlessly called from within an executing Scala program. While the JNI appears to be a perfect fit for our machine-learning library, there is a substantial barrier to using this technology. Smart is not just a C++ library, but a C++ library that uses MPI to distribute work across clusters of nodes. MPI requires a runtime environment to be initialized through the use of an mpirun or mpiexec command. While it is theoretically possible to set up this runtime environment programmatically within a C or C++ program, it is not recommended.

Due to the complexity of initializing the MPI runtime environment from within C or C++, we decided not to use JNI to perform the communication between our Scala API and Smart. Instead, we determined that the best way to communicate between Scala and Smart was through the scala.sys.process package [18]. This package allows us to conveniently launch a Smart job from within Scala using the mpiexec command. In doing so, the package removes all of the complexity that surrounds initializing the MPI runtime environment. After deciding on scala.sys.process as the communication method between the Scala API and Smart, the rest of the system design fell into place. The overall flow of the system, which is shown pictorially by Figure 3 and sketched in code below, is as follows:

1) The Scala API is called by the user.
2) The Scala API prepares an mpiexec command complete with all arguments needed by Smart.
3) Using scala.sys.process, the Smart job is executed with the mpiexec command prepared in Step 2. The Scala process blocks until the Smart job terminates.
4) The Smart job executes the desired machine-learning algorithm.
5) Before finishing, the Smart job writes the result of Step 4 (as a model) to disk.
6) The Smart job finishes and the Scala process unblocks.
7) The model produced by Step 5 is immediately read into a Scala object.
8) The Scala API returns the model from Step 7 to the user as a Scala object.

Fig. 3. The dataflow of Smart-MLlib when the API is used by a driver program
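
The Scala fragment below sketches Steps 2, 3, and 7 of this flow; the executable name, host file, paths, and model-file format are placeholders rather than Smart-MLlib's actual internals.

    import scala.sys.process._
    import scala.io.Source

    val numNodes = 4                                // placeholder values
    val dataPath = "/data/points.bin"
    val modelOutputFile = "/tmp/kmeans_model.txt"

    // Step 2: prepare the mpiexec command (arguments are illustrative).
    val cmd = Seq("mpiexec", "-np", numNodes.toString,
                  "-f", "hosts.txt",
                  "./smart_kmeans", dataPath, modelOutputFile)

    // Step 3: launch the Smart job; "!" blocks until the native process
    // exits and returns its exit status.
    val exitStatus = cmd.!

    // Step 7: once the job has finished, read the model Smart wrote to disk.
    val modelLines =
      if (exitStatus == 0) Source.fromFile(modelOutputFile).getLines().toList
      else sys.error(s"Smart job failed with exit status $exitStatus")
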
Architecting the system in the way outlined above provides several advantages over other possible designs. First, calling Smart as an external command, as it would be called without the MLlib wrapper, ensures that all necessary runtime configurations are properly set up. The assurance of proper runtime initialization is particularly convenient, since Smart leverages OpenMP and MPI for parallelization, which makes manual configuration of the runtime environment very complex. Beyond the simplification of launching Smart jobs, the architecture also forces every MLlib algorithm to define a savable model. As shown in Step 5 of Figure 3, this requirement comes from the Smart executable, which finishes only after saving a model to disk. Having a savable model has clear benefits in big-data processing. When jobs can take hours or even days to complete, the ability to save the model produced from these jobs can save a lot of time. Additionally, since the model is saved to persistent storage, if the JVM process crashes after a Smart job terminates, the machine-learning job doesn't need to be executed again.

While the system architecture utilized has several benefits, it also introduces a few disadvantages. The main downside is the system's lack of control over the execution of a Smart application. Since Smart is called as an external process, exception handling and graceful recoveries are very difficult to achieve. The only real usable information Scala gets directly from Smart is the exit status after the Smart job finishes execution.

In addition to the loss of control over the execution of Smart jobs, the architecture also brings additional latency into the system. Writing to and reading from disk is a very slow way to communicate between two processes. That being said, the additional time spent writing and reading a small file to disk is minuscule when compared with the total execution time of the Smart job. Finally, the system architecture does introduce an additional point of failure into the system. Writing and reading from disk can cause failures, and because the writing and reading occur on different processes, it is sometimes difficult to recover from and detect these failures. Fortunately, these failures are infrequent, and many of the issues that do appear can easily be handled in the language in which they occur.

IV. SMART-MLLIB IMPLEMENTATION

In order to establish the feasibility of Smart-MLlib, we implemented four machine-learning algorithms for the library that are also available in Spark's MLlib: k-means clustering, linear regression, Gaussian mixture models, and support vector machines. This section presents each of these algorithms by
giving a brief overview and discussing the Smart implementation. As all four of the APIs are similar, only the k-means API is discussed here.

A. K-Means Clustering

K-means is an unsupervised machine-learning technique used to separate a dataset into k groups such that each group contains similar patterns [19]. The basic k-means algorithm works as follows: 1) the initial k centers are set; 2) each data point in the dataset is assigned to the nearest center; 3) each center recomputes its location as the mean of all data points assigned to it; and 4) Steps 2 and 3 are repeated until a stopping condition is met.

1) Smart's Implementation: To fully describe Smart's k-means implementation, both the reduction object and core API functions need to be defined. The reduction object, ClusterObj, represents a single cluster center. It has two main roles: 1) to maintain the current location of the cluster's center and 2) to accumulate all of the input vectors assigned to the cluster.

The implemented functions, which were introduced in Section II-C, process every data chunk as an input point. First, gen_key() maps each input point to the nearest ClusterObj in the combination map and returns its key. Second, accumulate() is used to accumulate the input vector's components into the reduction object specified by gen_key(). Next, merge() accumulates all the reduction maps produced by accumulate() into a single combination map. The final combination map, which holds one ClusterObj for each cluster in the algorithm, is then updated by post_combine(). The update uses the accumulated vector components and the number of vectors mapped to each reduction object to move each ClusterObj's centroid to the mean of the data points assigned to it. Finally, the updated combination map is distributed to all the Smart instances and the process repeats.
2) Smart-MLlib Interface: The k-means API in Smart-MLlib is currently implemented as a single function. Figure 4 shows the usage of the API and compares it against the corresponding API call using Spark's MLlib. The explanation of all the formal parameters in the Smart-MLlib API can be seen in Table II.

TABLE II
SMART-MLLIB K-MEANS API PARAMETERS

nativeExeContext: A wrapper for the MPI/OpenMP information needed to run the Smart executable.
dataPath: The location of the data file that will be processed.
dataVarName: The variable type of the data to be processed (only relevant to scientific formats, e.g. netcdf).
dimensions: The dimensionality of the data to be processed.
numCentroids: The number of clusters to create for the k-means model.
iterations: The number of iterations for which to run the k-means algorithm.
initialKMeansModel: The initial model (i.e. cluster centers) to use for the k-means algorithm.
modelOutputFile: The file used for the communication between Smart and Scala. Smart writes the model to this file and, subsequently, Scala reads the model from the file.

    val model = KMeans.run(
      nativeExeContext,
      dataPath,
      dataVarName,
      dims,
      k,
      iterations,
      initialModel,
      modelOutputFile)

(a) Smart-MLlib API

    val model = KMeans.train(
      parsedData,
      numClusters,
      iterations,
      KMeans.RANDOM)

(b) Spark's MLlib (singleton-based) API

Fig. 4. Comparison of the k-means API using Smart-MLlib and Spark's MLlib

From examining the table, it is clear that the interface supplies options for declaring the initial k-means model, determining the number of clusters to use (i.e. k), and setting the number of iterations for the algorithm. The parameters unused for these tasks provide general information on the data being processed and the environment Smart is using to execute the distributed algorithm.

The k-means API for Spark's MLlib and Smart-MLlib provides very similar functionality for the basic k-means algorithm. Both interfaces allow the user to easily specify the number of clusters to create and the maximum number of iterations the algorithm should perform. In addition, as depicted in Figure 4, both APIs are used in a similar way. Although the APIs are similar, one slight difference between the two libraries is that Spark's MLlib implementation can pick the initial k centers by randomly selecting k points from the provided dataset; however, Smart-MLlib's version requires the initial centers to be included in the API call. It should also be noted that Spark's MLlib interface provides the ability to utilize a more sophisticated version of k-means called k-means|| [20]. This modified algorithm uses a more intelligent method for selecting the initial k centers, which can dramatically increase the algorithm's convergence speed.

Beyond the additional k-means|| implementation, Spark's MLlib API also provides the ability to specify a convergence condition. Once the convergence condition is met, the algorithm returns the k-means model and is not required to finish any remaining iterations. This feature is not yet supported by Smart and, thus, Smart-MLlib's API does not provide an option for specifying a convergence condition.

B. Linear Regression

Linear regression is a technique used to model the relationship between variables [24]. In the version we discuss, the idea is to determine the linear combination of independent variables
that will best explain a single dependent variable within a dataset. Therefore, for independent variables x_1, x_2, ..., x_n and dependent variable y in Equation 1, linear regression aims to select the constants w_0, w_1, ..., w_n such that the relationship between the independent variables and the dependent variable in the dataset is most accurately captured.

$$ y = w_0 + w_1 x_1 + \dots + w_n x_n \qquad (1) $$

A common way to select these constants, and the method used in this paper, is through the least-mean-square algorithm, which leverages gradient descent [7]. The basic algorithm proceeds as follows: 1) the initial weights are selected; 2) for each sample in the dataset, the weighted error between the actual dependent variable (from the sample) and the guessed dependent variable (from the sample's independent variables and current values of w_0, w_1, ..., w_n) is summed; 3) w_0, w_1, ..., w_n are updated based on the sum of weighted errors computed in the previous step; and 4) Steps 2 and 3 are repeated until convergence.
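
Written out (our notation, with learning rate $\alpha$; the rule is not stated explicitly above), the update that Steps 2 and 3 describe is the batch LMS rule

$$ \hat{y}^{(i)} = w_0 + \sum_{j=1}^{n} w_j x_j^{(i)}, \qquad w_j \leftarrow w_j + \alpha \sum_{i=1}^{m} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)\, x_j^{(i)}, $$

where $m$ is the number of samples, $x_0^{(i)} = 1$ supplies the intercept term $w_0$, and the inner sum is exactly the sum of weighted errors accumulated in Step 2.
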
1) Smart's Implementation: The Smart implementation for linear regression can be fully described by defining both the reduction object and the core API functions for the algorithm. The reduction object in this algorithm, WeightObj, represents the state of a full linear model. More specifically, a WeightObj contains a set of weights that describe a linear function. In addition, WeightObj is responsible for accumulating the number of points processed and the sum of weighted errors detected throughout the linear regression program.

Smart's life-cycle functions process each chunk of data as an (output, input vector) pair. Because this algorithm utilizes a single reduction object, gen_key() returns a constant number. In accumulate(), the input pair is processed and both the number of points processed and the sum of weighted errors, which is based on the input vector and current model weights, are reduced into the reduction object. merge() further accumulates the reduction maps produced by accumulate() into a single combination map. Once finalized, the combination map utilizes the accumulated values inside post_combine() to perform a gradient-descent based update on its weights. Finally, the updated combination map is distributed to all the Smart instances and the process repeats.

C. Gaussian Mixture Model

A Gaussian mixture model (GMM) is a probability distribution that is constructed using a weighted combination of k Gaussian functions. The aim of training a GMM is to modify the weights (i.e. linear coefficients), means, and covariance matrices of the Gaussian functions in order to maximize the likelihood that a particular dataset could be generated by the mixture model [15]. Typically, a GMM is trained through the utilization of the expectation-maximization (EM) algorithm [14], [15].

Within the context of GMM, the EM algorithm works as follows: 1) the initial k Gaussians are selected; 2) the responsibility of each Gaussian for every data point is determined; 3) based on the responsibilities computed in the previous step, the Gaussian weights, means, and covariance matrices are updated; and 4) Steps 2 and 3 are repeated until convergence.
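
Concretely (standard EM notation; these updates are not spelled out above), for component weights $\pi_j$, means $\mu_j$, and covariances $\Sigma_j$, Step 2 computes the responsibilities and Step 3 re-estimates the parameters:

$$ \gamma_{ij} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l\, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}, \qquad n_j = \sum_{i=1}^{N} \gamma_{ij}, $$

$$ \pi_j \leftarrow \frac{n_j}{N}, \qquad \mu_j \leftarrow \frac{1}{n_j} \sum_{i=1}^{N} \gamma_{ij}\, x_i, \qquad \Sigma_j \leftarrow \frac{1}{n_j} \sum_{i=1}^{N} \gamma_{ij}\, (x_i - \mu_j)(x_i - \mu_j)^{\top}. $$

These per-point sums are the kind of responsibility information that the Smart implementation described next accumulates before updating the model.
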
1) Smart's Implementation: Smart's GMM implementation can be described through defining the algorithm's reduction object and core API functions. GMM's reduction object, GMMRedObj, represents a full Gaussian mixture model. This means that it must maintain every Gaussian function's weight, mean, and covariance information. In addition to supplying these items, GMMRedObj is used for accumulating responsibility information for each Gaussian as the data is processed. The logic of the data processing in this GMM implementation is very similar to the traditional EM algorithm but has been adapted to fit Smart's alternative dataflow. Each data chunk, which is viewed as an individual data point, is first processed by the gen_key() function. Since this algorithm only requires a single reduction object, gen_key() simply returns a constant number. For each point processed, accumulate() computes and accumulates a wide range of responsibility information into a reduction object. After accumulate() finishes reducing the input, merge() combines all produced reduction maps into a single combination map. In post_combine(), this final combination map updates its Gaussian mixture model information based on the responsibilities accumulated. Finally, the accumulating data structures are cleared and the process repeats with the updated GMM information.

D. Support Vector Machine

Support vector machines (SVMs) are used to classify data into two groups. Unlike various other types of classifiers that do not make determinations on the "goodness" of a classification (e.g. perceptrons [16]), SVMs attempt to optimally classify datasets [7]. The classification is defined with a hyperplane that separates the dataset into two classes. To find the optimal hyperplane, an objective function is optimized that rewards a large margin of separation between the hyperplane and the dataset and penalizes misclassifications [7], [19].

While complex SVMs exist, we will be focusing on two-class linear SVMs. The gradient-descent based algorithm we use to build these SVMs utilizes a hinge loss function and works as follows: 1) the initial weights for the model are selected; 2) the partial derivatives of the hinge loss function, with respect to the weights, are accumulated for each point; 3) based on the partial derivatives found in the previous step as well as the number of points and iteration number, the weights are updated with a gradient-descent rule; and 4) Steps 2 and 3 are repeated until convergence. For a more comprehensive explanation of the SVM algorithm, please refer to Spark's MLlib guide [20].
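
In symbols (our notation; the exact regularization and step-size schedule follow Spark's MLlib as described in [20]), for labels $y_i \in \{-1, +1\}$ the per-point hinge loss and its subgradient with respect to the weight vector $w$ are

$$ L_i(w) = \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr), \qquad \frac{\partial L_i}{\partial w} = \begin{cases} -\,y_i\, x_i & \text{if } y_i\, w^{\top} x_i < 1, \\ 0 & \text{otherwise,} \end{cases} $$

and Step 3 applies an update of the form $w \leftarrow w - \alpha_t \sum_i \partial L_i / \partial w$, with the accumulated subgradients normalized by the number of points and the step size $\alpha_t$ shrinking with the iteration number $t$.
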
1) Smart's Implementation: The Smart implementation for SVM can be described by defining the algorithm's reduction object and core API functions. The reduction object for this algorithm, GradientObj, uses an array of weights to represent a complete linear model. In addition to providing this model, GradientObj is also responsible for keeping track of the iteration number, the number of input chunks processed, and the sum of the hinge loss function's partial derivatives throughout the execution of the algorithm.

Since this SVM is a binary classifier, each data chunk is interpreted as a (class, input) pair. First, the input pair is sent to the gen_key() function. As this algorithm requires only one reduction object per reduction map, gen_key() always returns a constant number. In accumulate(), the size and partial derivatives, which are based on the input chunk and the combination map's current weights, are accumulated into a reduction object. After all the data has been accumulated, merge() combines the reduction maps produced from accumulate() into a single combination map. Using this master combination map, post_combine() performs a gradient-descent update on the weights of the model using all of the accumulated values. Finally, the accumulating data structures are cleared and the process is repeated with the updated weights.

V. EXPERIMENTAL RESULTS

To benchmark the system against the industry standard, we compare the performance and scalability of Smart-MLlib with the performance and scalability of Spark's MLlib.

A. Environment

Our experiments were all conducted on the same homogeneous, multi-core computing cluster. Specifically, our tests were performed using 4, 8, 16, and 32 node configurations. Each node in the cluster uses two Intel(R) Xeon(R) E5630 processors and contains 12 GB of main memory. The processors have a combined total of eight computing cores that run at a base frequency of 2.53GHz.

For the Smart based experiments, MPI (MPICH hydra version 3.1) was used to communicate between nodes and OpenMP (libgomp-4.4.7) was used for intra-node communication. The Spark experiments, on the other hand, used Spark's standalone cluster for communication. To keep the performance comparisons fair, all tests were conducted with one process per node and eight threads per process. This translates to Smart using one MPI process and eight OpenMP threads per node and Spark using one executor and eight executor cores per node. It should also be noted that all Spark tests use version 1.5.2 of both Spark and MLlib.

B. K-Means Clustering Experiments

For the k-means experiments, the performance of the basic k-means implementation was compared between Smart-MLlib and Spark's MLlib. The algorithm was tested using input sizes of 1GB, 2GB, 4GB, and 16GB and computing clusters of 4, 8, 16, and 32 nodes. The performance reported for each configuration is an average of five independent trials. In all of the tests, k-means was run with four cluster centers for exactly 1000 iterations on 16-dimensional input. Since Spark will stop iterating when a default convergence condition is met, the source code was modified to ensure all iterations actually occurred.

1) Results: The results of all the k-means experiments can be seen in Figure 5. For every configuration tested, the Smart library outperformed Spark's implementation by at least 150%. In the most dramatic case, a 1GB input file was processed by 32 nodes 17 times faster with the Smart-MLlib implementation than with the Spark MLlib version. In addition to outperforming Spark in head-to-head experiments, Smart-MLlib also out-scaled Spark's MLlib. Figure 9a shows this by tracking Smart's speedup over Spark while increasing the number of nodes. Clearly, as all three of the speedups have positive slopes, Smart is becoming faster relative to Spark with each additional node. Averaging over all three input sizes tested in every node configuration, the Smart-MLlib k-means implementation scales 220% better than Spark's between 4 nodes and 32 nodes. As these results are consistent with all other results within this section, please refer to Section V-F for a detailed analysis.

C. Linear Regression Experiments

For the linear regression experiments, the performance of the linear regression implementation on Smart-MLlib and Spark's MLlib was compared. The tests included input sizes of 1GB, 2GB, 4GB, and 16GB and computing cluster sizes of 4, 8, 16, and 32 nodes. As in Section V-B, the results reported for each configuration are an average of five independent trials. In all of the tests, the linear regression processed input with 15 dimensions and 1 output dimension and ran for exactly 1000 iterations. Furthermore, each linear regression model included an intercept term, resulting in a total of 16 weights trained by the algorithm. Note that the Spark source code had to be modified to guarantee all 1000 iterations of the algorithm were completed.

1) Results: The results of the linear regression experiments can be seen in Figure 6. From examining the graphs, it is clear that the Smart-MLlib implementation outperforms Spark's in every configuration. More specifically, the Smart-MLlib version is at least twice as fast as Spark's MLlib implementation in every experiment. In the most extreme case, Smart-MLlib's linear regression performs 15 times faster than Spark's. As with k-means, the Smart-MLlib version also scales better than Spark's. Figure 9b shows this superior scaling graphically. As the number of nodes increases, so does Smart-MLlib's performance relative to Spark's. Averaging over all three input sizes tested in every node configuration, Smart-MLlib's linear regression implementation scales 220% better than Spark's between 4 nodes and 32 nodes. As these results are consistent with all other results in this section, please refer to Section V-F for a detailed analysis.

D. Gaussian Mixture Model Experiments

The Gaussian mixture model (GMM) experiments were conducted to compare the performance of GMM on Smart-MLlib and Spark's MLlib. Tests were carried out using input sizes of 1GB, 2GB, 4GB, and 16GB and cluster sizes of 4, 8, 16, and 32 nodes. The results reported for each configuration are the average of five independent tests. Since GMM takes substantially longer to execute than the other algorithms covered, each trial was only run for 100 iterations using a four-Gaussian model. Furthermore, to mitigate the influence of different linear algebra libraries, an input dimensionality of two was used. Again, it should be noted that the Spark source code had to be modified to guarantee all iterations of the algorithm were completed.
Fig. 5. Performance comparison of k-means on Smart-MLlib and Spark's MLlib: (a) 4-node cluster, (b) 8-node cluster, (c) 16-node cluster, (d) 32-node cluster. Each panel plots total processing time (secs) against input size (GB).

Fig. 6. Performance comparison of linear regression on Smart-MLlib and Spark's MLlib: (a) 4-node cluster, (b) 8-node cluster, (c) 16-node cluster, (d) 32-node cluster. Each panel plots total processing time (secs) against input size (GB).

Fig. 7. Performance comparison of the Gaussian mixture model algorithm on Smart-MLlib and Spark's MLlib: (a) 4-node cluster, (b) 8-node cluster, (c) 16-node cluster, (d) 32-node cluster. Each panel plots total processing time (secs) against input size (GB).

1) Results: The Gaussian mixture model results can be seen in Figure 7. The charts in the figure show that the Smart-MLlib implementation greatly outperformed the Spark implementation for all tested configurations. Interestingly, Smart's performance relative to Spark's was much stronger for this algorithm than for any of the other algorithms presented in the section. Results from k-means, linear regression, and SVM show Smart having roughly a 2 to 15 times advantage over Spark; however, for the GMM tests, this range balloons to a 13 to 54 times advantage.

Smart's large advantage in this algorithm, beyond the factors explained in Section V-F, could be a result of the complex nature of GMM. Spark achieves efficiency by caching intermediate results (i.e. RDDs) in memory and reusing them in every iteration. In complex algorithms like GMM, Spark can be forced to remove cached RDDs to free up memory for execution. These RDDs are later recomputed, but performance suffers. This issue is not present in Smart since our system does not cache intermediate results.

Another interesting aspect of these results, not explained in Section V-F, can be seen in Figure 9c. In all other algorithms presented, the Smart-MLlib implementation scales better than Spark's in all cases. This means that every line segment in the figures depicting scale comparisons has an exclusively positive slope. In Figure 9c, we see a single negatively sloping line segment for both input sizes plotted. Following the one negatively sloping segment within the 1GB input line and the one negatively sloping segment within the 2GB input line, both lines continue trending in a typical positive-slope fashion. Even though the Spark implementation out-scaled Smart's in one situation, on average, Smart-MLlib's GMM implementation scales from 4 nodes to 32 nodes 90% better than Spark's.

The trend in Figure 9c seems to reinforce our view that Spark is experiencing memory strain when executing the GMM algorithm. In both negatively sloping line segments, the improved

scalability occurred as the input size went from 1/4 GB to 1/8 GB per node. We suspect that this decrease in input size allowed each Spark executor to cache one or more additional RDDs, resulting in significantly improved performance. Following this boost from additional memory, the scalability trend returned to one typical of the other algorithms studied in this section.

E. SVM Experiments

For the SVM experiments, the performance of the linear SVM implementation is compared between Smart-MLlib and Spark's MLlib. Tests were conducted using input sizes of 1GB, 2GB, 4GB, and 16GB and cluster sizes of 4, 8, 16, and 32 nodes. As in the other experiments, the result reported for each configuration is an average of five independent trials. The parameters used for the SVM tests closely mirror those used in the linear regression experiments. Each test ran for exactly 1000 iterations on samples with 15 input dimensions and 1 output dimension. Additionally, the SVM model always included an intercept term, so a total of 16 weights were trained by the algorithm in each test. Note that the Spark source code had to be modified to guarantee all iterations of the algorithm were completed.

1) Results: The results of the SVM experiments are shown in Figure 8. As with the previous algorithms, the Smart-MLlib SVM implementation outperformed the Spark MLlib SVM implementation in every configuration. More specifically, Smart-MLlib's version ran 90% to 1100% faster than Spark's. In addition to outperforming Spark, Smart-MLlib also out-scaled Spark's MLlib. Figure 9d shows this graphically through the positively sloped line segments. As the number of nodes increases, Smart-MLlib's SVM performance gets better relative to Spark's MLlib. Averaging over all three input sizes tested in every node configuration, Smart-MLlib's SVM implementation scaled 200% better than Spark's between 4 nodes and 32 nodes. Since these results are consistent with the other results in this section, please refer to Section V-F for a detailed analysis.

F. Analysis and Discussion

All of the results presented in this section show that, for the algorithms discussed, Smart-MLlib performs strictly better than Spark's MLlib. In every configuration tested, the Smart implementation performed at least 90% better than the Spark implementation. Moreover, for the k-means, linear regression, and SVM tests, Smart-MLlib's implementation performed an average of 380% better than Spark's MLlib implementation. If the GMM results are included in that average, the performance multiple grows to 800%.

The performance advantages of Smart result from three key differences between Smart and Spark [23]. First, Spark produces a large amount of intermediate data after map operations that needs to be grouped and reduced by the system. In comparison, Smart performs reductions directly into reduction objects, which removes the extra data creation and the need for grouping. Second, Spark applications create and store many immutable intermediate states (i.e. RDDs) throughout a program's execution. On the other hand, Smart operations all occur directly on reduction maps that can be reused between iterations. Third, Spark relies heavily on network communication for transmitting information, even when that data is being transmitted to the same node. In contrast, Smart leverages the shared-memory environment on each node to reduce network traffic as much as possible.

In addition to outperforming Spark's MLlib, the results of the experiments also show that Smart-MLlib scales better than Spark's MLlib for the algorithms tested. In every experiment, with the exception of a special case discussed in Section V-D1, the more nodes added to the problem, the better Smart's implementation performed relative to Spark's. On average, the Smart-MLlib implementations scaled from 4 to 32 nodes about 2 to 3 times better than the equivalent applications in Spark's MLlib. Interestingly, there were many instances in which Spark's version performed worse when more nodes were added. Usually, Spark's performance worsened when the amount of data being sent to each node dropped below about 128 MB. In these same situations, Smart was able to achieve a speedup by utilizing additional nodes.

The reason Smart scales better than Spark is probably very closely related to the reason it performs faster in general. As described previously, Smart's reduction objects eliminate the need for extra data creation and grouping. Additionally, Smart focuses on minimizing network traffic through utilization of a shared-memory environment on each node. In contrast, Spark does require grouping and relies more heavily on the network for communication. These differences allow Smart to have lower overhead than Spark for additional nodes and make the system scale more efficiently.

VI. CONCLUSION AND FUTURE WORK

The need for accessible and scalable machine-learning implementations is continuously growing. This paper presents a machine-learning library prototype geared to address this need. Smart-MLlib provides an easy-to-use Scala API for distributed machine-learning algorithms that executes on top of Smart. These interfaces are modeled off of Spark's MLlib API, but for all four of the algorithms tested (k-means clustering, linear regression, Gaussian mixture models, and support vector machines) the Smart library dramatically outperformed and out-scaled Spark's version. Since the interfaces used by both libraries are so similar, this implies that users can achieve a performance boost by switching libraries without any real impact on developer effort.

Although the findings presented in this paper look promising, Smart-MLlib is still in its infancy. Spark's MLlib implements dozens of algorithms that have yet to be explored by Smart. To gain a better understanding of how the two systems compare, more MLlib algorithms should be added to Smart-MLlib and performance tested. Furthermore, Smart's functionality must be expanded to support some of the features common in Spark's MLlib implementations (e.g. a convergence criterion that terminates a program once met). These improvements to Smart will allow Smart-MLlib to truly match Spark's MLlib in terms of functionality. Lastly, to seriously contend with Spark's MLlib, a fault-tolerant version of Smart must be developed as well. While previous versions
Fig. 8. Performance comparison of SVM on Smart-MLlib and Spark's MLlib: (a) 4-node cluster, (b) 8-node cluster, (c) 16-node cluster, (d) 32-node cluster. Each panel plots total processing time (secs) against input size (GB).

Fig. 9. Comparison of scalability on Smart-MLlib and Spark's MLlib: (a) K-Means, (b) Linear Regression, (c) GMM, (d) SVM. Each panel plots Smart's speedup over Spark against the number of nodes (4, 8, 16, 32). Note: Smart speedup over Spark is calculated by dividing the execution time of the algorithm on Spark's MLlib by the execution time of the algorithm on Smart-MLlib.
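
Stated as a formula, for a given algorithm, input size, and node count, the plotted quantity is

$$ \text{speedup} = \frac{T_{\text{Spark's MLlib}}}{T_{\text{Smart-MLlib}}}, $$

so values above 1 mean the Smart-MLlib implementation completed the same job in less time.
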

of Smart, such as MATE [9], have implemented a fault-tolerant option, this option is not currently available for Smart.

REFERENCES

[1] Spark. https://spark.apache.org/, 2016. [Online; accessed 25-March-2016].
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Communications of the ACM, 53(1):72-77, 2010.
[4] Dan Gillick, Arlo Faria, and John DeNero. MapReduce: Distributed computing for machine learning. Berkley, Dec, 18, 2006.
[5] Big Data Poses Challenges For Federal Agencies. http://www.informationweek.com/government/big-data-analytics/big-data-poses-challenges-for-federal-agencies/d/d-id/1322525, 2015. [Online; accessed 10-March-2016].
[6] Derrick Harris. Survey shows huge popularity spike for Apache Spark. http://fortune.com/2015/09/25/apache-spark-survey/, 2015. [Online; accessed 19-March-2016].
[7] Simon S Haykin. Neural Networks and Learning Machines, volume 3. Pearson Education, Upper Saddle River, 2009.
[8] New IDC Forecast Sees Worldwide Big Data Technology and Services Market Growing to $48.6 Billion in 2019, Driven by Wide Adoption Across Industries. http://www.idc.com/getdoc.jsp?containerId=prUS40560115, 2015. [Online; accessed 10-March-2016].
[9] Wei Jiang, Vignesh T Ravi, and Gagan Agrawal. A map-reduce system with an alternate API for multi-core environments. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 84-93. IEEE Computer Society, 2010.
[10] Ruoming Jin and Gagan Agrawal. A middleware for developing parallel data mining implementations. In Proceedings of the First SIAM Conference on Data Mining. Citeseer, 2001.
[11] Java Native Interface. http://docs.oracle.com/javase/7/docs/technotes/guides/jni/. [Online; accessed 25-March-2016].
[12] Seema Maitrey and CK Jha. Handling big data efficiently by using map reduce technique. In Computational Intelligence & Communication Technology (CICT), 2015 IEEE International Conference on, pages 703-708. IEEE, 2015.
[13] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: Machine learning in Apache Spark. arXiv preprint arXiv:1505.06807, 2015.
[14] Todd K Moon. The expectation-maximization algorithm. Signal Processing Magazine, IEEE, 13(6):47-60, 1996.
[15] Douglas Reynolds. Gaussian mixture models. Encyclopedia of Biometrics, pages 827-832, 2015.
[16] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[17] Scala. http://www.scala-lang.org/. [Online; accessed 25-March-2016].
[18] Scala.sys.process. http://www.scala-lang.org/api/rc2/scala/sys/process/package.html. [Online; accessed 25-March-2016].
[19] Alex Smola and SVN Vishwanathan. Introduction to Machine Learning. Cambridge University, pages 32-34, 2008.
[20] Machine Learning Library (MLlib) Guide. https://spark.apache.org/docs/latest/mllib-guide.html, 2016. [Online; accessed 25-March-2016].
[21] Spark Programming Guide. http://spark.apache.org/docs/latest/programming-guide.html, 2016. [Online; accessed 5-March-2016].
[22] Jonathan Vanian. More companies willing to spend big bucks on big data technology. http://fortune.com/2015/07/06/companies-willing-spend-big-data-technology/, 2015. [Online; accessed 10-March-2016].
[23] Yi Wang, Gagan Agrawal, Tekin Bicer, and Wei Jiang. Smart: A MapReduce-like framework for in-situ scientific analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 51. ACM, 2015.
[24] Xin Yan. Linear Regression Analysis: Theory and Computing. World Scientific, 2009.
[25] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10:10-10, 2010.
