Using Machine Learning To Optimize Parallelism in Big Data Applications
Abstract

In-memory cluster computing platforms have gained momentum in recent years, due to their ability to analyse large amounts of data in parallel. These platforms are complex and difficult-to-manage environments. In addition, there is a lack of tools to better understand and optimize the platforms that consequently form the backbone of big data infrastructure and technologies. This leads directly to underutilization of available resources and to application failures in such environments. One of the key aspects that can address this problem is the optimization of the task parallelism of applications in these environments. In this paper, we propose a machine learning based method that recommends optimal parameters for task parallelization in big data workloads. By monitoring and gathering metrics at the system and application levels, we are able to find statistical correlations that allow us to characterize and predict the effect of different parallelism settings on performance. These predictions are used to recommend an optimal configuration to users before launching their workloads in the cluster, avoiding possible failures, performance degradation and wastage of resources. We evaluate our method with a benchmark of 15 Spark applications on the Grid5000 testbed. We observe up to a 51% gain in performance when using the recommended parallelism settings. The model is also interpretable and can give the user insights into how different metrics and parameters affect performance.
Keywords: machine learning, Spark, parallelism, big data
2.2. Spark: A large data processing engine

Spark [9] is a fault-tolerant, in-memory data analytics engine. Spark is based on a concept called resilient distributed dataset (RDD). RDDs are immutable, resilient, distributed collections of data structures partitioned across nodes that can reside in memory or on disk. These records of data can be manipulated through transformations (e.g. map or filter). An RDD is lazy in the sense that it is only computed when an action is invoked (e.g. count the number of lines, save as a text file). A Spark application is implemented as a series of actions on these RDDs. When we execute an action over an RDD, a job is triggered. Spark then formulates an execution Directed Acyclic Graph (DAG) whose nodes represent stages. These stages are composed of a number of tasks that run in parallel over chunks of our input data, similar to the MapReduce platform. In Figure 2 we can see a sample DAG that represents a WordCount application.

When we launch a Spark application, an application master (AM) is created. This AM asks the resource manager, YARN in our case, to start the containers in the different machines of the cluster. These containers are also called executors in the Spark framework. The request consists of the number of executors, the number of cores per executor and the memory per executor. We can provide these values through the --num-executors option together with the parameters spark.executor.cores and spark.executor.memory. One unit of spark.executor.cores translates into one task slot. Spark offers a way of dynamically asking for additional executors in the cluster as long as an application has tasks backlogged and resources are available. Executors can also be released by an application if it does not have any running tasks. This feature is called dynamic allocation [10] and it results in better resource utilization. We use this feature in all of our experiments. This means that we do not have to specify the number of executors for an application, as Spark will launch the proper number based on the available resources. The aim of this paper is to find a combination of spark.executor.memory and spark.executor.cores that launches an optimal number of parallel tasks in terms of performance.

3. Studying the impact of parallelism

The possibility of splitting data into chunks and processing each part on different machines in a divide-and-conquer manner makes it possible to analyse large amounts of data in seconds. In Spark, this parallelism is achieved through tasks that run inside executors across the cluster. Users need to choose the memory size and the number of tasks running inside each executor. By default, Spark assumes that each executor has a size of 1GB with one task running inside. This rule of thumb is not optimal, since each application is going to have different requirements, leading to wastage of resources, long running times and possible failures in the execution, among other problems.
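As an illustration, the following is a minimal PySpark sketch (our own, not taken from the paper's artifacts; the application and the 2g/2 values are purely illustrative, not a recommendation) of how these two parameters and dynamic allocation are set when submitting an application:

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("WordCount")
    .set("spark.executor.memory", "2g")   # heap size of each executor JVM
    .set("spark.executor.cores", "2")     # task slots per executor
    # Dynamic allocation, as used in all of the paper's experiments:
    # Spark asks YARN for executors while tasks are backlogged and
    # releases idle ones, so --num-executors is not needed.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
)
sc = SparkContext(conf=conf)

# A WordCount like the DAG in Figure 2: the transformations are lazy,
# and the saveAsTextFile action is what triggers the job.
(sc.textFile("hdfs:///input/text")
   .flatMap(lambda line: line.split())
   .map(lambda word: (word, 1))
   .reduceByKey(lambda a, b: a + b)
   .saveAsTextFile("hdfs:///output/wordcount"))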
spark.executor.memory    spark.executor.cores
512m                     1
1g                       1
2g                       1
2g                       2
2g                       3
3g                       1
3g                       3
3g                       4
4g                       1
4g                       3
4g                       6
6g                       1
In comparison to kMeans, PageRank caches data into memory that grows with every iteration, often filling the heap space. It also shuffles gigabytes of data that have to be sent through the network and buffered in memory. The best configuration is 2g with 2 spark.executor.cores, since it provides the best balance between memory space and number of concurrent tasks. With the default one, the executors are not able to meet the memory requirements and the application crashes with an out-of-memory error. We also observed that with 6g and 1 spark.executor.cores the parallelism is really poor, with only a few tasks running concurrently in each node.

We can draw the conclusion that setting the right configuration keeps the PageRank application from crashing, while for kMeans it brings only a small gain. Even so, this moderate benefit may have a large impact in organizations that run the same jobs repeatedly. Avoiding bad configurations is equally important. Providing a recommendation to the user can help avoid out-of-memory errors, suboptimal running times and the waste of cluster resources caused by constant trial-and-error runs.

Figure 3: Run time in seconds and power consumption in watts for a kMeans App (configurations shown: 1g/1, 4g/6, 512m/1). The default, best and worst configurations are depicted.

Figure 4: Run time in seconds and power consumption in watts for a PageRank App (configurations shown: 1g/1 (failed), 2g/2, 6g/1). The default, best and worst configurations are depicted.

4. Model to find the optimal level of parallelism

As we have seen in the previous section, the right level of parallelism can be set for each application through its spark.executor.memory and spark.executor.cores parameters. These values are supposed to be entered by the user and are then used by the resource manager to allocate JVMs. However, clusters are often seen as black boxes in which it is difficult for the user to perform this process. Our objective is to propose a model that facilitates this task. Several challenges need to be tackled to achieve this objective:

• The act of setting these parallelization parameters cannot be detached from the status of the cluster: Normally, the user sets these parameters without considering the memory and cores available in the cluster at that moment. However, the problem of setting the right executor size is highly dependent on the memory available in each node. We need the current resource status of YARN as a variable when making these decisions.

• Expert knowledge about the application behaviour is needed: This requires monitoring the resources consumed and collecting metrics on both the application and the system side.

• We have to build a model that adapts to the environment we are using: Different machines and technologies give room to different configurations and parallelism settings.

• We have to be able to apply the experience of previous observations to new executions: This simplifies the tuning for the user every time a new application is launched.
To solve these problems we leverage machine learning techniques. Our approach is as follows. First, we use our knowledge about YARN internals to calculate the number of tasks per node that we can concurrently run for different configurations, depending on the available cluster memory. These calculations, together with metrics gathered from previous executions of that application, form the input of the machine learning module. Then, the predictions are used to find the best configuration. In the following subsections we explain how we gather the metrics, how we build a dataset that can be used for this problem and the methodology we use to make the recommendations.

4.1. Gathering the metrics and building new features

Predictive machine learning algorithms normally need as input a dataset with a vector of features x_1, x_2, ..., x_n. In this section we explain which features we use to characterize Spark applications. They can be classified into three groups:

• Parallelism features: These are metrics that describe the concurrent tasks running on the machine and the cluster. Please note that these are different from the parallelization parameters. The parallelization parameters are the spark.executor.cores and spark.executor.memory parameters, which are set by the users, while the parallelism features describe the effect of these parameters on the applications. We elaborate upon these features in Subsection 4.1.1.

• Application features: These are the metrics that describe the status of the execution in Spark. Examples of these features are the number of tasks launched, the bytes shuffled or the proportion of data that was cached.

• System features: These are the metrics that represent the load of the machines involved. Examples include CPU load, number of I/O operations or context switches.

In the following subsections, we explain how we monitor these metrics and how we build additional ones, like the number of task waves spanned by a given configuration or the RDD persistence capabilities of Spark. An exhaustive list of metrics is included as an appendix (see Table A.3), together with a notation table to ease the understanding of this section (Table B.4).

4.1.1. Parallelism features

A Spark stage spans a number of tasks equal to the number of partitions of our RDD. By default, the number of partitions when reading an HDFS file is given by the number of HDFS blocks of that file. It can also be set statically for a given implementation through instructions in the Spark API. Since we want to configure the workload parallelism at launch time, depending on variable factors like the resources available in YARN, we do not consider the API approach. The goal is to calculate, before launching an application, the number of executors, tasks per executor and tasks per machine that will run with a given configuration and the YARN memory available. These metrics are later used by the model to predict the optimal configuration. To formulate this, we use the calculations followed by YARN to size the containers. We need the following variables:

• Filesize (f_size): The number of tasks spanned by Spark is given by the file size of the input file.

• dfs.block.size (b_hdfs): An HDFS parameter that specifies the size of a block in the filesystem. We keep the default value of 128MB.

• yarn.scheduler.minimum-allocation-mb (min_yarn): The minimum size of any container request. Requests under this threshold are rounded up to this value. We use its default value of 1024MB.

• spark.executor.memory (mem_spark): The size of the memory to be used by the executors. This is not the final size of the YARN container, as we will explain later, since some off-heap memory is needed.

• spark.executor.cores (cor_spark): The number of tasks that will run inside each executor.

• yarn.nodemanager.resource.memory-mb (mem_node): The amount of memory that YARN has available in each node.

• spark.yarn.executor.memoryOverhead (over_yarn): The amount of available off-heap memory. By default, its value is 384MB.

• Total available nodes (N_nodes): The number of nodes where we can run our executors.
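As an illustration, here is a rough Python sketch of this container-sizing arithmetic. The exact formulas are not reproduced in this extract, so the sketch rests on our assumption that YARN rounds each container request (spark.executor.memory plus the off-heap overhead) up to a multiple of the minimum allocation:

import math

def parallelism_features(fsize_mb, bhdfs_mb=128, mem_spark_mb=1024,
                         cor_spark=1, mem_node_mb=21504, n_nodes=5,
                         over_yarn_mb=384, min_yarn_mb=1024):
    """Sketch of the YARN container-sizing arithmetic (assumed, see above).

    Defaults mirror the paper's setup: 21GB per nodemanager, five nodes,
    and the default executor of 1g with one core.
    """
    # Container request, then rounded up by YARN to a multiple of
    # yarn.scheduler.minimum-allocation-mb.
    request_mb = mem_spark_mb + over_yarn_mb
    container_mb = math.ceil(request_mb / min_yarn_mb) * min_yarn_mb

    # How many executors fit in the memory YARN manages on one node.
    executors_per_node = mem_node_mb // container_mb
    tasks_per_node = executors_per_node * cor_spark

    # Spark spawns one task per HDFS block; tasks then run in "waves".
    total_tasks = math.ceil(fsize_mb / bhdfs_mb)
    cluster_slots = tasks_per_node * n_nodes
    waves = math.ceil(total_tasks / max(cluster_slots, 1))
    return {"executors_per_node": executors_per_node,
            "tasks_per_node": tasks_per_node,
            "total_tasks": total_tasks,
            "task_waves": waves}

# Example: the default 1g/1-core executor with an 8GB input file.
print(parallelism_features(fsize_mb=8 * 1024))

With these defaults the 1g request grows to a 2048MB container, so ten executors (and ten task slots) fit per node, and the 64 tasks of an 8GB file run in two waves across the five nodes.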
This vector describes whether the stage processes data that is persisted in memory or on disk, or whether it was not on that node at all and had to be read through the network. We also use the HDFS variable to know if the stage reads its data from the distributed filesystem. […]

Figure 8: Changes in the metrics for a stage of a support vector machine implementation in Spark. Note how the CPU metrics change depending on the number of tasks per node. Garbage collection and the deserialize time of a stage also increase with this number. (The figure compares application metrics for executions with 40, 81 and 128 tasks.)
Algorithm 1 Building the training dataset for the machine learning algorithm

procedure buildDataset
    stages = {(X_app, X_system, X_parallelism, Y_duration)_n}    ▷ The stage executions that we have
    dataset = ∅
    for all s ∈ stages do
        B = { x ∈ stages | x = s }    ▷ All the stages that are the same as s
        new = s[X_app, X_system, X_parallelism_ref] × B[X_parallelism_run, Y_duration]
        add new to dataset
    end for
    return dataset
end procedure
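A compact way to picture Algorithm 1 is as a cross join restricted to executions of the same stage. The following pandas sketch (column names are our assumptions, not the paper's) pairs the application/system metrics and reference parallelism features of each stage execution with the run parallelism features and observed duration of every execution of that same stage:

import pandas as pd

def build_dataset(stages: pd.DataFrame,
                  app_cols, sys_cols, par_cols) -> pd.DataFrame:
    """Sketch of Algorithm 1. `stages` holds one row per stage
    execution, with a `stage_id` shared by executions of the same
    stage and a `duration` column for the outcome."""
    ref = stages[["stage_id"] + app_cols + sys_cols + par_cols].rename(
        columns={c: c + "_ref" for c in par_cols})
    run = stages[["stage_id"] + par_cols + ["duration"]].rename(
        columns={c: c + "_run" for c in par_cols})
    # Merging on stage_id yields the Cartesian product of rows
    # within each stage group, i.e. s[...] x B[...] in Algorithm 1.
    return ref.merge(run, on="stage_id")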
[…] the mean absolute error aggregates the absolute errors between predicted and true values over all the data points. MAE allows us to evaluate how close the different predictors are to the real values. We do not include the SGD Regressor, since its MAE is simply too high to compare with the other methods. As we can see, boosted regression trees have the best accuracy amongst the seven methods, and also the lowest variance. In addition, decision trees are interpretable. This means that we will be able to quantify the effect of any of the features of the model on the overall performance of the application.

Figure: 10-fold CV evaluation of the different estimators (MAE, ×10^5).
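For reference, this kind of 10-fold cross-validation comparison can be reproduced with scikit-learn, which the authors use for training. A minimal sketch, with synthetic data standing in for the real stage dataset:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data; in the paper X and y are the stage features
# and the observed stage durations.
X, y = make_regression(n_samples=500, n_features=20, random_state=0)

def evaluate_estimator(est, X, y):
    # scoring is negated so that higher is better; we flip the sign
    # back to report a plain MAE, as in [16].
    scores = cross_val_score(est, X, y, cv=10,
                             scoring="neg_mean_absolute_error")
    return -scores.mean(), scores.std()

# Boosted regression trees, the best-scoring of the compared methods.
mae, std = evaluate_estimator(GradientBoostingRegressor(), X, y)
print(mae, std)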
Figure 10: An example with a Sort App, where we iterate through a configuration list. Notice the different parallelism features: {p_1, ..., p_n}_ref represent the parallelism conditions under which the metrics were collected, {p_1, ..., p_n}_run the parallelism conditions under which the application ran for that execution, and {p_1, ..., p_n}_new the new parallelism conditions that we want to try and that we pass to the machine learning module. The rest of the metrics are kept constant. Then we just have to group by configuration and sum up the predictions of each of the stages to know which configuration is best.
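The selection step depicted in Figure 10 boils down to scoring each candidate configuration by the sum of its predicted stage durations. A sketch (function and variable names are ours, not the paper's):

import numpy as np

def recommend_configuration(model, stage_rows, candidate_configs,
                            make_features):
    """For each candidate configuration, overwrite the parallelism
    features of every stage of the application, predict each stage's
    duration and sum them; the lowest predicted total wins.
    `make_features` is an assumed helper that builds the feature
    vector for one stage under one configuration."""
    best_conf, best_total = None, np.inf
    for conf in candidate_configs:            # e.g. ("2g", 2)
        X = [make_features(stage, conf) for stage in stage_rows]
        total = model.predict(np.asarray(X)).sum()
        if total < best_total:
            best_conf, best_total = conf, total
    return best_conf, best_total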
The only configurations that were able to complete successfully in our experiments were the ones with 4GB and 1 core and with 6GB and 1 core. The reason is that graph processing applications are resource hungry and iterative by nature. This can put a lot of pressure on memory management, especially when we cache RDDs, as in these kinds of implementations. In this particular example, the 8GB file we initially read from HDFS grows to a 17.4GB cached RDD in the final stages of the application.

Thus, if we choose an insufficient memory setting, like the default 1GB configuration, the application will not finish successfully and we will have neither the complete number of stages nor the metrics for that application. Incomplete information means that the next configuration recommended could be non-optimal, drawing us into a loop where the application continuously crashes, as depicted in Figure 11.

Figure 11: A deadlock situation in which the same configuration is recommended continuously. The application crashes with the default configuration. Based on whichever stages executed successfully, the next recommendation also crashes. If we do not get past the crashing point, the recommendation will always be the same.

To solve this, we use the approach explained in Algorithm 3. When a user launches an application, we check if there are any successful executions for it. If we cannot find any, then we retrieve the furthest stage the application got to. This can be considered the point where the execution could not continue. For that stage we find the k nearest neighbours from all the stages that we have seen so far. These neighbouring stages belong to a set of applications. Then we follow a conservative approach where we choose, amongst these applications, the configuration that resulted in the largest number of completed stages. The intuition behind this is that a configuration with a high number of completed stages brings better stability to the application. With this, we aim to achieve a complete execution of the application and to collect the features of all its stages. If the user executes the same application in the future, we can use this complete information together with the standard boosted gradient model to recommend a new configuration. By trying different values for the number of neighbours k, we found that 3 is a good value for this parameter. Going above this value can retrieve too many diverse stages, which could make the application crash again. Going below it can take us to the same deadlock depicted in Figure 11. Remember that the objective is not to find an optimal configuration, but a conservative configuration that enables a complete execution without failures.

The kNeighbours approach uses a K-D tree search, whose complexity is O(log N), with N being the number of data points.

4.5. The recommendation workflow

Now that we have developed the models with which we make recommendations, we can connect everything together and describe the workflow we follow to find the best parallelism settings. When a user wants to execute an application, we first check if we have any information about it in our database of previously seen applications. If we do not have it, we execute it with the default configuration. This execution gives us the metrics we need to tune the application in future executions. If we do have information about that application, we check whether it completed successfully or not:

• In case it did not, we apply the procedure of Algorithm 3 for crashed executions.

• In case it did, we apply Algorithm 2 with the boosted regression trees model.

Whatever the case is, the monitoring system adds the metrics for that run to our database so they can be used for future executions.
Algorithm 3 Finding a configuration for an application that failed

procedure predictConfKNeighbours(app)
    S = { s ∈ stages | s ∈ DAG(app) ∧ s[status] = failed }
    x = lastExecutedStage ∈ S
    neighbours = find3Neighbours(x, stages)
    res = { n ∈ neighbours | n.stagesCompleted = max(n.stagesCompleted) }
    return res.configuration
end procedure
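A sketch of this fallback with scikit-learn follows. NearestNeighbors can be backed by a K-D tree, which matches the O(log N) search mentioned above; the record layout and helper names are our assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def predict_conf_kneighbours(failed_stage_features, past_stages,
                             past_features, k=3):
    """Sketch of Algorithm 3. `past_features` is the feature matrix of
    all stages seen so far; `past_stages` holds the matching records,
    each carrying the configuration it ran with and how many stages
    its application completed."""
    nn = NearestNeighbors(n_neighbors=k).fit(past_features)
    _, idx = nn.kneighbors(
        np.asarray(failed_stage_features).reshape(1, -1))
    neighbours = [past_stages[i] for i in idx[0]]
    # Conservative choice: the configuration whose application
    # completed the most stages, aiming to finish, not to be optimal.
    best = max(neighbours, key=lambda s: s["stages_completed"])
    return best["configuration"]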
Applications in a cluster are recurring [17]. This means that the precision of the model will increase over time, since we will have more data points. We assume the machine learning algorithm has been trained beforehand, and we consider when and how to retrain it out of the scope of this paper and as future work. The final algorithm for the process is depicted in Algorithm 4.

5. Evaluation

We perform an offline evaluation of the model with the traces we obtained from a series of experiments on Grid'5000.

Cluster setup: Six nodes of the Adonis cluster in the Grid'5000 testbed. These machines have two Intel Xeon E5520 processors with four cores each, 24GB of memory, 1 Gigabit Ethernet, one InfiniBand 40G card and a single 250GB SATA hard disk. The operating system installed on the nodes is Debian 3.2.68, and the software versions are Hadoop 2.6.0 and Spark 1.6.2. We configure the Hadoop cluster with six nodes working as datanodes, one of them as master node and the other five as nodemanagers. Each nodemanager has 21GB of memory available for containers. We set the value of vcores to the same number as physical cores. We build a benchmark that is a combination of Spark-bench [18], BigDataBench [19] and some workloads that we implemented on our own. The latter were implemented to have a broader variety of stages, and each one has a different nature:

• RDDRelation: Here we use Spark SQL and the DataFrames API. We read a file that has pairs of (key, value). First we count the number of keys using the RDD API and then using the DataFrame API with a select count(*) type of query. After that, we do a select from query over a range of values and store the data in the Parquet data format (https://parquet.apache.org/, last accessed Jan 2017). This workflow consists of four stages in total.

• NGramsExample: We read a text file and construct a dataframe where each row is a line of the text. Then we separate these lines into words and calculate the NGrams=2 for each line. Finally, these n-grams are saved in a text file in HDFS. This gives us one stage in total, since it is all a matter of mapping the values and there are no shuffles involved.

• GroupByTest: Here we do not read any data from disk; we rather generate an RDD with pairs of (random key, value). The keys are generated inside a fixed range, so there will be duplicate keys. The idea is to group by key and then count the occurrences of each key. The number of pairs, the number of partitions of the RDD and, consequently, the number of tasks can be changed through the input parameters of the application. This gives us two stages, as in a typical GroupBy operation (see the sketch after this subsection).

We run all the applications of this unified benchmark with three different file sizes, as we want to check how the model reacts when executing the same application with different amounts of data. Some implementations, like the machine learning ones, expect a series of hyperparameters. These are kept constant across executions and their effect is left out of the scope of this paper. We also try different values of spark.executor.memory and spark.executor.cores for each application and data size combination. These executions generate a series of traces that we use to train and evaluate our model offline. Since the usefulness of our approach is that it can predict the performance of new, unseen applications, we train the model by leaving some apps out of the training set. Those […]
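As an illustration of the GroupByTest workload described above, a minimal PySpark sketch could look like the following (the parameter values are placeholders for the application's input arguments):

import random
from pyspark import SparkContext

sc = SparkContext(appName="GroupByTest")

# Illustrative parameters; in the benchmark they are given as
# input arguments to the application.
num_pairs, num_partitions, key_range = 1000000, 50, 1000

# Stage 1: generate (random key, value) pairs; duplicate keys are
# guaranteed because keys are drawn from a fixed range.
pairs = sc.parallelize(range(num_pairs), num_partitions) \
          .map(lambda i: (random.randrange(key_range), i))

# Stage 2: the group-by forces a shuffle, giving the second stage;
# we then count the values grouped under each key.
counts = pairs.groupByKey().mapValues(len).collect()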
Algorithm 4 Providing an optimal configuration for an application and a given input file size

procedure configurationForApplication(application, filesize)
    A = { apps ∈ appDB | apps.name = application }
    if A = ∅ then
        conf = (1g, 1)
    else if A ≠ ∅ then
        if (∀app ∈ A | app.status = failed) then
            conf = predictConfKNeighbours(app)
        else
            conf = predictConfBoosted(app, filesize)
        end if
    end if
    submitToSpark(app, conf)
    add monitored features to stageDB and appDB
end procedure
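Algorithm 4's dispatch logic is straightforward to mirror in code. A sketch, where the three injected callables stand in for Algorithm 3, Algorithm 2 and the Spark submission step (none of which are specified here):

def configuration_for_application(application, filesize, app_db,
                                  predict_failed, predict_boosted,
                                  submit):
    """Sketch of Algorithm 4 (record layout is assumed)."""
    previous = [a for a in app_db if a["name"] == application]
    if not previous:
        conf = ("1g", 1)                     # no history: default config
    elif all(a["status"] == "failed" for a in previous):
        conf = predict_failed(previous[-1])              # Algorithm 3
    else:
        conf = predict_boosted(previous[-1], filesize)   # Algorithm 2
    submit(application, conf)
    # Monitored features are added to the stage/app databases afterwards.
    return conf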
[…] has 10206 points in the test set. Note that this process was run on a laptop, but it could be run on a node of the cluster with less load and more computation power, like the master node. We must also consider that we trained the model with scikit-learn, but there are distributed implementations of boosted regression trees in MLlib for Spark that can speed up the process. We leave as future work when and how the model should be trained.

Figure 13: Duration of training the model depending on the number of data points. For the whole dataset of 85664 points the duration is 1536 seconds.

5.2. Accuracy of predicted times

Now we evaluate the accuracy of the predicted durations. As we mentioned earlier, we built the training and the test set by splitting the benchmark into two different groups. First, we start by evaluating the duration predictions for different stages. To do that, we take all the data points for a stage […]
Figure: Overhead of the predictions (duration in seconds).

[…] applications recurrently with different file sizes. If the application has not been launched before, the recommendation will be the default one. After that first execution, the system will have monitored metrics for that application. We can start recommending […]

Figure 15: Predictions for the stages of different apps with the highest duration and thus the highest impact inside their applications. For Shortest Path the predictions are quite accurate. For SVM and Logistic Regression the trend of the predicted values follows the real ones. (Each panel iterates over the memory/cores configurations 1g/1, 2g/1, 2g/2, 2g/3, 3g/1, 3g/4, 4g/1, 4g/3, 6g/1, comparing real against predicted duration.)
[…] the accuracy. As an example, we keep executing the shortest path application, first with 11GB and then with 8GB, until the model converges to an optimal solution. The evolution can be seen in Figure 18. For the 11GB execution, the first point is the duration for the neighbours approach. After that, the execution recommended by the boosted regression model fails, but the model goes back to an optimal prediction in the next iteration. In the 8GB case, it again goes down from the conservative first recommendation to a nearly optimal one. This is an added value of the model: the system becomes more intelligent with every new execution and helps the user make better decisions about parallelism settings.

Finally, we include in Figure 19 the time that it takes to find an optimal configuration for each application. For the boosted decision trees approach, the average overhead is 0.3 seconds, and it increases slightly for those applications that have more stages, like logistic regression. We consider this negligible, since a 0.3 second delay will not be noticeable by the user. The kNeighbours approach is faster, but it is only applied to those applications that crash with the default execution of 1g/1 core.

5.4. Interpretability of the model

Another advantage of decision trees is that they are interpretable. That means that we can evaluate the impact of certain features on the outcome variable (duration in our case). One of the possible applications is to explain how the number of tasks per node affects a given workload. This information can be valuable when the user wants to manually choose a configuration. For example, in Figure 20 we see the partial dependence for three applications. Shortest Paths is a graph processing application and, as we saw earlier, it benefits from a low number of tasks per node. The kMeans sweet spot seems to be around the default 10-12 tasks per node. For PCA, the points between 0 and 5 correspond to stages with only one or a few tasks, something very common in machine learning implementations. Again, everything from 8 to 15 is a good configuration, while more than that is counterproductive. Also note how the partial dependence shows us the degree to which parallelism affects a workload: in Shortest Paths it ranges from 2000 to -4000, while in kMeans it goes from 100 to -100. Indeed, as we saw in the experiments of the motivation section, the benefit of parallelism is more obvious in graph applications than in kMeans. This feature of a decision tree proves to be useful and can help the user understand how different parameters affect their workloads.

6. Related Work

There are several areas of research that are related to our work:

Using historical data to predict the outcome of an application: Several lines of research try to optimize different aspects of applications. In [20], the authors present the Active Harmony tuning system. It was born from the necessity to automatically tune scientific workflows and e-commerce applications. The challenge lies in choosing an optimal parametrization from a large search space. To […]
Figure 16: Executions for the non-graph processing applications of the test set. Best shows the lowest possible latency that we could get for that application. Predicted is the configuration proposed by our model. Default is the run time for a configuration of 1g and 1 task slot per executor. Worst is the highest run time amongst the executions that finished successfully.
Figure 17: Executions for the graph processing applications of the test set (8GB and 11GB inputs). Best shows the lowest possible latency that we could get for that application. Predicted is the configuration proposed by our model. Worst is the worst running time. All the default configurations crashed for these kinds of applications.

Figure 18: Evolution of the predictions for a recurrently executed shortest path application (x-axis: the initial kNeighbours recommendation followed by successive boosted recommendations; the 8GB and 11GB optimal times are marked). From the conservative kNeighbours approach, we move to nearly optimal execution times. Note how the first prediction of the boosted model results in an OOM error.
Figure 19: Time to calculate the recommended configuration for each application, comparing the BoostedTrees and kNeighbours approaches (duration in seconds).

[…] we can speed up applications or any configuration parameters. Multilayer neural networks were used in [23] to predict execution times for the parallel application SMG2000. It explains a typical machine learning process where data from different execu[tions] […]

Figure 20: Partial dependence for different applications (x-axis: tasks per node). Shortest Paths works better with a low number of tasks per node. The best setting for kMeans is 12 tasks per node. For PCA, the points between 0 and 5 correspond to stages with only one or a few tasks, while with many tasks 7 is a good configuration.
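Partial-dependence curves like those in Figure 20 can be obtained directly from a fitted gradient-boosting model. A sketch with the modern scikit-learn API, which postdates the paper; the toy data stands in for the real stage features, with column 0 assumed to hold the tasks-per-node feature:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Placeholder data: feature 0 plays the role of tasks per node, and
# the toy target peaks around 12 tasks per node, echoing kMeans.
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(200, 5))
y = -((X[:, 0] - 12) ** 2) + rng.normal(size=200)

model = GradientBoostingRegressor().fit(X, y)
# Plot the marginal effect of the tasks-per-node feature on the
# predicted outcome, averaging over the other features.
PartialDependenceDisplay.from_estimator(model, X, features=[0])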
[…] hill climbing algorithm. The tuning is made at the task level. This system is oriented to MapReduce and needs test runs of the tasks to converge to the optimal solution. Cheng et al. [29] use genetic algorithms to configure MapReduce tasks in each node, taking into account cluster heterogeneity. The parameters are changed depending on how well a parameter value works in a machine. However, it does not work well for short jobs, since it needs time to converge to a good solution.

Optimizing and predicting performance of Spark: Spark is a relatively new technology and its popularity is on the rise. Tuning its performance is an important concern of the community, and yet there is not much related work. In [30], the authors present, to the best of our knowledge, the only Apache Spark prediction model. Again, sampling the application with a smaller data size is used to get statistics about the duration of the tasks, which are plugged into a formula that gives an estimation of the total run time for a different file size. However, it does not consider tuning any parameters or their effect on duration. MEMTune [31] is able to determine the memory of Spark's executors by dynamically changing the size of both the JVM and the RDD cache. It also prefetches data that is going to be used in future stages and evicts data blocks that are not going to be used. However, the method is an iterative process that takes some time to achieve the best result, and it does not work well with small jobs. It does not learn from previous executions either. In [32], the authors perform several runs of a benchmark and iterate through the different parameters of Spark to determine which ones are of most importance. With the experience acquired from these executions, they build a block diagram through a trial-and-error approach over different parameters. The obvious drawback is that we have to execute the application several times and follow the process every time the input size changes. Tuning garbage collection has also been considered in [33]. The authors analyse in depth the logs of JVM usage and come to the conclusion that the G1GC implementation improves execution times. Finally, in [34], a new model for shuffling data across nodes is proposed. The old model of creating a file in each map task for each reduce task was too expensive in terms of IO. Their solution is to make the map tasks running on the same core write to the same set of files for each reduce task.

Task contention: One of the objectives of our model is to detect task contention and act accordingly, by increasing or decreasing the number of tasks. This topic has also been explored in [35]. Bubble-Up proposes a methodology to evaluate the pressure that a given workload generates on the system and the pressure that this same workload can tolerate. With this characterization, better co-location of tasks can be achieved. Nonetheless, this requires running the stress benchmark for every new application that we launch. Paragon [36] takes into account both the heterogeneity of machines and the contention between tasks when scheduling. By using SVD, it can calculate performance scores for a task executed on a given machine configuration and co-located with another workload. As in the previous references, it needs benchmarking each time a different workload comes in. Moreover, we are not trying to create a scheduler, but to assist the user in using big data platforms.
In [37], the authors propose another scheduler, based on the assumption that the resource demands of the different stages can be known in advance. By considering the maximum amount of resources an executor will use, Prophet can schedule JVMs on different machines to avoid over-allocation or fragmentation of resources. This results in a better utilization of the cluster and less interference between applications. This work again assumes that the user already knows the parallelism settings for the application and tries to optimize from there. Our work helps the user to choose the right configuration before sending the application to the cluster manager and also gives an interpretation of this decision.

7. Conclusions and Future Work

The results of this paper show that it is possible to accurately predict the execution time of a big data application with different file sizes and parallelism settings using the right models. This is a first step on a longer path towards much higher control of big data analytics processes. It is necessary to reduce complexity and allow centralized management of environments with multiple technologies. To achieve this, the next generation of big data infrastructure solutions needs to be vendor-agnostic and user friendly. In addition, processes, monitoring and incident response must be increasingly automated to boost speed and eliminate human error, thereby increasing system uptime.

Regarding our research plans, we aim at applying these concepts to a heterogeneous environment with different machine specifications. This could be done by modeling at the task level instead of the stage level. We also want to consider choosing these parallelism settings when there are already other applications running in the cluster, and modeling their interference. We also plan to focus on mitigating issues related to data skew, where some executors receive more shuffled data than others, causing out-of-memory errors if the JVM is not sized correctly. As the IT arena expands and the execution of large workloads paves the way for it, it is important to develop concrete methods to reduce resource wastage and optimize utilization at every stage of execution.

All in all, both industry and academia need to continue collaborating to create solutions that make it possible to manage the lifecycle of applications, data and services. In particular, new solutions will be necessary that simplify the management of big data technologies and generate reliable and efficient root cause analysis mechanisms to understand the health of big data systems and to optimize their performance.

Acknowledgment

This work is part of the BigStorage project, supported by the European Commission under the Marie Sklodowska-Curie Actions (H2020-MSCA-ITN-2014-642963). Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).

References

[1] A. Nadkarni, D. Vesset, Worldwide Big Data Technology and Services Forecast, 2016–2020, International Data Corporation (IDC).
[2] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (1) (2008) 107–113.
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, HotCloud 10 (2010) 10–10.
[4] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, et al., The Stratosphere platform for big data analytics, The VLDB Journal 23 (6) (2014) 939–964.
[5] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al., Apache Hadoop YARN: Yet another resource negotiator, in: Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, 2013, p. 5.
[6] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, I. Stoica, Mesos: A platform for fine-grained resource sharing in the data center, in: NSDI, Vol. 11, 2011, pp. 22–22.
[7] D. Balouek, A. Carpen Amarie, G. Charrier, F. Desprez, E. Jeannot, E. Jeanvoine, A. Lèbre, D. Margery, N. Niclausse, L. Nussbaum, O. Richard, C. Pérez, F. Quesnel, C. Rohr, L. Sarzyniec, Adding virtualization capabilities to the Grid'5000 testbed, in: I. Ivanov, M. Sinderen, F. Leymann, T. Shan (Eds.), Cloud Computing and Services Science, Vol. 367 of Communications in Computer and Information Science, Springer International Publishing, 2013, pp. 3–20. doi:10.1007/978-3-319-04519-1_1.
[8] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, I. Stoica, Dominant resource fairness: Fair allocation of multiple resource types, in: NSDI, Vol. 11, 2011, pp. 24–24.
[9] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, 2012, pp. 2–2.
[10] Dynamic allocation in Spark, http://spark.apache.org/docs/latest/job-scheduling.html.
[11] Electrical consumption on the Grid'5000 Lyon site, https://intranet.grid5000.fr/supervision/lyon/wattmetre/.
[12] Managing CPU resources, https://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.
[13] J. Montes, A. Sánchez, B. Memishi, M. S. Pérez, G. Antoniu, GMonE: A complete approach to cloud monitoring, Future Generation Computer Systems 29 (8) (2013) 2026–2040.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[15] R. Kohavi, et al., A study of cross-validation and bootstrap for accuracy estimation and model selection, in: IJCAI, Vol. 14, 1995, pp. 1137–1145.
[16] C. J. Willmott, K. Matsuura, Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance, Climate Research 30 (1) (2005) 79–82.
[17] N. Bruno, S. Agarwal, S. Kandula, B. Shi, M.-C. Wu, J. Zhou, Recurring job optimization in Scope, in: Proceedings of the 2012 International Conference on Management of Data (SIGMOD '12), 2012, p. 805. doi:10.1145/2213836.2213959.
[18] M. Li, J. Tan, Y. Wang, L. Zhang, V. Salapura, SparkBench: a comprehensive benchmarking suite for the in-memory data analytic platform Spark, in: Proceedings of the 12th ACM International Conference on Computing Frontiers, ACM, 2015, p. 53.
[19] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, et al., BigDataBench: A big data benchmark suite from internet services, in: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2014, pp. 488–499.
[20] I. H. Chung, J. K. Hollingsworth, Using information from prior runs to improve automated tuning systems, in: Proceedings of the ACM/IEEE SC 2004 Conference, 2004. doi:10.1109/SC.2004.65.
[21] A. Jayakumar, P. Murali, S. Vadhiyar, Matching application signatures for performance predictions using a single execution, in: 2015 IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), 2015, pp. 1161–1170. doi:10.1109/IPDPS.2015.20.
[22] T. Miu, P. Missier, Predicting the execution time of workflow activities based on their input features, in: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC), 2012, pp. 64–72. doi:10.1109/SC.Companion.2012.21.
[23] E. Ipek, B. R. De Supinski, M. Schulz, S. A. McKee, An approach to performance prediction for parallel applications, in: European Conference on Parallel Processing, Springer, 2005, pp. 196–205.
[24] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, S. Babu, Starfish: A self-tuning system for big data analytics, in: CIDR, Vol. 11, 2011, pp. 261–272.
[25] A. D. Popescu, V. Ercegovac, A. Balmin, M. Branco, A. Ailamaki, Same queries, different data: Can we predict runtime performance?, in: 2012 IEEE 28th International Conference on Data Engineering Workshops (ICDEW), 2012, pp. 275–280. doi:10.1109/ICDEW.2012.66.
[26] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, D. Green, ALOJA-ML: A framework for automating characterization and knowledge discovery in Hadoop deployments, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1701–1710. doi:10.1145/2783258.2788600.
[27] P. Lama, X. Zhou, AROMA: Automated resource allocation and configuration of MapReduce environment in the cloud, in: Proceedings of the 9th International Conference on Autonomic Computing, ACM, 2012, pp. 63–72.
[28] M. Li, L. Zeng, S. Meng, J. Tan, L. Zhang, A. R. Butt, N. Fuller, MROnline: MapReduce online performance tuning, in: Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, ACM, 2014, pp. 165–176.
[29] D. Cheng, J. Rao, Y. Guo, X. Zhou, Improving MapReduce performance in heterogeneous environments with adaptive task tuning, in: Proceedings of the 15th International Middleware Conference, ACM, 2014, pp. 97–108.
[30] K. Wang, M. M. H. Khan, Performance prediction for Apache Spark platform, in: 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 7th International Symposium on Cyberspace Safety and Security (CSS), and 12th International Conference on Embedded Software and Systems (ICESS), IEEE, 2015, pp. 166–173.
[31] L. Xu, M. Li, L. Zhang, A. R. Butt, Y. Wang, Z. Z. Hu, MEMTUNE: Dynamic memory management for in-memory data analytic platforms, in: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 1161–1170. doi:10.1109/IPDPS.2016.105.
[32] P. Petridis, A. Gounaris, J. Torres, Spark parameter tuning via trial-and-error, in: INNS Conference on Big Data, Springer, 2016, pp. 226–237.
[33] J. Huang, Tuning Java garbage collection for Spark applications, https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html (2015).
[34] A. Davidson, A. Or, Optimizing shuffle performance in Spark, University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Tech. Rep.
[35] J. Mars, L. Tang, R. Hundt, K. Skadron, M. L. Soffa, Bubble-Up: Increasing utilization in modern warehouse scale computers via sensible co-locations, in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 2011, p. 248. doi:10.1145/2155620.2155650.
[36] C. Delimitrou, C. Kozyrakis, Paragon: QoS-aware scheduling for heterogeneous datacenters, in: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13), 2013, pp. 77–88. doi:10.1145/2451116.2451125.
[37] G. Xu, C.-Z. Xu, S. Jiang, Prophet: Scheduling executors with time-varying resource demands on data-parallel computation frameworks, in: 2016 IEEE International Conference on Autonomic Computing (ICAC), 2016, pp. 45–54.
Appendix A. Table with the variables and metrics used for machine learning

Table A.3: In this table we summarize the different features we used as input for the machine learning models. The monitor system column refers to the system that gathered the metric, namely GMonE, Slim or None, the latter meaning that the metric was derived from other ones.

Table B.4: Here we provide a notation table for the different terms we have used throughout the paper. We also include the section in which each term was first mentioned.