
Stateful MapReduce

Ahmed Elgohary
Electrical and Computer Engineering Department
University of Waterloo
200 University Avenue West, Waterloo, ON, Canada

Hadoop is considered the cornerstone of today's cloud analytics, and much work is being carried out to develop and enhance its capabilities. However, an opposite research direction has started to emerge, in which researchers argue that Hadoop is not suitable for some applications and that new frameworks therefore need to be developed. Examples of such applications are graph analytics, online incremental processing, and iterative algorithms. In this paper, we envision that by adding and maintaining states across multiple Hadoop jobs, a wide range of applications will fit perfectly into Hadoop, eliminating the need to develop new frameworks. We present a Stateful MapReduce API together with an efficient design and implementation that extends Hadoop. Our experimental evaluation demonstrates the effectiveness of the proposed extensions.

Categories and Subject Descriptors
D.3 [Programming Techniques]: Concurrent Programming, Distributed programming

General Terms
Distributed computing, Cloud Computing, MapReduce, Hadoop



MapReduce [6] was presented as a framework to facilitate large-scale data analytics over a distributed environment of commodity machines. Users implement two functions (map and reduce), and the underlying framework takes care of partitioning the job and executing it over the available machines in the cluster. The framework also takes responsibility for handling machine failures and for dealing with the heterogeneity of machine specifications. The processed dataset is modelled as key-value pairs. The map function is invoked for each key-value pair of the input dataset, and its output is also a set of key-value pairs. In the reduce phase, the key-value pairs produced by the map phase are grouped by key, and the reduce function is invoked for each group (a key and its set of values). The reduce output is also a set of key-value pairs, whose combination represents the job output. The map and reduce functions can be denoted as:
map(KeyIn, ValIn): <KeyOut, ValOut>
reduce(KeyIn, <List> ValsIn): <KeyOut, ValOut>

Hadoop [7] is the most commonly used MapReduce implementation. Much work is being carried out to develop and improve its capabilities. For example, the authors in [2] presented policies for grouping and scheduling multiple MapReduce jobs in order to improve the overall system throughput. In [12], opportunities for sharing portions of the work carried out by multiple MapReduce jobs were identified, and an analytical model for grouping jobs together was developed accordingly. Another interesting direction in Hadoop development was presented in [8], where the authors considered the problem of automatically tuning Hadoop parameters based on the expected behaviour of the submitted jobs. Also, [1] presented a hybrid model that combines Hadoop with relational databases in order to enhance the system's

Recently, researchers have started to argue that Hadoop (or the MapReduce framework in general) is not suitable for some applications, so new frameworks need to be developed to suit these applications. For example, the authors in [11] stated that graph algorithms do not fit into MapReduce, so they built a totally new framework (Pregel) designed specifically for graph processing. In [10], a new architecture for stateful bulk processing for dataflow programs was presented. In [5], the authors were concerned about using Hadoop for online analytics due to the latency introduced by materializing the intermediate data. Hence, they proposed a modified MapReduce architecture that allows data to be pipelined between operators. For iterative processing using Hadoop, the authors in [4] developed Haloop, in which loop-invariant data are cached locally at the worker machines.
It can be noticed from the paragraph above that we will eventually end up with several frameworks (totally new frameworks or different variations of Hadoop). In this paper, we envision that by only adding and maintaining states across multiple job submissions, all the above applications will fit perfectly into Hadoop. We present Stateful MapReduce, in which user-accessible states are maintained internally by the system. We show that these states can be exploited to ease the programming of many applications in MapReduce and also to enhance the performance of those applications. Efficient Hadoop extensions are proposed so that the states are maintained with minimal additional overhead.
The rest of this paper is organized as follows. In Section 2, the Stateful MapReduce API is described, followed by Section 3, in which examples of applications that can utilize the states are presented. Section 4 gives the details of our design and implementation to extend Hadoop to support the Stateful API. In Section 5, our evaluation process is described, followed by the obtained results. In Section 6, we discuss the achieved results and present our next directions. Finally, the paper is concluded in Section 7.



We modified the MapReduce API to give users access to the states. Users can store and retrieve key-value pairs to and from the state of each task (mapper or reducer). The stateful mapper and reducer are defined as:
map(KeyIn, ValIn, State): <KeyOut, ValOut>
reduce(KeyIn, <List> ValsIn, State): <KeyOut, ValOut>
Users can access State as follows:
int count = state.get("count")
state.set("count", count)
Users also need to specify which tasks are stateful and which are stateless. The API is flexible enough that users can combine stateful and stateless tasks in the same job:
StatefulJob job = new StatefulJob()
job.setStatefulMapper(StatefulMapper, IsReadOnly)
job.setStatefulReducer(StatefulReducer, IsReadOnly)
Initialization functions can be provided so that the system sets the initial state of each task. How states are initialized is up to the user. For example, initial states might be loaded from HBase or an HDFS file, or set programmatically.
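
To make the get/set contract concrete, the following is a minimal Python sketch of a per-task state object. It is an illustration only, not the Hadoop extension itself: the class name State, the init_fn and read_only parameters, and the helper load_initial_counts are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict, Optional


class State:
    """Hypothetical per-task key-value state mirroring state.get / state.set."""

    def __init__(self,
                 init_fn: Optional[Callable[[], Dict[str, Any]]] = None,
                 read_only: bool = False) -> None:
        # The user-provided initialization function supplies the initial state.
        self._data: Dict[str, Any] = dict(init_fn()) if init_fn else {}
        self._read_only = read_only

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        # A read-only state can only be populated by its initialization function.
        if self._read_only:
            raise PermissionError("state is read-only")
        self._data[key] = value


def load_initial_counts() -> Dict[str, Any]:
    # Stand-in for loading the initial state from HBase or an HDFS file.
    return {"count": 0}


state = State(init_fn=load_initial_counts)
count = state.get("count")
state.set("count", count + 1)
```

In this sketch a read-only state rejects every call to set, which corresponds to the IsReadOnly flag passed when registering a stateful mapper or reducer.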



This section presents example applications for which Stateful MapReduce can be exploited.


Online/Incremental Analytics: Sessionization

In online/incremental analytics, users have data sources that continuously produce new data, and the task is usually to measure some statistics or obtain certain results using the data that the system has received so far. The traditional way of carrying out this task using Hadoop is to collect the data received so far and resubmit it over and over again as MapReduce jobs. This approach results in a significant amount of redundant communication and computation. Even the work done in [5] was concerned with pipelining the intermediate data across tasks, which does not tackle the problem of repeating the communications and computations of these intermediate data.
Using Stateful MapReduce, users can use the states to store the results computed from the data received so far. When new data arrives at the system, only the new data is submitted as the job input and used to update the stored states. This way, the system only processes the newly arriving data, which eliminates all the redundant communications and computations.
As an example, we consider the sessionization problem, in which user clicks are analyzed to determine how many urls each user has clicked so far. A stateless mapper extracts the user ID from each input record, and a stateful reducer is used. The state of each reduce task stores the total number of urls each user has clicked (denoted by count in the pseudo code below). As the reduce function is invoked for each user ID (the reduce KeyIn), the total count is retrieved, updated, and stored back. The pseudo code of the map and reduce functions is provided below:
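map(logRecordId, logRecord)
    // stateless mapper: emit one click per log record, keyed by user
    userId = extractUserID(logRecord)
    EmitIntermediate(userId, 1)

reduce(userId, <List> clicks, state)
    // stateful reducer: add the new clicks to the stored running total
    count = state.get("count")
    count = count + sum(clicks)
    state.set("count", count)
    Emit(userId, count)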
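
The incremental behaviour can be simulated in a few lines of Python. This is a single-process sketch under stated assumptions, not the Hadoop implementation: the record format "userId,url", the helper run_job, and the choice to key the reducer state by user ID are all hypothetical and made only for illustration.

```python
from collections import defaultdict


def map_click(record):
    # Stateless mapper: the record format "userId,url" is an assumption.
    user_id = record.split(",")[0]
    yield user_id, 1


def reduce_clicks(user_id, clicks, state):
    # Stateful reducer: retrieve the running total, update it, store it back.
    count = state.get(user_id, 0) + sum(clicks)
    state[user_id] = count
    return user_id, count


def run_job(records, state):
    # Group the mapper output by key, then invoke the reducer once per key.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_click(record):
            grouped[key].append(value)
    return dict(reduce_clicks(k, v, state) for k, v in grouped.items())


state = {}  # survives across job submissions
first = run_job(["u1,/a", "u2,/b", "u1,/c"], state)
second = run_job(["u1,/d"], state)  # only the newly arrived record is submitted
```

The second submission processes a single record yet reports the full count for u1, because the earlier total is read from the state instead of being recomputed from the old logs.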


Iterative Algorithms: PageRank

Iterative algorithms can be executed on top of Hadoop by considering each job submission a single iteration. To do multiple iterations, the job is resubmitted to the system after changing the appropriate settings. It was noticed in [4] that the performance of iterative algorithms over Hadoop can be improved by caching loop-invariant data locally at the worker nodes, which eliminates the need to redundantly include this data in the communication between the nodes of the system. Using Stateful MapReduce, the states can be used to store loop-invariant data, and the underlying state management components take responsibility for maintaining these states over the different iterations (job submissions).
As an example of an iterative algorithm, we consider PageRank. Using Hadoop or Haloop [4], each PageRank iteration is carried out using two MapReduce jobs. Using stateful MapReduce, each iteration can be expressed as a single job. Here we use a read-only stateful mapper and a stateful reducer. The state of each mapper includes the list of hyperlinks of each page, and the state of each reducer includes the current rank of each page. The input to the map function is the url and its current rank. The map function outputs each of the hyperlinks along with the portion of the url's rank that it will receive. In the reduce phase, the portions each hyperlink has received are aggregated together with its current rank. The following pseudo code shows both the map and reduce functions.
map(url, currentRank, state)
    Iterator hyperlinks = state.get("hyperlinks")
    rankPortion = currentRank / size(hyperlinks)
    for each hyperlink in hyperlinks
        EmitIntermediate(hyperlink, rankPortion)

reduce(hyperlink, <List> rankPortions, state)
    currentRank = state.get("current-rank")
    newRank = currentRank + sum(rankPortions)
    state.set("current-rank", newRank)
    Emit(hyperlink, newRank)
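
A single iteration of this job can be sketched in Python as follows. The sketch follows the update rule of the pseudo code above (current rank plus received portions), not a damped PageRank formula. The function name pagerank_iteration, the three-page graph, and the choice to key both states by url are assumptions made for illustration.

```python
from collections import defaultdict


def pagerank_iteration(ranks, mapper_state, reducer_state):
    # Map phase: split each page's rank evenly over its outgoing hyperlinks.
    # mapper_state is read-only and holds the loop-invariant link structure.
    portions = defaultdict(list)
    for url, current_rank in ranks.items():
        hyperlinks = mapper_state[url]
        if not hyperlinks:
            continue
        portion = current_rank / len(hyperlinks)
        for link in hyperlinks:
            portions[link].append(portion)

    # Reduce phase: add the received portions to the stored current rank.
    new_ranks = {}
    for page, received in portions.items():
        new_rank = reducer_state[page] + sum(received)
        reducer_state[page] = new_rank
        new_ranks[page] = new_rank
    return new_ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # mapper state
rank_state = {"a": 1.0, "b": 1.0, "c": 1.0}         # reducer state
ranks = pagerank_iteration(dict(rank_state), links, rank_state)
```

Only the (url, rank) pairs travel between iterations; the link structure stays in the mapper state, which is the saving over resubmitting the whole graph each time.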


Graph Processing: Shortest Path

The authors of [11] argued that MapReduce is not suitable for graph processing. Carrying out graph algorithms using Hadoop results in a significant amount of redundant communication, because the graph structure is resubmitted as input at each new job submission. However, this problem can be solved by storing the graph structure as states, which eliminates the redundant communication. The basic idea that Pregel relies on is message passing between the nodes of the graph. It was mentioned that graph algorithms can be expressed easily in this way.
Using Stateful MapReduce, a read-only stateful mapper and a stateful reducer are used. The mapper state includes the outgoing edges of each node and their weights. In the map phase, each node can send a message to its neighbors by simply outputting the neighbor ID along with the message. The MapReduce framework collects the messages sent to each node and passes them to that node in the reduce phase.
For instance, we consider the single-source shortest path problem. The mapper input is the current distance to each node. The mapper sends a message to each neighbor of a node indicating a possible new distance to that neighbor. In the reduce phase, the message with the minimum distance is obtained and the node's minimum distance is updated. The current distance to each node is stored as reducer state. The following pseudo code shows the implementation of the map and reduce functions.

map(nodeID, minDistance, state)
    Iterator outEdges = state.get("outgoing-edges")
    edge =

reduce(nodeID, <List> distances, state)
    currentDistance = state.get("current-distance")
    newMinDistance = min(min(distances), currentDistance)
    state.set("current-distance", newMinDistance)
    Emit(nodeID, newMinDistance)

[Figure 1: The Modified Hadoop Architecture to Support Stateful MapReduce. Components shown: the JobTracker (jobs queue, scheduler, and a BackupStates table with key <jobName, M/R, IdWithinJob> and value <off, len, previous TaskTracker>), the TaskTracker (States table), and the execution JVM.]
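
The shortest-path job described above can be sketched in Python as follows. This is a single-process illustration under stated assumptions: the function name sssp_iteration, the small example graph, the integer weights, and the choice to key both states by node ID are hypothetical. The map step implements the message described in the text (current distance plus edge weight).

```python
import math
from collections import defaultdict


def sssp_iteration(distances, mapper_state, reducer_state):
    # Map phase: each reachable node proposes a distance to every neighbor.
    # mapper_state is read-only and holds the outgoing edges with weights.
    messages = defaultdict(list)
    for node_id, min_distance in distances.items():
        if math.isinf(min_distance):
            continue  # unreachable so far, nothing useful to send
        for neighbor, weight in mapper_state.get(node_id, []):
            messages[neighbor].append(min_distance + weight)

    # Reduce phase: keep the smaller of the best message and the stored distance.
    for node_id, candidates in messages.items():
        reducer_state[node_id] = min(min(candidates), reducer_state[node_id])
    return dict(reducer_state)


edges = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}  # mapper state
state = {"s": 0, "a": math.inf, "b": math.inf}                 # reducer state
distances = dict(state)
for _ in range(len(edges) - 1):   # one job submission per iteration
    distances = sssp_iteration(distances, edges, state)
```

After two submissions the distance to b drops from 4 (direct edge) to 3 (via a), and the edge lists were never resent as job input.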



In this section, the proposed design and implementation details of extending Hadoop to support the Stateful MapReduce API described in Section 2 are given. For Stateful MapReduce to be acceptable, states should be maintained with minimal additional overhead. Also, the new API should not affect the scalability and fault tolerance of Hadoop.
In the basic architecture of Hadoop, a JobTracker process runs on the master node and a TaskTracker process runs on each slave node. When a job is submitted to the system, the JobTracker initializes the job, creates the map and reduce tasks, and then adds the job to the execution queue. TaskTrackers communicate with the JobTracker through heartbeat messages. When the JobTracker receives a heartbeat from a TaskTracker indicating that this TaskTracker can accept new tasks, the task scheduler picks suitable tasks from the jobs queue and assigns them to that TaskTracker. The task scheduler tries to assign map tasks to the machines where their inputs reside. The TaskTracker creates a new execution JVM for each task.
The proposed extensions are based on three principles: 1) states are maintained locally at the TaskTrackers, 2) a persistent copy of each state is written to HDFS, and 3) at the end of each task, the JobTracker is informed of the location of the persistent state of that task.
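
The three principles above can be illustrated with a small Python sketch. It is a toy model, not the Hadoop code: the class name Cluster, the method names load and save, and the use of plain dictionaries in place of HDFS and the JobTracker's table are assumptions made for this example.

```python
from typing import Any, Dict, Tuple

TaskId = Tuple[str, str, int]  # hypothetical id: (jobName, "M" or "R", idWithinJob)


class Cluster:
    """Toy model of local state tables, a persistent store, and a backup table."""

    def __init__(self) -> None:
        self.local_tables: Dict[str, Dict[TaskId, Dict[str, Any]]] = {}
        self.persistent: Dict[TaskId, Dict[str, Any]] = {}  # stands in for HDFS
        self.backup_table: Dict[TaskId, str] = {}           # kept by the JobTracker

    def load(self, task_id: TaskId, tracker: str):
        # Prefer the copy held locally by this TaskTracker.
        local = self.local_tables.setdefault(tracker, {})
        if task_id in local:
            return dict(local[task_id]), "local"
        # Otherwise fall back to the persistent copy.
        if task_id in self.persistent:
            return dict(self.persistent[task_id]), "persistent"
        return {}, "new"

    def save(self, task_id: TaskId, tracker: str, state: Dict[str, Any]) -> None:
        # 1) keep the state locally, 2) write a persistent copy,
        # 3) record for the JobTracker which TaskTracker ran the task.
        self.local_tables.setdefault(tracker, {})[task_id] = dict(state)
        self.persistent[task_id] = dict(state)
        self.backup_table[task_id] = tracker


cluster = Cluster()
task = ("sessionize", "R", 0)
state, source_first = cluster.load(task, "tt1")
state["count"] = 5
cluster.save(task, "tt1", state)
_, source_same = cluster.load(task, "tt1")
_, source_other = cluster.load(task, "tt2")
```

Rerunning the task on the same tracker finds the state locally, while running it elsewhere falls back to the persistent copy, which is the case the design tries to make rare.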
Figure 1 shows our modifications to the overall system architecture. A BackupStates table is maintained by the JobTracker to store the location of the persistent state of each task along with the previous TaskTracker on which the task was run (which is the TaskTracker that stores the state locally). When receiving a new submission of a stateful job, the JobTracker adds the backup information to each initialized task. Running a task on the same TaskTracker on which it ran in the previous submission becomes an additional scheduling criterion of the task scheduler. For a reduce task, the scheduler first tries to find a task whose state is stored locally (in memory) on the TaskTracker. If it fails, it tries to find a task whose persistent state is stored at the TaskTracker. If that also fails, it schedules any available task on the TaskTracker, and only in this case is the state retrieved from a remote machine. Improving the scheduling of stateful map tasks is left as future work.
At each TaskTracker, an in-memory table is used to store the states of the tasks previously run on that machine. When a new task is scheduled on a certain node, the TaskTracker on that node tries to retrieve the state of this task from its States table; if the state is not found there, it is retrieved from the persistent storage by the execution JVM. At the end of task execution, a persistent copy of the task state is written to the persistent storage, and the States table of the TaskTracker is also updated with the new state. In the current implementation, HDFS is used as the persistent storage. However, any other persistent storage, such as HBase, can be used. The BackupStates table of the JobTracker is updated with the location of the persistent copy of each state.
Users can indicate that a task state is read-only, which means that the state is set only by the initialization function provided by the user. Examples are the mapper states of the PageRank and Shortest Path examples described in Sections 3.2 and 3.3. In this case, the system does not store a persistent copy. Instead, the system invokes the initialization function to recreate the task state when needed.
One important design consideration was maximizing the optimization opportunities, to make it easy to integrate stateful MapReduce with the Hadoop improvements that are being developed. In the presented design, the system treats each resubmission of a job as a new job, allowing the system to carry out all possible optimizations depending on the current system status. It is also worth noting that in our design each task performs at most one persistent storage access to retrieve or store all the states of the map or reduce keys processed in it, which reduces the latency incurred by maintaining states.


In this section, the experimental evaluation of stateful MapReduce is described, followed by the achieved results.
The sessionization example given in Section 3.1 is used as the evaluation task. We used the WorldCup98 dataset [3], which consists of the requests made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998. Each log entry contains the request timestamp, client ID, and object ID. The goal of our task was to count the number of objects requested by each user as indicated in the logs received so far. The total size of the dataset was 13.3GB. To simulate online incremental processing, the dataset was partitioned into 10 equal-size portions, and the system is notified of the arrival of each portion separately. We compared the running time of two MapReduce jobs: 1) a stateless MapReduce job and 2) a stateful MapReduce job. In the stateless job, the system combines all the logs received so far after each notification and resubmits all of them as a new MapReduce job. In the stateful job, stateful reducers are used to maintain the total number of objects requested by each user so far, and only the newly arriving logs are submitted as the job input. The running time of processing each notification is recorded, in addition to the latency overhead introduced by maintaining the states in the stateful job.

[Figure 2: Comparing the Running Time of Stateful and Stateless Sessionization Task. Series: Stateless MapReduce, Stateful MapReduce. x-axis: Run Number (Data Size GB), surviving ticks 4(5.32), 5(6.65), 6(7.98), 7(9.31), 8(10.64), 9(11.97), 10(13.3); y-axis: Running Time (mins).]

The evaluation infrastructure consisted of a cluster of 10 slave Amazon EC2 small instances in addition to 1 master small instance. Each instance had 1.7GB memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), and 160GB local storage. All the machines were running fedora-core linux, java 1.6.0_07, and Hadoop 0.203.0. We created a new customized Amazon Machine Image (AMI) on which the Stateful MapReduce implementation inside Hadoop 0.203.0 was deployed, and recreated a similar cluster to run the stateful jobs. All default Hadoop configurations were left unchanged except for the number of reducers. We used 25 reducers for both experiments.
Figure 2 shows the running time of the jobs launched after each notification. Using stateless MapReduce, the running time keeps increasing as more data is received, which indicates the low performance of stateless MapReduce in such applications, especially when much data needs to be processed. On the other hand, stateful MapReduce achieves an almost constant running time as more data arrives at the system, since it avoids all the redundant communications (resubmitting all the previously received logs after each new notification) and computations (recounting the number of objects requested by each user).
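
The difference in submitted input can be made concrete with a short Python calculation. It only uses the numbers stated above (13.3GB split into 10 equal portions) and illustrates input volume, not measured running time; the variable names are chosen for this example.

```python
PORTION_GB = 13.3 / 10   # ten equal-size portions of the 13.3GB dataset
RUNS = 10

# Stateless: run k resubmits every portion received so far.
stateless_input = [PORTION_GB * k for k in range(1, RUNS + 1)]
# Stateful: every run submits only the newly arrived portion.
stateful_input = [PORTION_GB] * RUNS

total_stateless = sum(stateless_input)   # about 73.15 GB over all runs
total_stateful = sum(stateful_input)     # about 13.3 GB over all runs
ratio = total_stateless / total_stateful
```

Over the ten runs the stateless job is handed roughly 5.5 times as much input as the stateful job, and its per-run input grows linearly with the run number, which matches the growing curve in Figure 2.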
To provide an estimate of the overhead incurred by maintaining states, we measured the latency of writing each task state to the persistent storage (HDFS) in addition to the size of the state. Figure 3 shows the average latency and state size of the tasks of each new job submission. As more logs are processed by the system, the state size grows since more user IDs are encountered in the logs. However, the latency of writing the state is almost unaffected as the state size increases. The figure also shows that the latency lies in the range of 3.7 to 5.7 seconds, which can be considered an acceptable cost compared to the significant savings gained by using the stateful API. However, the latency of using other persistent storage needs to be compared to writing directly to HDFS, which is one of the possible future investigations.

[Figure 3: The Average Incurred Latencies Caused By Maintaining Persistent Copy of Each Task State. Panel title: Overhead of Storing a Persistent Copy of each Task State. x-axis: Run Number; y-axes: Average Latency (Sec) and Average State Size (KB).]


This paper started by defining stateful MapReduce and its different applications and benefits, then moved to its design and implementation details inside Hadoop, the most commonly used MapReduce implementation. Afterwards, our experiments to assess the performance of stateful MapReduce were presented.
We believe that what makes the idea of stateful MapReduce interestingly different can be summarized in the following points:

- Unlike Pregel [11] and CBP [10], we did not have to build a totally new framework to support a certain application. As shown in Section 4, stateful MapReduce can easily be integrated into Hadoop, which makes it more attractive to users than relying on new frameworks. Moreover, stateful MapReduce can benefit from all the improvements that are being proposed for Hadoop. Additionally, relying on a single framework (Hadoop) to carry out all the cloud analytics tasks is easier for cluster administrators and operators.
- Stateful MapReduce is more general than Haloop [4], whose application is limited to iterative algorithms. Providing users with full control over the states makes stateful MapReduce suit a wider range of applications than just the iterative ones.
- Online MapReduce [5] can be considered complementary to stateful MapReduce: online MapReduce is concerned with avoiding the latency of materializing the intermediate data, while our work is concerned with avoiding the latency of repeating computations and data transfers.
- Building a system that is aware of the states introduces many optimization opportunities. For example, as described in Section 4, the scheduler at the JobTracker uses the information about the TaskTracker on which a task previously ran to make a better scheduling decision and avoid retrieving states from the persistent storage.

A second set of experiments, in which we investigate the performance of stateful MapReduce using other types of jobs, is currently in progress. In these experiments we consider the PageRank and the Single Source Shortest Path problems described in Sections 3.2 and 3.3 respectively. We prepared a semi-synthetic large graph dataset. The LiveJournal [9] graph, which consists of 4847571 nodes and 68993773 edges, is used. Weights for the edges were generated randomly from the range [0,1]. To enlarge the dataset, a long string was appended to each node ID. We managed to make the graph size around 12GB. We plan to compare the running time of stateless and stateful versions of the two jobs.
There are three other possible directions to investigate in the development of stateful MapReduce:

1. We need to assess the latency incurred when using other persistent storage, such as HBase.

2. In our current implementation we optimized the scheduling of reduce tasks. However, it is more challenging to consider the states when scheduling map tasks. Map task scheduling is based on avoiding loading map input from a remote machine, so loading the task state should also be considered when deciding on the machine on which each map task should be run.

3. Currently, the local copy of each state is maintained in the memory of the slave machines. Users directly access and update that in-memory copy. However, larger states might not fit into a commodity machine's memory, so a local on-disk database engine might be needed to maintain task states while the task is being executed.



In this project, a simple modification to the MapReduce API was envisioned to be beneficial in many ways. Stateful MapReduce widens the range of applications that can easily be written using MapReduce, eliminating the need to build specific APIs for those applications. It also saves a significant amount of computation and communication, which improves the performance of several MapReduce applications. An efficient design to extend Hadoop was provided and evaluated. Evaluation results indicate the performance gain that can be achieved using stateful MapReduce.



[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow., 2(1):922-933, Aug. 2009.
[2] P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. Proc. VLDB Endow., 1(1):958-969, Aug. 2008.
[3] M. Arlitt and T. Jin. 1998 world cup web site access logs., August
[4] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. Haloop: efficient iterative data processing on large clusters. Proc. VLDB Endow., 3(1-2):285-296, Sept.
[5] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. Technical Report UCB/EECS-2009-136, EECS Department, University of California, Berkeley, Oct
[6] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
[7] Apache hadoop.
[8] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB, pages 1111-1122, 2011.
[9] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR, abs/0810.1355, 2008.
[10] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10, pages 51-62, New York, NY, USA, 2010. ACM.
[11] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing - abstract. In Proceedings of the 28th ACM symposium on Principles of distributed computing, PODC '09, pages 6-6, New York, NY, USA, 2009. ACM.
[12] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: sharing across multiple queries in mapreduce. Proc. VLDB Endow., 3(1-2):494-505, Sept. 2010.