Abstract—Hadoop is the de facto engine that drives current cloud computing practice. The current Hadoop architecture suffers from a single point of failure problem: its job management lacks fault tolerance. If the job management fails, even if its tasks remain active on cloud nodes, the job loses all state information and has to restart from scratch. In this work, we propose a distributed MapReduce engine for Hadoop built on the Distributed Hash Table (DHT) algorithm that drives today's scalable peer-to-peer networks. The distributed Hadoop engine provides the fault-tolerance capability necessary to support efficient job computation in cloud computing, where numerous jobs run at any moment. We have implemented the proposed distributed solution in Hadoop and evaluated its performance under job failures in various network deployments.

I. INTRODUCTION

Cloud computing [1]–[4] is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. With the dramatic upsurge of cloud computing applications and business in the past years, Hadoop [5], [6], thanks to its open source implementation, has become the de facto engine enabling current cloud computing. For example, large businesses such as Yahoo!, Facebook, Amazon and HP all adopt Hadoop as the foundation of the cloud computing services they offer to billions of customers. Hadoop was first developed by Yahoo! and then released as open source to the public. It has evolved from its first version into today's second version, released in 2012.

Hadoop already tolerates task failures: a failed task can simply be reassigned from one node to another node. However, there exists one major potential point of failure in the current Hadoop MapReduce architecture that can result in significant performance degradation: the master. If the master node fails, its managed user jobs cannot be completed because there is currently no redundancy for the master node.

To address the single point of failure (SPOF) issue on the master node of the MapReduce engine, this paper proposes a fault-tolerant architecture for the master node in MapReduce by employing the Distributed Hash Table algorithm [7] that is at the core of today's widely used peer-to-peer networks, e.g. BitTorrent [8], [9]. The distributed, peer-to-peer networked master nodes allow multiple masters to manage one MapReduce user job, or at least be aware that it exists. Therefore, in the event that one master node goes down, there will be at least one more master node still available and running for the same user job that can take over the job management duty for the failed master node. How a master node determines or locates its peer master nodes for the managed user jobs is enabled by the DHT algorithm.

In the rest of the paper, Section II briefs the current Hadoop MapReduce engine architecture, discusses the SPOF issue of this architecture, and presents the motivations for this work. Then, the design and implementation of the proposed distributed MapReduce solution is detailed in Section III. Next, Section IV presents our evaluation of the performance of the proposed distributed architecture. The related work is reviewed in Section V. Finally, Section VI concludes this paper and discusses our future work.
II. HADOOP MAPREDUCE ENGINE

MapReduce1: In MapReduce1, a user job submitted to the JobTracker at the master node is mapped into tasks, and those tasks are dispatched to the remote slave nodes that are close to the data of interest. The JobTracker managing the user job at the master node tracks the status and progress of all tasks of the job through heartbeats exchanged with the TaskTrackers at the slave nodes that manage the tasks. After all tasks are finished, the JobTracker condenses (reduces) the returned task results into a final result and returns it to the user.

[Fig. 1. MapReduce1 architecture: the user submits a job to the JobTracker at the master node, which dispatches tasks to, and receives progress updates from, the TaskTrackers at the slave nodes in the cloud.]
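To make the flow in Fig. 1 concrete, the sketch below is a minimal job driver written against the classic org.apache.hadoop.mapred API used by MapReduce1. It is an illustrative pass-through job of our own, not code from the paper: the identity mapper/reducer and the command-line input/output paths are assumptions made only for this example. JobClient.runJob() submits the job to the JobTracker, which then dispatches the tasks and tracks them via heartbeats as described above.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class PassThroughJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PassThroughJob.class);
            conf.setJobName("pass-through");

            // Identity mapper/reducer simply copy input records to the output.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Blocks until the JobTracker reports that all tasks of the job have finished.
            JobClient.runJob(conf);
        }
    }

Running such a driver with an HDFS input and output path exercises the whole path shown in Fig. 1: submission to the JobTracker, task dispatch to TaskTrackers, progress updates, and the final reduction returned to the user.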
MapReduce2: MapReduce1 was upgraded to MapReduce2 (a.k.a. YARN, Yet Another Resource Negotiator [10]) mainly to address a scalability issue. In MapReduce1, the JobTracker at the master node takes on both responsibilities: a) job scheduling, i.e. identifying proper remote slave nodes for the mapped tasks, and b) tracking the progress and status of the tasks of a managed user job. MapReduce1 has a scalability bottleneck at about 4,000 nodes and has difficulty supporting the very large clusters that are required in many business cases today. MapReduce2 splits these two responsibilities to support very large scales.

Fig. 2 shows the framework architecture of MapReduce2. The ResourceManager at the resource management node takes care of the job scheduling by identifying the nodes to run the mapped tasks of a user job. The ApplicationMaster at the master node tracks the progress and status of the tasks running on slave nodes. The ApplicationMaster processes the returned task results and updates the user.

[Fig. 2. MapReduce2 architecture: the user submits a job to the ResourceManager at the resource management node, which dispatches tasks; the ApplicationMaster at the master node receives progress updates from the tasks and updates the user.]
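For comparison with the MapReduce1 driver above, the following sketch submits the same kind of pass-through job through the newer org.apache.hadoop.mapreduce API; under MapReduce2 such a job is scheduled by the ResourceManager and tracked by a per-job ApplicationMaster, as described above. Again, the job name and paths are illustrative assumptions of our own, not taken from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PassThroughJob2 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pass-through");
            job.setJarByClass(PassThroughJob2.class);

            // No mapper/reducer classes are set, so the framework's identity
            // Mapper and Reducer are used and input records are copied through.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Blocks until the job completes; progress is reported by the
            // ApplicationMaster (MapReduce2) or the JobTracker (MapReduce1).
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }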
B. Research Problems and Motivations

Let us inspect the consequences of failures in the Hadoop MapReduce architecture. As we have mentioned earlier, if a task fails at a slave node, both MapReduce1 and MapReduce2 can take care of this failure by initiating another task, possibly at another node. However, if the master physical node crashes (less likely), or the JobTracker/ApplicationMaster fails (more likely), the status of a running job at the master node is gone, even though its tasks are still running on slave nodes. As a result, the user loses track of his/her job. In MapReduce1, the job has to be restarted from scratch. In MapReduce2, the ResourceManager starts another instance of the user job, also from scratch. As we can see, if a user job requires extensive computation, e.g. a climate simulation, rerunning the job takes a significant amount of resources and time. In MapReduce2, the problem becomes even worse if the ResourceManager fails, because no more user requests can be accepted and the management of the cluster resources is out of control.

Therefore, the Hadoop MapReduce engine has a SPOF problem. It has been revealed that "a single failure can lead to large, variable and unpredictable job running times" [11], [12]. This project is therefore motivated to address this problem and to provide fault tolerance for job management. The goal is that a job is NOT required to restart when the job management fails at one node.

III. DISTRIBUTED MAPREDUCE ENGINE

To address the SPOF problem on the user job management in the MapReduce architecture, we propose a distributed MapReduce engine that offers fault tolerance in managing user jobs.

A. Conceptual Architecture

The distributed MapReduce engine consists of a group of physically distributed nodes that collectively serve as a master network, as shown in Fig. 3. A user job is managed by more than one node in this master network. The master nodes of a user job synchronize their images of the job, but at any one time only one master node serves as the active master node for the job. Namely, the tasks of the job communicate with only one master node, the active master node. When the active master node, or its JobManager, fails, the standby master node(s) and the slave nodes can detect the failure through the heartbeats. Then, a new active master node is elected from the standby master nodes according to a policy discussed later. The communication about the status of the tasks at the slave nodes is then directed to the new active master node so that the job continues running. To the system and the user, it appears as if nothing happened in the job processing. The following discusses how the master network is formed and how the switch takes place when an active master node or its JobManager fails.
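As a rough illustration of the failure detection and takeover described above, the sketch below shows how a standby master might monitor the active master's heartbeats and promote itself. It is a simplified example under our own assumptions, not the paper's implementation: the class, field, and method names are hypothetical, and the lowest-ID election rule stands in for the actual policy, which the paper defers to a later discussion.

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class StandbyMasterMonitor implements Runnable {

        private final String localMasterId;        // e.g. this node's IP address, also its DHT ID
        private final List<String> standbyMasters; // peer standby masters for the same job
        private volatile long lastHeartbeatMillis; // updated whenever the active master synchronizes
        private final long heartbeatTimeoutMillis; // roughly one synchronization cycle

        public StandbyMasterMonitor(String localMasterId, List<String> standbyMasters,
                                    long heartbeatTimeoutMillis) {
            this.localMasterId = localMasterId;
            this.standbyMasters = standbyMasters;
            this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
            this.lastHeartbeatMillis = System.currentTimeMillis();
        }

        /** Called when a synchronization (heartbeat) message arrives from the active master. */
        public void onHeartbeat() {
            lastHeartbeatMillis = System.currentTimeMillis();
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                long silence = System.currentTimeMillis() - lastHeartbeatMillis;
                if (silence > heartbeatTimeoutMillis && isElectedLeader()) {
                    promoteToActiveMaster();
                    return;
                }
                try {
                    TimeUnit.MILLISECONDS.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        /** Illustrative policy only: the standby with the smallest ID takes over. */
        private boolean isElectedLeader() {
            return standbyMasters.stream().allMatch(id -> localMasterId.compareTo(id) <= 0);
        }

        private void promoteToActiveMaster() {
            // Resume the job from the last synchronized image and tell the slave
            // nodes to redirect task status updates to this node.
        }
    }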
nodes as the TaskTracker does in MapReduce1. Moreover, the ResourceManager itself should have a distributed architecture too, because it has a SPOF as well. The distributed ResourceManager architecture could be accomplished with DHT, as in the distributed JobTracker implementation in MapReduce1.
IV. EVALUATION

We have evaluated the distributed MapReduce solution implemented on MapReduce1. The evaluation focuses on the latency and success ratio of the master switch in failure cases, and on the incurred network traffic overhead. We first present the evaluation platform and methodologies, then discuss the evaluated metrics and the experimental results.

A. Platform and Methodologies

The evaluation was carried out on a Hadoop cloud consisting of virtual machines. We have two Ubuntu Linux host machines configured into two LANs, and each of them has five virtual machines installed and configured with the Hadoop cloud computing platform. The Hadoop platform includes both the MapReduce1 package and the Hadoop Distributed File System (HDFS). The evaluation configuration is illustrated in Fig. 4. We set the master synchronization cycle to ten times the task update cycle. The short-term cache duration on the task nodes was then linked to the synchronization cycle in the settings; namely, the task nodes cached the last ten updates locally. The two virtual networks of the virtual machines were configured as the subnets 192.168.0.0/24 and 192.168.1.0/24, respectively. The IP address of each node was used as the node ID for the DHT. Because of the limitation of the physical computing resources, only 80 jobs were submitted to run in the cloud. We configured each master network to contain only two master nodes (i.e. one active and the other standby). With the preference that master nodes should be separated into different networks if possible, the active master nodes were basically in the 192.168.0.0 network and the standby master nodes were in 192.168.1.0. We emulated an active master failure by purging the JobTracker instance of an active master node.

[Fig. 4. Evaluation configuration: a Hadoop cloud of ten virtual machines (VM1-VM5 and VM6-VM10) hosted on two machines in two networks.]

Switch latency: This metric refers to the time from the failure of an active master until a new active master takes over the management of the job and its tasks. To avoid the error incurred by the system time difference between two host machines, we limited this experiment to only one network; namely, both the active and the standby master nodes are in the same network on the same host machine. We measured 1000 master switches and averaged the latency. The latency was normalized by the synchronization cycle. From the measurements in Table I, we observe that the switch latency is a little over half a synchronization cycle. This is reasonable because the worst case occurs when a failure happens immediately after a synchronization and the standby node has to wait until the end of almost the whole cycle to detect the failure. Therefore the maximum latency should be about a synchronization cycle plus the inquiry message timeout. The minimum latency should be close to the inquiry message timeout, when a failure occurs near the end of a synchronization cycle. Therefore, with a uniform distribution of failure times, the expected latency should be half of the synchronization cycle plus the inquiry message timeout.
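The reasoning above can be written compactly. Writing T_s for the synchronization cycle and T_o for the inquiry message timeout (symbols introduced here only for this summary), and assuming the failure instant is uniformly distributed within a cycle:

    L_{\min} \approx T_o, \qquad L_{\max} \approx T_s + T_o, \qquad E[L] \approx \frac{T_s}{2} + T_o

This is consistent with the normalized mean latency of about 0.51 reported in Table I when T_o is small relative to T_s.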
Switch success ratio: This metric is defined as the number of successful master switches divided by the number of switch attempts (or the number of active master failures). This metric is important in that it shows how much fault tolerance is provided by the distributed MapReduce architecture. Ideally, the ratio should be 100%, but it is hurt by network transmission problems such as packet loss. Over the 1000 master switches tested, the measurements show a success ratio of 97%, which indicates that the distributed solution is effective in providing fault tolerance to user jobs.

Network overhead: This metric measures the overhead of network traffic incurred by the formation of the master network, synchronization, and the active master switch. It should grow with the size of the master network, because more copies of a message have to be sent during synchronization and a master switch. Therefore, the measurement result is normalized by the number of master switches. As observed from Table I, the traffic is about 5 messages per master switch in our case of only two master nodes and three tasks associated with a job. With each message taking nearly 0.5 μs in Gbps networks, the network cost resulting from a switch in the distributed MapReduce solution is about 2.5 μs.

TABLE I. EVALUATION RESULTS

    Metric                                                  Result
    Switch latency (normalized to synchronization cycle)    0.51
    Switch success ratio                                    0.97
    Network overhead                                        0.5 μs
V. RELATED WORK

MapReduce is the programming model that was developed by Google [15] and that Hadoop uses to process data [6]. The datasets processed in Hadoop can be, and often are, much larger than any one computer can ever process [16]. MapReduce organizes how that data is processed over many computers (anywhere from a few to a few thousand) [16].

The Distributed Hash Table (DHT) was first proposed in the work on Chord [7] by MIT to support overlay peer-to-peer networks for content distribution, with some other DHT implementations available such as CAN [17], Pastry [18], and Tapestry [19]. Chord is a distributed lookup protocol that specializes in mapping keys to nodes [8]. Data location can be easily implemented on top of Chord by associating a key with each data item and storing the key/data pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. This is an especially good fit for MapReduce because MapReduce already maps key-value pairs to process data, and mappers and reducers are constantly being brought up and shut down [16]. Chord features load balancing because it acts as a distributed hash function and spreads keys evenly over the participating nodes. Chord is also decentralized; no node is of greater importance than any other node. Chord scales automatically, without any need for tuning to achieve success at scale.
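To illustrate the key-to-node mapping described above, the following is a minimal consistent-hashing sketch in the spirit of Chord. It is our own simplified example, not the Chord protocol itself or the implementation used in this paper: a single flat ring with linear successor lookup, no finger tables, and joins handled by simple insertion.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ConsistentHashRing {
        // Maps a position on the identifier circle to the node ID placed there.
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public void addNode(String nodeId) {      // e.g. the node's IP address
            ring.put(hash(nodeId), nodeId);
        }

        public void removeNode(String nodeId) {
            ring.remove(hash(nodeId));
        }

        /** Returns the node responsible for the key: its successor on the circle. */
        public String lookup(String key) {
            if (ring.isEmpty()) {
                throw new IllegalStateException("no nodes in the ring");
            }
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("SHA-1")
                                        .digest(s.getBytes(StandardCharsets.UTF_8));
                // Use the first 8 bytes of the SHA-1 digest as a 64-bit circle position.
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xffL);
                }
                return h;
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
    }

In the setting of this paper, the same idea could be applied by hashing a job identifier onto a ring of master node IDs: the successor node would serve as the active master and the following node(s) as standby masters; this is an assumption for illustration, not a statement of the paper's exact formation scheme.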
Some other fault-tolerance efforts have been proposed for cloud computing. Cassandra [20] is a peer-to-peer based solution proposed by Facebook engineers to address fault tolerance in distributed database management. It eliminates the SPOF problem in distributed data storage. Other efforts addressing data fault tolerance include work such as [21]. YARN [10] is proposed in MapReduce2 of Hadoop to address fault tolerance in HDFS. It solves the SPOF problem in an HDFS cluster. So far, no solution has been proposed to address the SPOF problem in job management as this work does.

VI. CONCLUSION

This paper presents a fault-tolerant MapReduce engine for cloud computing. The fault tolerance is enabled by a distributed solution based on the DHT algorithm. In the solution, a network of master nodes is formed to provide job management. A failed active master node is replaced by its next standby peer node for job management, and thereby the running of the job is maintained. The solution has been implemented in the Hadoop MapReduce1 engine and evaluated to provide high fault tolerance with low latency and networking cost. Our next step is to implement this solution in the current Hadoop MapReduce2 for more extensive performance evaluation and to release it as open source.
VII. ACKNOWLEDGEMENT

The authors would like to thank Gordon Pettey, who helped with the extensive source code implementation of the solution. Special thanks are given to the National Science Foundation for award #1041292, which supported Gordon's work on this project as an REU student. We would also like to thank our reviewers for their precious comments that made this work better.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010. [Online]. Available: http://doi.acm.org/10.1145/1721654.1721672
[2] P. Mell and T. Grance, "The NIST definition of cloud computing (draft)," NIST Special Publication, vol. 800, no. 145, p. 7, 2011.
[3] T. Velte, A. Velte, and R. Elsenpeter, Cloud Computing: A Practical Approach. McGraw-Hill, Inc., 2009.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] D. Borthakur, "The Hadoop distributed file system: Architecture and design," 2007.
[6] T. White, Hadoop: The Definitive Guide. O'Reilly, 2012.
[7] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: a scalable peer-to-peer lookup protocol for internet applications," IEEE/ACM Trans. Netw., vol. 11, no. 1, pp. 17–32, 2003.
[8] B. Cohen, "The BitTorrent protocol specification," 2008.
[9] D. Qiu and R. Srikant, "Modeling and performance analysis of BitTorrent-like peer-to-peer networks," ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 367–378, 2004.
[10] A. C. Murthy, C. Douglas, M. Konar, O. O'Malley, S. Radia, S. Agarwal, and V. KV, "Architecture of next generation Apache Hadoop MapReduce framework," Apache Hadoop, Tech. Rep., 2011.
[11] F. Dinu and T. Ng, "Understanding the effects and implications of compute node related failures in Hadoop," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2012, pp. 187–198.
[12] F. Dinu and T. E. Ng, "Analysis of Hadoop's performance under failures," Rice University, Tech. Rep., 2012.
[13] L. Karsten and K. Sven, "Open Chord." [Online]. Available: http://open-chord.sourceforge.net
[14] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash et al., "Apache Hadoop goes realtime at Facebook," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011, pp. 1071–1080.
[15] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[16] J. Lin and C. Dyer, "Data-intensive text processing with MapReduce," Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1–177, 2010.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, A Scalable Content-Addressable Network. ACM, 2001, vol. 31, no. 4.
[18] A. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems," in Middleware 2001. Springer, 2001, pp. 329–350.
[19] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, "Tapestry: A resilient global-scale overlay for service deployment," IEEE Journal on Selected Areas in Communications, vol. 22, no. 1, pp. 41–53, 2004.
[20] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.
[21] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, "Making cloud intermediate data fault-tolerant," in Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010, pp. 181–192.