Abstract—Hadoop is the de facto engine that drives current cloud computing practice. The current Hadoop architecture suffers from a single point of failure problem: its job management lacks fault tolerance. If the job management fails, even if its tasks remain active on cloud nodes, the job loses all state information and has to restart from scratch. In this work, we propose a distributed MapReduce engine for Hadoop built on the Distributed Hash Table (DHT) algorithm that drives today's scalable peer-to-peer networks. The distributed Hadoop engine provides the fault-tolerance capability necessary to support efficient job computation in cloud computing, where numerous jobs run at any moment. We have implemented the proposed distributed solution in Hadoop and evaluated its performance under job failures in various network deployments.

I. INTRODUCTION

Cloud computing [1]–[4] is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. With the dramatic upsurge of cloud computing applications and business in the past years, Hadoop [5], [6], thanks to its open source implementation, has become the de facto engine enabling current cloud computing. For example, large businesses such as Yahoo!, Facebook, Amazon and HP all adopt Hadoop as the foundation of the cloud computing services they offer to billions of customers. Hadoop was first developed by Yahoo! and then released as open source to the public. It has evolved from its first version into today's second version, released in 2012.

Hadoop already tolerates task failures: a failed task can simply be reassigned from one node to another node. However, there exists one major potential point of failure in the current Hadoop MapReduce architecture that can result in significant performance degradation: the master. If the master node fails, its managed user jobs cannot be completed because there is currently no redundancy for the master node.

To address the single point of failure (SPOF) issue on the master node of the MapReduce engine, this paper proposes a fault-tolerant architecture for the master node in MapReduce by employing the Distributed Hash Table algorithm [7] that is at the core of today's widely used peer-to-peer networks, e.g. BitTorrent [8], [9]. The distributed, peer-to-peer networked master nodes allow multiple masters to manage one MapReduce user job, or at least be aware that it exists. Therefore, in the event that one master node goes down, there will be at least one more master node still available and running for the same user job that can take over the job management duty for the failed master node. How a master node determines or locates its peer master nodes for the managed user jobs is enabled by the DHT algorithm.

In the rest of the paper, Section II briefs the current Hadoop MapReduce engine architecture, discusses the SPOF issue of this architecture, and presents the motivations for this work. Then, the design and implementation of the proposed distributed MapReduce solution is detailed in Section III. Next, Section IV presents our evaluation of the performance of the proposed distributed architecture. The related work is reviewed in Section V. Finally, Section VI concludes this paper and discusses our future work.
II. HADOOP MAPREDUCE ENGINE

MapReduce1: In MapReduce1, a user job submitted to the JobTracker at the master node is mapped into tasks, and those tasks are dispatched to the remote slave nodes that are close to the data of interest. The JobTracker managing the user job at the master node tracks the status and progress of all tasks of the job through heartbeats exchanged with the TaskTrackers at the slave nodes that manage the tasks. After all tasks are finished, the JobTracker condenses (reduces) the returned task results into a final result and returns it to the user.

[Fig. 1. MapReduce1 architecture: the user submits a job to the JobTracker at the master node, which dispatches tasks to, and receives progress updates from, the TaskTrackers at the slave nodes in the cloud.]
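To make the flow in Fig. 1 concrete, the sketch below is a minimal job driver written against the classic org.apache.hadoop.mapred API used by MapReduce1. It is an illustrative pass-through job of our own, not code from the paper: the identity mapper/reducer and the command-line input/output paths are assumptions made only for this example. JobClient.runJob() submits the job to the JobTracker, which then dispatches the tasks and tracks them via heartbeats as described above.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class PassThroughJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PassThroughJob.class);
            conf.setJobName("pass-through");

            // Identity mapper/reducer simply copy input records to the output.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Blocks until the JobTracker reports that all tasks of the job have finished.
            JobClient.runJob(conf);
        }
    }

Running such a driver with an HDFS input and output path exercises the whole path shown in Fig. 1: submission to the JobTracker, task dispatch to TaskTrackers, progress updates, and the final reduction returned to the user.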
MapReduce2: MapReduce1 was upgraded to MapReduce2 (a.k.a. YARN, Yet Another Resource Negotiator [10]) mainly to address a scalability issue. In MapReduce1, the JobTracker at the master node takes on both responsibilities: a) job scheduling, i.e. identifying proper remote slave nodes for the mapped tasks, and b) tracking the progress and status of the tasks of a managed user job. MapReduce1 has a scalability bottleneck at about 4,000 nodes and has difficulty supporting the very large clusters that are required in many business cases today. MapReduce2 splits these two responsibilities to support very large scales.

Fig. 2 shows the framework architecture of MapReduce2. The ResourceManager at the resource management node takes care of the job scheduling by identifying the nodes to run the mapped tasks of a user job. The ApplicationMaster at the master node tracks the progress and status of the tasks running on slave nodes. The ApplicationMaster processes the returned task results and updates the user.

[Fig. 2. MapReduce2 architecture: the user submits a job to the ResourceManager at the resource management node, which dispatches tasks; the ApplicationMaster at the master node receives progress updates from the tasks and updates the user.]
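For comparison with the MapReduce1 driver above, the following sketch submits the same kind of pass-through job through the newer org.apache.hadoop.mapreduce API; under MapReduce2 such a job is scheduled by the ResourceManager and tracked by a per-job ApplicationMaster, as described above. Again, the job name and paths are illustrative assumptions of our own, not taken from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PassThroughJob2 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pass-through");
            job.setJarByClass(PassThroughJob2.class);

            // No mapper/reducer classes are set, so the framework's identity
            // Mapper and Reducer are used and input records are copied through.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Blocks until the job completes; progress is reported by the
            // ApplicationMaster (MapReduce2) or the JobTracker (MapReduce1).
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }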
B. Research Problems and Motivations

Let us inspect the consequences of failures in the Hadoop MapReduce architecture. As we have mentioned earlier, if a task fails at a slave node, both MapReduce1 and MapReduce2 can take care of this failure by initiating another task, possibly at another node. However, if the master physical node crashes (less likely), or the JobTracker/ApplicationMaster fails (more likely), the status of a running job at the master node is gone, even though its tasks are still running on slave nodes. As a result, the user loses track of his/her job. In MapReduce1, the job has to be restarted from scratch. In MapReduce2, the ResourceManager starts another instance of the user job, also from scratch. As we can see, if a user job requires extensive computation, e.g. a climate simulation, rerunning the job takes a significant amount of resources and time. In MapReduce2, the problem becomes even worse if the ResourceManager fails, because no more user requests can be accepted and the management of the cluster resources is out of control.

Therefore, the Hadoop MapReduce engine has a SPOF problem. It has been revealed that "a single failure can lead to large, variable and unpredictable job running times" [11], [12]. This project is therefore motivated to address this problem and to provide fault tolerance for job management. The goal is that a job is NOT required to restart when the job management fails at one node.

III. DISTRIBUTED MAPREDUCE ENGINE

To address the SPOF problem on the user job management in the MapReduce architecture, we propose a distributed MapReduce engine that offers fault tolerance in managing user jobs.

A. Conceptual Architecture

The distributed MapReduce engine consists of a group of physically distributed nodes that collectively serve as a master network, as shown in Fig. 3. A user job is managed by more than one node in this master network. The master nodes of a user job synchronize their images of the job, but at any one time only one master node serves as the active master node for the job. Namely, the tasks of the job communicate with only one master node, the active master node. When the active master node, or its JobManager, fails, the standby master node(s) and the slave nodes can detect the failure through the heartbeats. Then, a new active master node is elected from the standby master nodes according to a policy discussed later. The communication about the status of the tasks at the slave nodes is then directed to the new active master node so that the job continues running. To the system and the user, it appears as if nothing happened in the job processing. The following discusses how the master network is formed and how the switch takes place when an active master node or its JobManager fails.
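As a rough illustration of the failure detection and takeover described above, the sketch below shows how a standby master might monitor the active master's heartbeats and promote itself. It is a simplified example under our own assumptions, not the paper's implementation: the class, field, and method names are hypothetical, and the lowest-ID election rule stands in for the actual policy, which the paper defers to a later discussion.

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class StandbyMasterMonitor implements Runnable {

        private final String localMasterId;        // e.g. this node's IP address, also its DHT ID
        private final List<String> standbyMasters; // peer standby masters for the same job
        private volatile long lastHeartbeatMillis; // updated whenever the active master synchronizes
        private final long heartbeatTimeoutMillis; // roughly one synchronization cycle

        public StandbyMasterMonitor(String localMasterId, List<String> standbyMasters,
                                    long heartbeatTimeoutMillis) {
            this.localMasterId = localMasterId;
            this.standbyMasters = standbyMasters;
            this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
            this.lastHeartbeatMillis = System.currentTimeMillis();
        }

        /** Called when a synchronization (heartbeat) message arrives from the active master. */
        public void onHeartbeat() {
            lastHeartbeatMillis = System.currentTimeMillis();
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                long silence = System.currentTimeMillis() - lastHeartbeatMillis;
                if (silence > heartbeatTimeoutMillis && isElectedLeader()) {
                    promoteToActiveMaster();
                    return;
                }
                try {
                    TimeUnit.MILLISECONDS.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        /** Illustrative policy only: the standby with the smallest ID takes over. */
        private boolean isElectedLeader() {
            return standbyMasters.stream().allMatch(id -> localMasterId.compareTo(id) <= 0);
        }

        private void promoteToActiveMaster() {
            // Resume the job from the last synchronized image and tell the slave
            // nodes to redirect task status updates to this node.
        }
    }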
nodes as the TaskTracker does in MapReduce1. Moreover, the ResourceManager itself should have a distributed architecture too, because it has a SPOF as well. The distributed ResourceManager architecture could be accomplished with DHT, as in the distributed JobTracker implementation in MapReduce1.
IV. EVALUATION

We have evaluated the distributed MapReduce solution implemented on MapReduce1. The evaluation focuses on the latency and success ratio of the master switch in failure cases, and on the incurred network traffic overhead. We first present the evaluation platform and methodologies, then discuss the evaluated metrics and the experimental results.

A. Platform and Methodologies

The evaluation was carried out on a Hadoop cloud consisting of virtual machines. We have two Ubuntu Linux host machines configured into two LANs, and each of them has five virtual machines installed and configured with the Hadoop cloud computing platform. The Hadoop platform includes both the MapReduce1 package and the Hadoop Distributed File System (HDFS). The evaluation configuration is illustrated in Fig. 4. We set the master synchronization cycle to ten times the task update cycle. The short-term cache duration on the task nodes was then linked to the synchronization cycle in the settings; namely, the task nodes cached the last ten updates locally. The two virtual networks of the virtual machines were configured as the subnets 192.168.0.0/24 and 192.168.1.0/24, respectively. The IP address of each node was used as the node ID for the DHT. Because of the limitation of the physical computing resources, only 80 jobs were submitted to run in the cloud. We configured each master network to contain only two master nodes (i.e. one active and the other standby). With the preference that master nodes should be separated into different networks if possible, the active master nodes were basically in the 192.168.0.0 network and the standby master nodes were in 192.168.1.0. We emulated an active master failure by purging the JobTracker instance of an active master node.

[Fig. 4. Evaluation configuration: a Hadoop cloud of ten virtual machines (VM1-VM5 and VM6-VM10) hosted on two machines in two networks.]

Switch latency: This metric refers to the time from the failure of an active master until a new active master takes over the management of the job and its tasks. To avoid the error incurred by the system time difference between two host machines, we limited this experiment to only one network; namely, both the active and the standby master nodes are in the same network on the same host machine. We measured 1000 master switches and averaged the latency. The latency was normalized by the synchronization cycle. From the measurements in Table I, we observe that the switch latency is a little over half a synchronization cycle. This is reasonable because the worst case occurs when a failure happens immediately after a synchronization and the standby node has to wait until the end of almost the whole cycle to detect the failure. Therefore the maximum latency should be about a synchronization cycle plus the inquiry message timeout. The minimum latency should be close to the inquiry message timeout, when a failure occurs near the end of a synchronization cycle. Therefore, with a uniform distribution of failure times, the expected latency should be half of the synchronization cycle plus the inquiry message timeout.
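The reasoning above can be written compactly. Writing T_s for the synchronization cycle and T_o for the inquiry message timeout (symbols introduced here only for this summary), and assuming the failure instant is uniformly distributed within a cycle:

    L_{\min} \approx T_o, \qquad L_{\max} \approx T_s + T_o, \qquad E[L] \approx \frac{T_s}{2} + T_o

This is consistent with the normalized mean latency of about 0.51 reported in Table I when T_o is small relative to T_s.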
Switch success ratio: This metric is defined as the number of successful master switches divided by the number of switch attempts (or the number of active master failures). This metric is important in that it shows how much fault tolerance is provided by the distributed MapReduce architecture. Ideally, the ratio should be 100%, but it is hurt by network transmission problems such as packet loss. Over the 1000 master switches tested, the measurements show a success ratio of 97%, which indicates that the distributed solution is effective in providing fault tolerance to user jobs.

Network overhead: This metric measures the overhead of network traffic incurred by the formation of the master network, synchronization, and the active master switch. It should grow with the size of the master network, because more copies of a message have to be sent during synchronization and a master switch. Therefore, the measurement result is normalized by the number of master switches. As observed from Table I, the traffic is about 5 messages per master switch in our case of only two master nodes and three tasks associated with a job. With each message taking nearly 0.5 μs in Gbps networks, the network cost resulting from a switch in the distributed MapReduce solution is about 2.5 μs.

TABLE I. EVALUATION RESULTS

    Metric                                                  Result
    Switch latency (normalized to synchronization cycle)    0.51
    Switch success ratio                                    0.97
    Network overhead                                        0.5 μs
V. RELATED WORK

MapReduce is the programming model that was developed by Google [15] and that Hadoop uses to process data [6]. The datasets processed in Hadoop can be, and often are, much larger than any one computer can ever process [16]. MapReduce organizes how that data is processed over many computers (anywhere from a few to a few thousand) [16].

The Distributed Hash Table (DHT) was first proposed in the work on Chord [7] by MIT to support overlay peer-to-peer networks for content distribution, with some other DHT implementations available such as CAN [17], Pastry [18], and Tapestry [19]. Chord is a distributed lookup protocol that specializes in mapping keys to nodes [8]. Data location can be easily implemented on top of Chord by associating a key with each data item and storing the key/data pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. This is an especially good fit for MapReduce because MapReduce already maps key-value pairs to process data, and mappers and reducers are constantly being brought up and shut down [16]. Chord features load balancing because it acts as a distributed hash function and spreads keys evenly over the participating nodes. Chord is also decentralized; no node is of greater importance than any other node. Chord scales automatically, without any need for tuning to achieve success at scale.
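To illustrate the key-to-node mapping described above, the following is a minimal consistent-hashing sketch in the spirit of Chord. It is our own simplified example, not the Chord protocol itself or the implementation used in this paper: a single flat ring with linear successor lookup, no finger tables, and joins handled by simple insertion.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ConsistentHashRing {
        // Maps a position on the identifier circle to the node ID placed there.
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public void addNode(String nodeId) {      // e.g. the node's IP address
            ring.put(hash(nodeId), nodeId);
        }

        public void removeNode(String nodeId) {
            ring.remove(hash(nodeId));
        }

        /** Returns the node responsible for the key: its successor on the circle. */
        public String lookup(String key) {
            if (ring.isEmpty()) {
                throw new IllegalStateException("no nodes in the ring");
            }
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("SHA-1")
                                        .digest(s.getBytes(StandardCharsets.UTF_8));
                // Use the first 8 bytes of the SHA-1 digest as a 64-bit circle position.
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xffL);
                }
                return h;
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
    }

In the setting of this paper, the same idea could be applied by hashing a job identifier onto a ring of master node IDs: the successor node would serve as the active master and the following node(s) as standby masters; this is an assumption for illustration, not a statement of the paper's exact formation scheme.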
Some other fault-tolerance efforts have been proposed for cloud computing. Cassandra [20] is a peer-to-peer based solution proposed by Facebook engineers to address fault tolerance in distributed database management. It eliminates the SPOF problem in distributed data storage. Other efforts addressing data fault tolerance include work such as [21]. YARN [10] is proposed in MapReduce2 of Hadoop to address fault tolerance in HDFS. It solves the SPOF problem in an HDFS cluster. So far, no solution has been proposed to address the SPOF problem in job management as this work does.

VI. CONCLUSION

This paper presents a fault-tolerant MapReduce engine for cloud computing. The fault tolerance is enabled by a distributed solution based on the DHT algorithm. In the solution, a network of master nodes is formed to provide job management. A failed active master node is replaced by its next standby peer node for job management, and thereby the running of the job is maintained. The solution has been implemented in the Hadoop MapReduce1 engine and evaluated to provide high fault tolerance with low latency and networking cost. Our next step is to implement this solution in the current Hadoop MapReduce2 for more extensive performance evaluation and to release it as open source.
VII. ACKNOWLEDGEMENT

The authors would like to thank Gordon Pettey, who helped with the extensive source code implementation of the solution. Special thanks are given to the National Science Foundation for award #1041292, which supported Gordon's work on this project as an REU student. We would also like to thank our reviewers for their precious comments that made this work better.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010. [Online]. Available: http://doi.acm.org/10.1145/1721654.1721672
[2] P. Mell and T. Grance, "The NIST definition of cloud computing (draft)," NIST Special Publication, vol. 800, no. 145, p. 7, 2011.
[3] T. Velte, A. Velte, and R. Elsenpeter, Cloud Computing: A Practical Approach. McGraw-Hill, Inc., 2009.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] D. Borthakur, "The Hadoop distributed file system: Architecture and design," 2007.
[6] T. White, Hadoop: The Definitive Guide. O'Reilly, 2012.
[7] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: a scalable peer-to-peer lookup protocol for internet applications," IEEE/ACM Trans. Netw., vol. 11, no. 1, pp. 17–32, 2003.
[8] B. Cohen, "The BitTorrent protocol specification," 2008.
[9] D. Qiu and R. Srikant, "Modeling and performance analysis of BitTorrent-like peer-to-peer networks," ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 367–378, 2004.
[10] A. C. Murthy, C. Douglas, M. Konar, O. O'Malley, S. Radia, S. Agarwal, and V. KV, "Architecture of next generation Apache Hadoop MapReduce framework," Apache Hadoop, Tech. Rep., 2011.
[11] F. Dinu and T. Ng, "Understanding the effects and implications of compute node related failures in Hadoop," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2012, pp. 187–198.
[12] F. Dinu and T. E. Ng, "Analysis of Hadoop's performance under failures," Rice University, Tech. Rep., 2012.
[13] L. Karsten and K. Sven, "Open Chord." [Online]. Available: http://open-chord.sourceforge.net
[14] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash et al., "Apache Hadoop goes realtime at Facebook," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011, pp. 1071–1080.
[15] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[16] J. Lin and C. Dyer, "Data-intensive text processing with MapReduce," Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1–177, 2010.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, A Scalable Content-Addressable Network. ACM, 2001, vol. 31, no. 4.
[18] A. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems," in Middleware 2001. Springer, 2001, pp. 329–350.
[19] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, "Tapestry: A resilient global-scale overlay for service deployment," IEEE Journal on Selected Areas in Communications, vol. 22, no. 1, pp. 41–53, 2004.
[20] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.
[21] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, "Making cloud intermediate data fault-tolerant," in Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010, pp. 181–192.