Performance Comparison of Apache Hadoop and Apache Spark
ABSTRACT

The term 'Big Data' is a broad term used for data sets that are so enormous that traditional data processing applications find them hard to process. Both Apache Spark and Apache Hadoop are significant parts of the big data family. Some researchers view the two frameworks as rivals, but it is not that easy to compare them, as they do many things the same way, yet there are also areas where they work differently. Still, Apache Hadoop and Apache Spark are comparable on different parameters. This research intends to compare these two popular frameworks, figure out their strengths, weaknesses and unique characteristics, and try to answer whether Spark can replace Hadoop or not.

KEYWORDS

Big Data, Apache Hadoop, Apache Spark

1. INTRODUCTION

The term 'big data' is becoming quite popular in recent days, as it has created numerous opportunities in various domains such as business, medicine and other fields [12]. Big data refers to data that traditional data processing applications find hard to process. Big Data has three vital characteristics, namely large variety, huge volume and high velocity. However, many more characteristics, such as Veracity and Value, have been added to these three in the recent past. Various frameworks are available today to tame big data, such as Apache Hadoop, Apache Spark and Apache Storm. These frameworks follow the principle of parallel computing. The biggest challenge is to select the appropriate tool with respect to the nature of the data and the processing context. This research paper is written with the aim of comparing Apache Hadoop and Apache Spark based on various parameters and highlighting the similarities and differences between them.

2. PRELIMINARY KNOWLEDGE

2.1 Apache Hadoop Framework

Apache Hadoop is considered the standard for big data analysis, providing an abstraction over the challenges of distributed and parallel computing. Apache Hadoop is one of the popular frameworks designed to provide batch processing. Hadoop is composed of two core components: the Hadoop Distributed File System (HDFS) and MapReduce [1]. MapReduce is the kernel of Apache Hadoop and is called the Hadoop MapReduce Framework. A Hadoop cluster is a collection of machines running HDFS and MapReduce; the individual machines in the cluster are termed nodes. The number of nodes is directly proportional to performance, that is, the more nodes, the better the performance [5]. Hadoop works on the principle of distributed and parallel computing, as the data is processed on multiple machines at the same time.

2.2 Apache Hadoop MapReduce

The core components of the Hadoop ecosystem are MapReduce and a compatible file system termed the Hadoop Distributed File System. Back in the year 2004, Google devised the MapReduce programming model, which allows parallel processing of large amounts of data [6]. MapReduce consists of two phases, namely the Map phase and the Reduce phase.
A Map job is used to process the input data, which is in the form of a file residing in HDFS. Map transforms the input records into intermediate records [7]. The Hadoop framework first calls the setup method, which is called only once, followed by the map method, which is called once for each key/value pair, followed by the cleanup method.
By default, MapReduce uses TextInputFormat, in which the key is the byte offset of the line and the value is the text of the line. In a word count job, the key will be a LongWritable, the value will be Text, and the Context allows the mapper and reducer to interact with the rest of the Hadoop system.
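
To make this lifecycle concrete, below is a minimal sketch of a word count Mapper and Reducer. The class names and the whitespace tokenizer are illustrative choices, not taken from the paper's implementation; the input types follow the TextInputFormat convention described above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: input key is the byte offset (LongWritable), value is one line (Text).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // Called exactly once per map task, before any map() calls.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record: emit (word, 1) for each token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called exactly once per map task, after the last map() call.
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The framework guarantees that setup and cleanup bracket the per-record map calls, which is why per-task initialization belongs in setup rather than in map.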
Matei Zaharia et al. (2010), in their paper 'Spark: Cluster Computing with Working Sets', proposed a new framework called Spark (now called Apache Spark) that supports iterative processing while retaining the capabilities of MapReduce [2].

Parameter                        Apache Hadoop                           Apache Spark
Iterative Processing Support     Doesn't support iterative processing    Supports iterative processing [2]
Stream Processing Support        Doesn't support stream processing       Supports stream processing
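
Spark's iterative-processing advantage comes from keeping the working set cached in memory between passes, rather than re-reading it from storage for every job as a chained MapReduce pipeline would [2]. The following is a minimal sketch using Spark's Java API; the data, update rule and iteration count are purely illustrative.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Cache the working set once; every later pass reads it from memory.
        JavaRDD<Double> data =
                sc.parallelize(Arrays.asList(1.0, 2.5, 3.5, 4.0, 5.5)).cache();

        // Illustrative iterative refinement: each pass re-scans the cached
        // RDD with a threshold derived from the previous pass.
        double threshold = 0.0;
        for (int i = 0; i < 10; i++) {
            final double t = threshold;    // lambdas need an effectively final copy
            long above = data.filter(x -> x > t).count();
            threshold = t + 0.1 * above;   // illustrative update rule
        }
        System.out.println("Final threshold: " + threshold);
        sc.stop();
    }
}

Without the cache() call, each filter pass would recompute the RDD from its source, which is essentially the per-iteration cost a MapReduce job chain pays.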
5.1 Methodology

In this paper, a systematic evaluation of Apache Hadoop MapReduce has been done and its performance compared with Apache Spark. For this purpose, a word count job is solved using datasets of different sizes. The practical implementation has been done in Java for MapReduce and in Scala for Apache Spark on a single-node cluster, and the results are shown below.

5.2 Experimental Setup

Apache Spark Version Used: 2.3.1

Execution Time (in sec.)

Dataset Size    Apache Spark    Apache Hadoop
1 MB            1.2             4.3
3 MB            4.3             17.3
6 MB            7.0             30.4
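
The Spark side of the word count job described in 5.1 was implemented in Scala; purely for consistency with the earlier MapReduce sketch, an equivalent version using Spark's Java API is shown below. The HDFS paths are placeholders, not the paths used in the experiments.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file (placeholder path), split lines into words,
        // map each word to (word, 1), and sum the counts per word.
        JavaRDD<String> lines = sc.textFile("hdfs:///input/words.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///output/wordcounts");
        sc.stop();
    }
}

The transformations here are lazy; nothing executes until saveAsTextFile triggers the job, so the whole pipeline is planned and run as a single Spark job.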
6. CONCLUSION

The aim of this research is to present a performance comparison of two popular big data processing frameworks. Both frameworks were evaluated on a word count job. Apache Spark counts the occurrences of each word in a file faster than Apache Hadoop. Spark is the best option for stream processing, while Hadoop is preferable for batch processing. Spark is not there to replace Hadoop; both frameworks have their own application areas.
REFERENCES