
Performance Comparison of Apache Hadoop and Apache Spark

Amritpal Singh†
Lovely Professional University, Phagwara, Punjab
apsaggu@live.com

Aditya Khamparia
Lovely Professional University, Phagwara, Punjab
aditya.17862@lpu.co.in

Ashish Kr. Luhach
The PNG University of Technology, Papua New Guinea
ashishluhach@acm.org

ABSTRACT

The term 'Big Data' is a broad term used for data sets that are so enormous that traditional data processing applications find them hard to process. Both Apache Hadoop and Apache Spark are significant members of the big data family. Some researchers view the two frameworks as rivals, but they are not easy to compare directly: they do many of the same things, yet there are also areas where they work differently. Still, Apache Hadoop and Apache Spark are comparable on several parameters. This research compares these two popular frameworks, identifies their strengths, weaknesses and unique characteristics, and tries to answer whether Spark can replace Hadoop.

KEYWORDS

Big Data, Apache Hadoop, Apache Spark

1. INTRODUCTION

The term 'big data' has become quite popular in recent years, as it has created numerous opportunities in various domains such as business and medicine [12]. Big data refers to data sets that traditional data processing applications find hard to process. Big Data has three vital characteristics, namely large variety, huge volume and high velocity. However, several more characteristics, such as veracity and value, have been added to these three in the recent past. Various frameworks are available today to tame big data, such as Apache Hadoop, Apache Spark and Apache Storm. These frameworks follow the principle of parallel computing. The biggest challenge is to select the appropriate tool with respect to the nature of the data and the processing context. This research paper was written with the aim of comparing Apache Hadoop and Apache Spark on various parameters and highlighting the similarities and differences between them.

2. PRELIMINARY KNOWLEDGE

2.1 Apache Hadoop Framework

Apache Hadoop is considered the standard for big data analysis, providing an abstraction over the challenges of distributed and parallel computing. It is a popular framework designed for batch processing. Hadoop is composed of two core components: the Hadoop Distributed File System (HDFS) and MapReduce [1]. MapReduce is the kernel of Apache Hadoop and is called the Hadoop MapReduce framework. A Hadoop cluster is a collection of machines running HDFS and MapReduce; the individual machines in the cluster are termed nodes. The number of nodes is directly proportional to performance: the more nodes, the better the performance [5]. Hadoop works on the principle of distributed and parallel computing, as the data is processed on multiple machines at the same time.

2.2 Apache Hadoop MapReduce

The core components of the Hadoop ecosystem are MapReduce and a compatible file system, the Hadoop Distributed File System. Back in 2004, Google devised the MapReduce programming model, which allows parallel processing of large amounts of data [6]. MapReduce consists of two phases, namely the Map phase and the Reduce phase.
The Map job processes the input data, which is in the form of files residing in HDFS, and transforms the input records into intermediate records [7]. For each mapper, the Hadoop framework first calls the setup method exactly once, then calls the map method once for each key/value pair, and finally calls the cleanup method.
By default, MapReduce uses TextInputFormat, in which the key is the byte offset of a line and the value is the text of that line. In the word count job, the key type is therefore LongWritable and the value type is Text, and the Context object allows the mapper and reducer to interact with the rest of the Hadoop system.
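The paper does not list its mapper code; the following is a minimal, hedged sketch of a word count mapper following the setup/map/cleanup contract described above. Note that the paper's own MapReduce implementation was written in Java (Section 5.1); this sketch uses Scala against the same Hadoop API to keep one language across the examples, and the class and variable names are illustrative.

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Illustrative word count mapper: key = byte offset (LongWritable),
// value = one line of text (Text); emits (word, 1) intermediate records.
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  // Called once for each key/value pair, i.e. once per input line.
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.toLowerCase.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one) // intermediate record: (word, 1)
    }
  }
}
```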

In the Reduce phase, the aggregation of the data takes place: the reducer takes each intermediate key together with its list of values from the mappers and produces the final output, which resides in HDFS. This phase summarizes the whole dataset.
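A matching hedged sketch of the reducer, which sums the intermediate counts emitted for each word (again illustrative Scala over the Hadoop API, targeting the Scala 2.11 line listed in Section 5.2):

```scala
import java.lang.{Iterable => JIterable}
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConverters._

// Illustrative word count reducer: receives each word together with the
// list of 1s emitted by the mappers, writes (word, total) to HDFS.
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total)) // final record, e.g. ("spark", 42)
  }
}
```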

The following figure represents functioning of Map Reduce


framework.

Figure 1: Map Reduce Framework
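The paper leaves the job wiring to the figure. As a hedged sketch under the same assumptions as the mapper and reducer above (illustrative names; the paper's own implementation was in Java), a driver tying the two together and relying on the default TextInputFormat could look like this:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Illustrative driver: configures the word count job and submits it.
// TextInputFormat is the default, so input is read line by line with
// byte-offset keys, as described in Section 2.2.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[WordCountMapper])
    job.setMapperClass(classOf[WordCountMapper])
    job.setCombinerClass(classOf[WordCountReducer]) // optional local pre-aggregation
    job.setReducerClass(classOf[WordCountReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // input file in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // output directory in HDFS
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```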

2.3 Apache Spark

Apache Spark came into the picture to address the weaknesses of Apache Hadoop. It originated at the University of California, Berkeley's AMPLab, and its initial release came back in 2014 [2]. Apache Hadoop is good at batch processing; however, its performance degrades when it comes to iterative processing, because Hadoop stores its intermediate results on HDFS, and writing/reading data to and from secondary storage is costly in terms of CPU time. On the contrary, Apache Spark makes use of in-memory computation, which makes it faster than Hadoop [3]. Apache Spark supports both batch and stream processing. The Resilient Distributed Dataset (RDD) is the underlying data structure of Apache Spark; it is an immutable, partitioned collection of records [4]. An RDD can be created in the following ways (a short sketch follows the list):

• By reading data from HDFS
• By performing transformations on existing RDDs
• By applying the parallelize method to an existing collection
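As a brief, hedged illustration of these three creation routes (the HDFS path, application name and sample data are assumptions, not taken from the paper; the RDD API shown is that of the Spark 2.3 line listed in Section 5.2):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationExamples {
  def main(args: Array[String]): Unit = {
    // SparkContext is the entry point to Spark functionality (see below).
    val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 1. By reading data from HDFS (path is illustrative).
    val fromHdfs = sc.textFile("hdfs://localhost:9000/input/words.txt")

    // 2. By performing a transformation on an existing RDD.
    val transformed = fromHdfs.flatMap(_.split("\\s+"))

    // 3. By applying the parallelize method to an existing collection.
    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

    println(s"words: ${transformed.count()}, numbers: ${fromCollection.count()}")
    sc.stop()
  }
}
```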
Figure 2 shows the overview of an Apache Spark cluster. Here, the SparkContext represents the entry point to Apache Spark functionality, and the Driver Program is the gateway to the Spark shell [8]. The Cluster Manager's role is to take possession of resources on the Spark cluster and allocate them to a Spark job [10]. Another important component of the Apache Spark cluster is the executor, a distributed agent whose job is to execute tasks.

Figure 2: Overview of Apache Spark Cluster

Figure 3: Apache Spark Ecosystem
3. LITERATURE SURVEY

Md Mahbub Mishu (2019) predicted the consequences of drugs and the risk of developing diseases in the human body by applying big data analytics [15]. The author stated that data generated from clinical sources can be utilized to help patients, and that machine-learning algorithms such as classification and clustering can be used to analyze healthcare data. The author proposed a framework based on the C-means clustering technique, which can benefit both clinicians and patients.

Jinbae Lee et al. (2019) discussed the open source frameworks for big data processing, i.e. Apache Spark and Hadoop, in their paper [16]. The authors proposed a model for time estimation and resource minimization for Apache Hadoop and Apache Spark. The proposed model incorporates the likelihood of failure into its estimations to capture the characteristics of big data operations more precisely. The authors proved through experiments that the proposed schemes significantly improve the accuracy of resource provisioning and the scheduling of tasks in big data processing.

Haifeng Wang et al. (2019) address the concern of enhancing energy efficiency and reducing energy consumption when carrying out MapReduce jobs on large-scale distributed Hadoop systems, since running applications on a large cluster requires a large amount of energy [17]. The authors proposed a control model that can scale the CPU frequency dynamically as the workload changes, using a wavelet neural network to build the prediction model. A wavelet neural network is a neural network in which the standard activation function, such as the sigmoid function, is replaced by an activation function drawn from a wavelet basis.

Mohamed Abbas Hedjazi et al. (2018) compared the batch-processing framework Hadoop and real-time processing frameworks for the task of large-scale image classification [9]. The authors showed that Apache Hadoop is good for batch processing, but that its performance degrades when it comes to iterative processing. Hadoop is not an ideal choice when low latency is required; then Apache Spark comes into play. Apache Spark works very well for iterative processing, as it makes use of in-memory computation, so no intermediate disk reads/writes are required. Further, Apache Spark and Storm were compared with respect to classification. The authors stated that this analysis comes in very handy for researchers who want to work on distributed image processing using big data processing frameworks.
Kritwara Rattanaopas et al. (2017) proved through their experimental results that the performance of the word count job can be improved using Hadoop's compression codecs, such as snappy, bzip2, deflate and gzip [7]. The authors stated that compression not only saves storage space but also improves job performance. They applied data compression to the map output, with encouraging results on a raw-text input file. In a second scenario, the authors used an input file compressed with bzip2 together with uncompressed MapReduce output; the results showed no overhead on cluster performance while reducing the disk space used.

Akaash Vishal Hazarika et al. (2017) presented a comparison of two popular big data processing frameworks, Apache Hadoop and Apache Spark [6]. The authors compared the performance of the two frameworks on the word count problem and on logistic regression, and highlighted the important similarities and dissimilarities through suitable use cases. They concluded that Apache Spark outperformed Apache Hadoop in both cases, because Spark has the capability of in-memory processing, and fetching intermediate data from primary memory is less costly than fetching it from HDFS.

Matei Zaharia et al. (2010), in their paper 'Spark: Cluster Computing with Working Sets', proposed a new framework called Spark (now Apache Spark) that supports iterative processing while retaining the capabilities of MapReduce [2].

Jeffrey Dean and Sanjay Ghemawat (2004), in their paper 'MapReduce: Simplified Data Processing on Large Clusters', introduced the programming model named MapReduce [1]. The success of the model lies in its ease of use, even for programmers with little or no knowledge of parallel and distributed systems, as it abstracts away parallelization, load balancing and fault tolerance.

4. COMPARISON OF APACHE HADOOP AND APACHE SPARK

Feature | Apache Hadoop | Apache Spark
Data processing engine | At the core, Hadoop is a batch processing engine [14] | At the core, Spark is a batch processing engine
Language support | Hadoop primarily supports Java | Spark supports Java, Python, Scala and R
Language developed in | Hadoop is developed in Java | Spark is developed in Scala
Processing speed | Hadoop processes data slower than Spark | Spark processes data up to 100 times faster than Hadoop
Iterative processing support | Does not support iterative processing | Supports iterative processing [2]
Stream processing support | Does not support stream processing | Supports stream processing
Machine learning support | Hadoop uses Mahout for processing data | Spark has its own machine learning library, MLlib [11]
Fault tolerance | Hadoop is highly fault tolerant | Spark is fault tolerant too
Data storage | Data is stored on disk for processing | Data is stored in main memory
Security | Hadoop has better security features than Spark [13] | Spark's security is in its early stage, offering only password authentication

Table 1: Comparison of Apache Hadoop and Apache Spark
5. PERFORMANCE EVALUATION

5.1 Methodology

In this paper, a systematic evaluation of Apache Hadoop MapReduce is carried out and its performance is compared with that of Apache Spark. For this purpose, the word count job is solved using datasets of different sizes. The practical implementation was done in Java for MapReduce and in Scala for Apache Spark on a single-node cluster, and the results are shown below.
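The paper does not reproduce its Scala source; the following is a hedged sketch of what such a word count on Spark's RDD API could look like. The HDFS paths and application name are assumptions, not taken from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative Spark word count: reads a text file from HDFS,
// counts the occurrence of each word, and writes the result back.
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("hdfs://localhost:9000/input/dataset.txt")
      .flatMap(_.split("\\s+"))   // split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))     // emit (word, 1) pairs
      .reduceByKey(_ + _)         // sum the counts for each word

    counts.saveAsTextFile("hdfs://localhost:9000/output/word-counts")
    sc.stop()
  }
}
```

Here reduceByKey aggregates the intermediate pairs in memory across partitions, which is the property to which the results below attribute Spark's advantage over Hadoop's HDFS-backed shuffle.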
5.2 Experimental Setup

Apache Spark version used: 2.3.1
Apache Hadoop version used: 3.1.1
Processor: Intel® Core™ i3-5010U CPU @ 2.10GHz × 4
OS: Ubuntu 18.04.1 LTS
OS type: 64-bit
RAM: 4 GB
HDD: 1 TB
Java version used: 1.8
Scala version used: 2.11.8
Eclipse version used: 4.8.0
Hadoop daemons running on the machine: Name Node, Job Tracker, Data Node, Task Tracker
Spark daemons running on the machine: Master and Workers

Table 2: Experimental Setup Details

5.3 Results

The purpose of this study is to compare the performance of Hadoop and Spark for counting the occurrence of each word in a file. The tests were conducted on various datasets ranging in size from approximately 1 MB to approximately 300 MB. The datasets reside on HDFS.

Execution time (in seconds):

Dataset Size | Apache Spark | Apache Hadoop
1 MB | 1.2 | 4.3
3 MB | 4.3 | 17.3
6 MB | 7.0 | 30.4
15 MB | 9.4 | 32.0
50 MB | 13.8 | 33.4
70 MB | 14.6 | 35.8
100 MB | 17.8 | 50.9
300 MB | 55.2 | 130.2

Table 3: Experimental Results (execution time comparison)
Figure 4: Execution Time for Word Count Job in Hadoop and Spark

Figure 4 clearly shows that Apache Spark outperforms Apache Hadoop: the execution time of Hadoop is far greater than that of Spark irrespective of the dataset size. For example, on the 300 MB dataset Spark finishes in 55.2 s against Hadoop's 130.2 s, a speedup of roughly 2.4×, and on the 3 MB dataset the speedup is about 4× (4.3 s versus 17.3 s). Apache Spark counts the occurrence of each word in a file faster than Apache Hadoop because it makes use of in-memory computation, whereas Hadoop looks for intermediate results in HDFS.

6. CONCLUSION
The aim of this research was to present a performance comparison of two popular big data processing frameworks. Both frameworks were evaluated on the word count job, and Apache Spark counted the occurrence of each word in a file faster than Apache Hadoop. Spark is the better option for stream processing, while Hadoop is preferable for batch processing. Spark is not there to replace Hadoop; both frameworks have their own application areas.

REFERENCES

[1] Dean, J., Ghemawat, S. (2004): MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, pp. 137-150. San Francisco, CA.
[2] Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I. (2010): Spark: Cluster Computing with Working Sets. HotCloud'10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 1-7. Boston, MA.
[3] Wijayanto, A., Winarko, E. (2016): Implementation of Multi-criteria Collaborative Filtering on Cluster Using Apache Spark. 2nd International Conference on Science and Technology-Computer (ICST), Yogyakarta, Indonesia.
[4] Akil, B., Zhou, Y., Rohm, U. (2017): On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science. IEEE International Conference on Big Data.
[5] Son, S., Gil, M., Moon, Y. (2017): Anomaly Detection for Big Log Data Using a Hadoop Ecosystem.
[6] Hazarika, A., Jain, E., Ram, G. (2017): Performance Comparison of Hadoop and Spark Engine. International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India.
[7] Rattanaopas, K., Kaewkeeree, S. (2017): Improving Hadoop MapReduce Performance with Data Compression: A Study Using Wordcount Job. International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket, Thailand.
[8] Aziz, K., Zaidouni, D., Bellafkih, M. (2018): Real-Time Data Analysis Using Spark and Hadoop.
[9] Hedjazi, M., Kourbane, I., Genc, Y., Ali, B. (2018): A Comparison of Hadoop, Spark and Storm for the Task of Large-Scale Image Classification. Signal Processing and Communications Applications Conference (SIU), Izmir, Turkey.
[10] Saqqa, S., Nayamat, G., Awajan, A. (2018): A Large-Scale Sentiment Data Classification for Online Reviews Under Apache Spark. Procedia Computer Science. doi: 10.1016/j.procs.2018.10.166
[11] Belouch, M., Hadaj, S., Idhammad, M. (2018): Performance Evaluation of Intrusion Detection Based on Machine Learning Using Apache Spark. Procedia Computer Science. doi: 10.1016/j.procs.2018.01.091
[12] Peng, Z. (2019): Stocks Analysis and Prediction Using Big Data Analytics. International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), pp. 309-312.
[13] Shabestari, F., Rahmani, A., Navimipour, N., Jabbehdari, S. (2019): A Taxonomy of Software-Based and Hardware-Based Approaches for Energy Efficiency Management in the Hadoop. Journal of Network and Computer Applications. doi: 10.1016/j.jnca.2018.11.007
[14] Glushkova, D., Jovanovic, P., Abello, A. (2019): MapReduce Performance Model for Hadoop 2.x. Information Systems. doi: 10.1016/j.is.2017.11.006
[15] Mishu, Md. (2019): A Patient Oriented Framework Using Big Data & C-means Clustering for Biomedical Engineering Applications. International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), pp. 113-115.
[16] Lee, J., Kim, B., Chung, J. (2019): Time Estimation and Resource Minimization Scheme for Apache Spark and Hadoop Big Data Systems With Failures, pp. 9658-9666.
[17] Wang, H., Cao, Y. (2019): An Energy Efficiency Optimization and Control Model for Hadoop Clusters. doi: 10.1109/ACCESS.2019.2907018
