
4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP'2018), March 21-24, 2018, Sousse, Tunisia

A Comparison of Big Remote Sensing Data Processing with Hadoop MapReduce and Spark

I. Chebbi, RIADI Laboratory, University of Manouba, Tunisia, and LIASD Laboratory, University of Paris 8, France, ichebbi88@gmail.com
W. Boulila, RIADI Laboratory, University of Manouba, Tunisia, wadii.boulila@riadi.rnu.tn
N. Mellouli, LIASD Laboratory, University of Paris 8, France, n.mellouli@iut.univ-paris8.fr
M. Lamolle, LIASD Laboratory, University of Paris 8, France, m.lamolle@iut.univ-paris8.fr
I. R. Farah, RIADI Laboratory, University of Manouba, Tunisia, riadh.farah@ensi.rnu.tn

Abstract—The continuous generation of huge amounts of remote sensing (RS) data is becoming a challenging task for researchers due to the 4 Vs characterizing this type of data (volume, variety, velocity and veracity). Many platforms have been proposed to deal with big data in the RS field. This paper focuses on the comparison of two well-known platforms for big RS data, namely Hadoop and Spark. We start by describing the two platforms. The first, Hadoop, is designed for processing enormous unstructured data sets in a distributed computing environment. It is composed of two basic elements: 1) the Hadoop Distributed File System (HDFS) for storage, and 2) MapReduce and YARN for parallel processing, job scheduling and the analysis of big RS data. The second platform, Spark, is composed of a set of libraries and uses the Resilient Distributed Dataset (RDD) to overcome computational complexity. The last part of this paper is devoted to a comparison between the two platforms.

Index Terms—Big Data, Architectures, Hadoop, Spark, Remote Sensing Image

I. INTRODUCTION

Nowadays, a huge amount of earth observation data is delivered by RS sensors. This data is used in various applications in several domains, and many researchers refer to data coming from RS as big data [1]. When the complexity of a huge amount of data (big data) is coupled with the nature of RS data, the problem becomes even more challenging. The complexity of big data comes from its 4 Vs (volume, velocity, variety and veracity) and from many other characteristics of this data [2], whereas the complexity of RS data refers to the multiplicity of image resolutions (spatial, spectral and temporal). Besides, this latter type of complexity increases especially when RS metadata is organized into complex data structures [3]. Thus, the main challenge is how to deal with these 4 Vs in RS big data.

Let us start with the first V: Volume. The volume of RS data is increasing by the hour and by the minute; in recent years we have been moving rapidly from terabytes to exabytes. For example, in 2013, EOSDIS (NASA's Earth Observing System Data and Information System) recorded an RS data volume of 7.5 petabytes, with 1.5 million users. The second V is Velocity. It refers to generating, analysing and interpreting the rapid growth of RS data [4]. For instance, the RS data archives of EOSDIS grow by 4 TB per day, distributed over more than 600 million data files. The third V concerns Variety. RS data can be multisource, multitemporal, multispectral or even multiresolution. Multisource means that RS images can be taken from multiple sources such as RADAR, LiDAR, optical sensors, etc. Multitemporal means that RS images can be scenes acquired at multiple dates. Multiresolution means different resolutions (spatial or spectral) [5].

Furthermore, understanding big RS data [6] consists in knowing its three facets: quality, methodologies and applications.
• "Quality": RS data are acquired from various sensors. The delivered data are complex, heterogeneous and high-dimensional. Therefore, remote sensing data should be analyzed, classified and preprocessed, and may be combined with other traditional data, in order to obtain better knowledge.
• "Methodologies": sets of techniques, principles and methods allowing information to be processed and extracted from big RS data. Methodologies can be composed of a set of tasks such as preprocessing, processing, dimensionality reduction, deployment, visualization, analysis and interpretation. Methodology is important in RS big data since many constraints interfere, such as uncertainty, processing scalability, computational times, etc.
• "Applications": play an important role in the choice of the two previous facets, since RS data has many application fields. For example, a classification problem on low-resolution images does not require the same data quality and the same methodologies as a prediction problem on very high-resolution images. Moreover, processing multispectral images is not similar to processing hyperspectral images.

The difficulty in processing big RS data comes not only from the large volume of this data but also from the many tasks that characterize RS data: data acquisition, preprocessing, storage, processing, analysis and interpretation [7]. In the literature, many frameworks have been proposed to deal with these problems. Among these frameworks, we consider two popular tools named Hadoop and Spark.

The rest of the paper is organized as follows. Section II presents the Hadoop framework and its components, with a focus on the map and reduce phases. Section III describes Apache Spark and its components. Section IV compares the two platforms, and Section V concludes the paper.

II. HADOOP MAPREDUCE

Hadoop MapReduce [8] is an open source programming framework used for processing the large structured and unstructured data sets that are stored in HDFS (Hadoop Distributed File System) in a distributed computing environment. The current Apache Hadoop ecosystem is based on several modules, which include the Hadoop kernel, MapReduce, HDFS, YARN, Apache Hive, ZooKeeper and many others.

Apache Hadoop is basically composed of two core components, MapReduce and HDFS [9]. A brief description of these two components is given below.
• MapReduce is a programming model built from models found in the fields of functional programming and distributed computing [10]. Tasks in MapReduce are broken down into three parts: map, reduce and driver. Data are organized in pairs of key and value: mappers emit key/value pairs, and reducers receive them, work on them and produce the final result. Applications that use MapReduce read the data with the map function (in the form of key/value pairs) and produce an output that is also in the form of key/value pairs; this type of data fits very well into a distributed environment. Then the reduce function takes the generated pairs and produces the final results. Pairs are generally sorted by key when they enter the reduce function, and the reducer logic then works on each key group; in a word count, for instance, it sums up the values for each key and produces the final result (see the sketch after this list). Figure 1 describes the functional representation of the map and reduce functions.
• HDFS is responsible for the storage of files. It was designed and developed to handle large files efficiently [11]. It is a distributed file system designed to work on a cluster. The main goal of using HDFS is to facilitate the storage of large files by splitting them into blocks and distributing them redundantly across multiple nodes [9]. A minimal storage sketch is given at the end of this section.

Fig. 1: Map and Reduce Phases
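To make the map, shuffle and reduce phases concrete, here is a minimal word-count sketch of the programming model using plain Scala collections. It only illustrates the model described above; a real Hadoop job would implement Mapper and Reducer classes in Java and run distributed over HDFS blocks, and all names below are our own.

```scala
object MapReduceSketch {
  // Map phase: emit a (key, value) pair, here (word, 1), for every word.
  def mapper(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq.map(w => (w, 1))

  // Reduce phase: receive all values emitted for one key and fold them
  // into the final result, here the total count of the word.
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("big remote sensing data", "big data")
    // Shuffle/sort step: group the emitted pairs by key before reducing.
    val grouped = input.flatMap(mapper).groupBy(_._1)
    val counts = grouped.map { case (w, pairs) => reducer(w, pairs.map(_._2)) }
    counts.foreach(println) // (big,2), (data,2), (remote,1), (sensing,1)
  }
}
```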
Hadoop MapReduce stores and processes data in a distributed architecture. To achieve this goal, Hadoop implements a master/slave model: the namenode and jobtracker daemons are master daemons, whereas the datanode and tasktracker daemons are slave daemons (Fig. 2). Apache Hadoop thus consists of the following daemons:
• Namenode
• Secondary namenode
• Jobtracker
• Datanode
• Tasktracker

Fig. 2: Master-Slave Architecture in Hadoop
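To complement the description of HDFS given above, the following sketch writes a file into HDFS and reads it back through Hadoop's FileSystem API. The use of Scala, the namenode URI and the paths are illustrative assumptions, not taken from the paper.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // On a real cluster this URI comes from core-site.xml; host and
    // port here are placeholders.
    conf.set("fs.defaultFS", "hdfs://namenode:9000")
    val fs = FileSystem.get(conf)

    // HDFS transparently splits large files into blocks and replicates
    // each block across several datanodes.
    val out = fs.create(new Path("/rs/scene-001.csv"))
    out.writeBytes("band1,band2,band3\n")
    out.close()

    val in = fs.open(new Path("/rs/scene-001.csv"))
    println(scala.io.Source.fromInputStream(in).mkString)
    in.close()
    fs.close()
  }
}
```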

III. SPARK

Apache Spark is a powerful open source platform for big data processing. It is considered a multipurpose platform characterized by its flexibility, scalability and speed, and an open source parallel computing framework that keeps the advantages of MapReduce. Spark is intended to run on top of Hadoop and can be an alternative to the traditional batch map and reduce model [10]. Moreover, Spark is well adapted to iterative processes, fast queries and real-time data processing [12].

Spark integrates four main libraries:
• Spark SQL, for querying large and structured data (a usage sketch follows this list).
• MLlib, which contains the main learning algorithms and statistical methods.
• GraphX, for processing graphs and networks.
• Spark Streaming, to process streaming data.
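As an illustration of the first library, the sketch below registers a small DataFrame of hypothetical RS scene metadata as a temporary view and queries it with Spark SQL; the schema, values and application name are our own assumptions.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rs-sql-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical metadata describing RS scenes.
    val scenes = Seq(("S1", "optical", 10.0), ("S2", "radar", 30.0))
      .toDF("scene", "sensor", "resolution_m")
    scenes.createOrReplaceTempView("scenes")

    // Structured querying with plain SQL.
    spark.sql("SELECT sensor, COUNT(*) AS n FROM scenes GROUP BY sensor").show()

    spark.stop()
  }
}
```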
Apache Spark is characterized by a set of efficient machine learning algorithms and enhanced linear algebra libraries [12], [13]. An optimized engine is added in Spark to support general execution graphs [13]. A wide range of workloads is covered by Spark, including interactive queries, batch applications, iterative algorithms and streaming. Also, by using the Spark framework, the management burden of maintaining separate tools is reduced. Indeed, Spark integrates the concept of the RDD (Resilient Distributed Dataset): schematically, each data partition remains in memory on its computing server between two iterations, while the principles of fault tolerance are still guaranteed. Spark commands can be written in Java and Scala, and some of them in Python. Figure 3 depicts the architecture of Apache Spark.

Fig. 3: Apache Spark

RDDs are the core of Spark's functionality. They are sets of records of a specific type, partitioned and distributed across multiple nodes of the cluster. If a node is affected by a hardware or network failure, the resilient table is rebuilt automatically on other nodes and the task is completed. The main property of Spark RDDs is the ability to store them in the memory of each node [14]. This saves a lot of disk access, which is the main bottleneck, in terms of computing time, when running iterative algorithms.
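The sketch below (our own minimal example in Spark's Scala API) shows the RDD behaviour just described: a partitioned data set is kept in executor memory with cache(), so the two subsequent actions reuse the in-memory partitions instead of recomputing them, and a lost partition would be rebuilt from its lineage.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD split into 4 partitions, distributed across the workers.
    val pixels = sc.parallelize(1 to 1000000, numSlices = 4)

    // cache() keeps each partition in memory between iterations.
    val normalized = pixels.map(_ / 255.0).cache()

    // Both actions reuse the cached partitions, avoiding disk access.
    println(normalized.sum())
    println(normalized.max())

    spark.stop()
  }
}
```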
Another specificity of Spark is broadcast variables. These variables are read-only; they are defined from the master node and are known and kept in memory by all the other nodes. An accumulator is a special case of a broadcast variable that is not read-only: each local version can be incremented by each node, while the master node has access to the global accumulation.
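The following sketch (again our own illustration) uses a broadcast variable to ship a read-only lookup table to all nodes once, and an accumulator to count unknown records across tasks; only the driver reads the accumulated total.

```scala
import org.apache.spark.sql.SparkSession

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-vars").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast variable: read-only, kept in memory on every node.
    val bandNames = sc.broadcast(Map(1 -> "red", 2 -> "nir"))

    // Accumulator: each task increments its local version; the master
    // (driver) has access to the global accumulation.
    val unknownBands = sc.longAccumulator("unknownBands")

    val bands = sc.parallelize(Seq(1, 2, 7, 2))
    val labels = bands.map { b =>
      if (!bandNames.value.contains(b)) unknownBands.add(1)
      bandNames.value.getOrElse(b, "unknown")
    }

    labels.collect().foreach(println)
    println(s"unknown bands: ${unknownBands.value}")

    spark.stop()
  }
}
```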
In the rest of this section we detail MLlib, a component of Spark that has been developed into a tool supporting many common machine learning algorithms, including deep learning [15]. Our description focuses on its types of data (labeled point, rating, model) and on its main features:
• Types of data: the use of the RDD at the core of Spark's computation efficiency makes MLlib dependent on very compelling data structures. Map functions are used to create objects of these types.
• Labeled point: this type is specific to learning algorithms and associates a "label" (a real number) with a feature vector. The "label" is either the value of the quantitative variable Y to be modeled in regression, or a class code (0.0, 1.0, ...) in supervised classification or discrimination (see the sketch after this list).
• Rating: the type of data (e.g., the rating of an item by a customer) specific to recommender systems and therefore to the matrix factorization algorithm (ALS).
• Model: the model classes are the results of learning algorithms; a model exposes a predict() function to apply the model to a new observation or to an RDD of new observations.
• Supported methods and utilities: MLlib supplies fast, distributed implementations of common learning algorithms, and it provides low-level primitives and basic utilities for convex optimization, distributed linear algebra, statistical analysis and feature extraction. It also supports several I/O formats.
• Algorithm optimization: MLlib includes many optimizations to support efficient distributed learning and prediction. For example, the ALS algorithm for recommendation makes careful use of blocking to reduce JVM garbage collection overhead and to leverage high-level linear algebra operations.
• Pipeline API: MLlib provides native support for the diverse set of functionality required for pipeline construction, such as a sequence of data preprocessing, feature extraction, model fitting and validation steps.
• Spark integration: MLlib benefits from the other components of Spark. At the lowest level, Spark core provides an engine with more than 80 operators for transforming data; MLlib also leverages the other high-level libraries packaged with Spark. Spark SQL and GraphX are integrated in Spark as well, and their improvements lead to MLlib improvements.
• Documentation: MLlib has a user guide which provides good documentation, describing its methods and utilities with detailed code examples.
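To ground the labeled point and model types, here is a minimal sketch using MLlib's RDD-based Scala API; the choice of logistic regression and the feature values are our own assumptions, and the Rating type is used analogously with ALS for recommendation.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

object MllibTypesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-types").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // LabeledPoint: a class code (0.0 or 1.0) plus a feature vector.
    val train = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.1, 0.9)),
      LabeledPoint(1.0, Vectors.dense(0.8, 0.2))
    ))

    // A Model object is the result of the learning algorithm and
    // exposes predict() for new observations.
    val model = new LogisticRegressionWithLBFGS().run(train)
    println(model.predict(Vectors.dense(0.7, 0.3))) // expected: 1.0

    spark.stop()
  }
}
```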
IV. DIFFERENCES BETWEEN HADOOP MAPREDUCE AND APACHE SPARK

This comparison (Table I) describes many differences between Apache Spark and Hadoop MapReduce. However, we must specify that the two are compatible with each other, that they integrate with all the data sources and file formats supported by each of them, and that both are scalable.

V. CONCLUSION

In this paper we described two frameworks for processing big RS data: Hadoop and Spark. A comparison between these two frameworks was made, focusing on their architectures and especially on the libraries, phases and tasks that are specialized for the development of machine learning algorithms. This will be very useful for our future work, which consists in integrating deep learning libraries with Apache Spark and TensorFlow. The latter is a framework developed by Google, based on neural networks, for numerical computations.
TABLE I: A comparison between Spark and Hadoop MapReduce

Storage
  Spark: in-memory storing and processing of data.
  Hadoop MapReduce: on-disk storing of data.

Data processing
  Spark: hybrid processing (batch processing and stream processing).
  Hadoop MapReduce: batch processing only.

Real-time processing
  Spark: can process real-time data from real-time event streams.
  Hadoop MapReduce: fails when it comes to real-time data processing.

Code writing
  Spark: code is compact, thanks to the APIs of Scala, Python, Java and Spark SQL.
  Hadoop MapReduce: code is complex and lengthy.

Fault tolerance
  Spark: uses RDDs, which rebuild a lost partition through the lineage information they already have.
  Hadoop MapReduce: achieves fault tolerance through replication, using the TaskTracker and JobTracker.

Execution time
  Spark: can be up to 100 times faster than MapReduce.
  Hadoop MapReduce: slower than Spark.

Security
  Spark: security is currently in its infancy, offering only authentication support through a shared secret (password authentication).
  Hadoop MapReduce: has better security features than Spark; it supports Kerberos authentication, which is a good security feature but difficult to manage.

Cost
  Spark: uses large amounts of RAM to run everything in memory, and RAM is more expensive than hard disks.
  Hadoop MapReduce: is disk-bound, so it saves the cost of buying expensive RAM.

Machine learning
  Spark: comes with the MLlib library, which makes things simple and allows algorithms to perform much better than traditional MapReduce programs.
  Hadoop MapReduce: uses iterations over the same data; each iterative step involves a map-reduce sequence, and it does not support all the algorithms.

REFERENCES

[1] Yan Ma, Haiping Wu, Lizhe Wang, Bormin Huang, Rajiv Ranjan, Albert Y. Zomaya, Wei Jie, "Remote sensing big data computing: Challenges and opportunities", Future Generation Computer Systems, 51: 47-60, 2015.
[2] H. S. Bhosale, D. P. Gadekar, "A Review Paper on Big Data and Hadoop", International Journal of Scientific and Research Publications, vol. 4, issue 10, October 2014.
[3] Michael R. Evans, Dev Oliver, Xun Zhou, Shashi Shekhar, "Spatial Big Data: Case Studies on Volume, Velocity, and Variety", in Big Data: Techniques and Technologies in Geoinformatics, ISBN 978-1-46-658651-2, CRC Press, 2014.
[4] Ahmed Eldawy, Mohamed F. Mokbel, "The Era of Big Spatial Data: Challenges and Opportunities", MDM (2), 7-10, 2015.
[5] Wadii Boulila, Imed Riadh Farah, Amir Hussain, "A novel decision support system for the interpretation of remote sensing big data", Earth Science Informatics, vol. 11, issue 1, pp. 31-45, 2018.
[6] I. Chebbi, W. Boulila, I. R. Farah, "Big Data: Concepts, Challenges and Applications", Computational Collective Intelligence, ICCCI (2), pp. 638-647, 2015.
[7] Ahmed Oussous, Fatima-Zahra Benjelloun, Ayoub Ait Lahcen, Samir Belfkih, "Big Data Technologies: A survey", Journal of King Saud University - Computer and Information Sciences, 4 June 2017.
[8] Manika Manwal, Amit Gupta, "Big data and hadoop: A technological survey", Emerging Trends in Computing and Communication Technologies (ICETCCT), February 2018.
[9] I. Chebbi, W. Boulila, I. R. Farah, "Improvement of satellite image classification: Approach based on Hadoop/MapReduce", 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2016.
[10] V. A. Ayma, R. S. Ferreira, P. N. Happ, D. A. B. Oliveira, G. A. O. P. Costa, R. Q. Feitosa, A. Plaza, P. Gamba, "On the architecture of a big data classification tool based on a map reduce approach for hyperspectral image analysis", IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 1508-1511, 2015.
[11] Yanfeng Lyu, Xunli Fan, Kun Liu, "An Optimized Strategy for Small Files Storing and Accessing in HDFS", IEEE International Conference on Computational Science and Engineering (CSE), vol. 1, 611-614, 2017.
[12] W. Huang, L. Meng, D. Zhang, W. Zhang, "In-memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1-17, 2016.
[13] Akhmedov Khumoyun, Yun Cui, Lee Hanku, "Spark based distributed Deep Learning framework for Big Data applications", International Conference on Information Science and Communications Technologies (ICISCT), pp. 1-5, 2016.
[14] Shyam R., Bharathi Ganesh H. B., Sachin Kumar S., Prabaharan Poornachandran, "Apache Spark a Big Data Analytics Platform for Smart Grid", Procedia Technology (21), 171-178, 2015.
[15] Hameeza Ahmed, Muhammad Ali Ismail, Muhammad Faraz Hyder, Syed Muhammad Sheraz, Nida Fouq, "Performance Comparison of Spark Clusters Configured Conventionally and a Cloud Service", Procedia Computer Science (82), 99-106, 2016.

