Abstract— Big Data is in use every single day, from the adoption of the Internet through social networks, mobile devices, connected objects, videos, blogs and others. Real-time processing of Big Data has received growing attention, especially with the expansion of data in volume and complexity. In order to ensure reliable and fast real-time information processing, powerful tools are essential for the analysis and processing of Big Data. Standard MapReduce frameworks such as Hadoop MapReduce face some limitations for processing real-time data of various formats. In this paper, we highlight the implementation of the de-facto standard Hadoop MapReduce as well as the implementation of the framework Apache Spark. Thereafter, we conduct experimental simulations to analyze a real-time data stream using Spark and Hadoop. To further support our contribution, we introduce a comparison of the two implementations in terms of architecture and performance, with a discussion of the simulation results. The paper also discusses the drawbacks of using Hadoop for real-time processing.

Keywords- Big Data, Real-time processing, Real-time analysis, Hadoop, MapReduce, Spark, Twitter, HDFS

I. INTRODUCTION

Big Data refers to the increased volume of data, which makes it really difficult to store, process and analyze using classical processing tools. Most definitions characterize big data by 3 Vs [1]: Volume, Variety, and Velocity. These terms were originally introduced by Gartner to describe the elements of big data challenges.

Data Volume mainly refers to all types of data generated from different sources and continuously expanding over time. Today, storing and processing involves exabytes (10^18 bytes) or even zettabytes (10^21 bytes), whereas almost 10 years ago only megabytes (10^6 bytes) were stored on floppy disks. For example, the social network Twitter generated 7 terabytes of data every day in January 2013. The volume of tweets generated grew at increasingly high rates: in Twitter's short history, the platform went from processing 5,000 tweets per day in 2007 to 500 million tweets per day in 2013 [11].

Data Variety refers to the different types of data collected via sensors, smartphones, or social networks. Such data types include video, image, text, audio, and data logs, in either structured, semi-structured, or unstructured format.

Velocity refers to the speed of data transfer. The contents of data constantly change through the absorption of complementary data collections, the introduction of previously archived data or legacy collections, and streamed data arriving from multiple sources.

Back in 2004, Google invented the MapReduce [2] programming model for processing massive data. This model has allowed parallel and distributed computing over a very large amount of data contained in a cluster. It consists of a processing paradigm that executes data with partitioning and aggregation of intermediate results. Its main goal is to process data in parallel, where data splitting, distribution, synchronization and fault tolerance are handled automatically by the MapReduce framework [12].

Various MapReduce frameworks exist. They provide off-the-shelf high scalability and can significantly shorten the processing time for big data. For example, the open-source framework Hadoop has become widely used for processing big data. However, Hadoop has limitations in supporting real-time updates: most analytic updates and re-scores in Hadoop are completed by long-running batch jobs, which significantly delays presenting the analytic results. In some businesses, revenue loss can be measured in milliseconds, for which the notion of a batch job has no bearing. Hence the need for innovative techniques and methodologies able to provide real-time results; open-source frameworks such as Apache Spark address these needs.

This paper is structured as follows: Section II reviews the Hadoop MapReduce implementation, while Section III details the Spark implementation. The experimental evaluation of real-time data analysis with Hadoop and Spark is discussed in Section IV. Section V covers the drawbacks of using Hadoop, with an additional comparison of its functioning mechanisms with Spark in real-time processing. Finally, Section VI presents the concluding remarks.

II. HADOOP MAPREDUCE IMPLEMENTATION

Hadoop is an open-source framework that allows distributed processing of a large amount of data through a cluster using a simple programming model. It allows applications to work on multiple nodes for big data processing.

A. MapReduce

MapReduce is a programming model that allows parallel computing on a large amount of data using a large number of machines, called a Cluster or Grid (more data, more machines). MapReduce has two main functions:
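The two functions can be illustrated with the classic word-count example. The sketch below is a minimal, framework-free illustration in plain Scala (no Hadoop dependency; the object and method names are our own): the map phase emits (word, 1) pairs, and the reduce phase groups pairs by key and sums their values, exactly as the framework would do across the cluster.

```scala
object WordCountSketch {
  // Map phase: each input line is split into words, emitted as (word, 1) pairs.
  def map(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle + Reduce phase: pairs are grouped by key and the values are summed.
  def reduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  // Full job: map every line, then reduce the emitted intermediate pairs.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reduce(lines.flatMap(map))

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("big data is big", "data streams"))
    println(counts) // big -> 2, data -> 2, is -> 1, streams -> 1
  }
}
```

In a real Hadoop job these two functions run on different nodes and the grouping step (the shuffle) happens over the network; the data flow, however, is the same.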
A. Hadoop MapReduce
1) Proposed architecture
In this study, we developed an application in the Scala language for streaming data from Twitter in Apache Spark. The architecture of this study is shown in figure 3.
Figure 4. Workflow of processing data in Hadoop MapReduce
We imported the Hadoop package into our Eclipse project. Then we started counting the tweets every 60 seconds, as shown in the following workflow (figure 4).
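The per-60-second counting step can be sketched independently of Hadoop. Assuming each tweet carries an arrival timestamp, tweets are grouped into 60-second tumbling windows and each window is counted. A minimal plain-Scala illustration (timestamps in epoch seconds; all names are ours, not the Hadoop API):

```scala
object WindowCountSketch {
  // Group tweets into 60-second tumbling windows and count tweets per window.
  // Each tweet is (epochSeconds, text); the key is the window's start time.
  def countPerWindow(tweets: Seq[(Long, String)], windowSec: Long = 60L): Map[Long, Int] =
    tweets.groupBy { case (ts, _) => (ts / windowSec) * windowSec }
          .map { case (windowStart, inWindow) => (windowStart, inWindow.size) }

  def main(args: Array[String]): Unit = {
    val tweets = Seq((0L, "tweet a"), (30L, "tweet b"), (61L, "tweet c"))
    println(countPerWindow(tweets)) // window 0 -> 2 tweets, window 60 -> 1 tweet
  }
}
```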
2) Experimental setup
The following figure shows the MapReduce processing steps on the Twitter data and the time elapsed during this processing.
4) Experimental setup
In this experiment, we developed a program in the Scala language to show the reception rate of the Twitter data as well as the processing time of the batches. We then show how Spark processes Twitter data in a few seconds. Apache Spark provides a web interface that gives an overview of the running application, allowing us to monitor, view, and analyze the operating state of Apache Spark.
The following figure shows some tweets that we received in the Eclipse console window.
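The micro-batch model underlying these measurements, in which Spark Streaming cuts the received stream into small batches that are processed one after another, can be mimicked without a Spark dependency. The plain-Scala sketch below (illustrative names only, not the Spark API) splits incoming records into fixed-size batches and records per-batch statistics, analogous to the batch durations reported in the web UI:

```scala
object MicroBatchSketch {
  // Cut the incoming records into fixed-size micro-batches; for each batch,
  // run a toy computation (count tweets containing a hashtag) and measure
  // the processing time, as Spark's streaming UI reports per batch.
  // Result rows: (batchIndex, recordsInBatch, hashtagTweets, processingMs)
  def run(records: Seq[String], batchSize: Int): Seq[(Int, Int, Int, Long)] =
    records.grouped(batchSize).zipWithIndex.map { case (batch, i) =>
      val t0 = System.nanoTime()
      val hashtags = batch.count(_.contains("#"))
      val elapsedMs = (System.nanoTime() - t0) / 1000000
      (i, batch.size, hashtags, elapsedMs)
    }.toSeq

  def main(args: Array[String]): Unit = {
    val stats = run(Seq("#spark is fast", "hello", "#bigdata", "stream"), batchSize = 2)
    stats.foreach { case (i, n, h, ms) =>
      println(s"batch $i: $n records, $h with hashtags, processed in $ms ms")
    }
  }
}
```

In real Spark Streaming the batch boundary is a time interval rather than a record count, and processing is distributed over the cluster, but the batch-by-batch bookkeeping shown in the UI follows the same shape.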
Figure 12 shows the completed batches; it is noted that some batches have taken a long time to finish. To obtain more details on a batch and to learn the reason for the processing delay of these batches, it is enough to click on the Batch Time of a batch with a long processing time, as shown in figure 13. This figure shows the duration and the status as well as the jobs generated (in our case, 2 jobs were generated).
Spark jobs