Spark 4-2 Documentation
INTRODUCTION
Every minute of every day, data has been expanding so rapidly over the past few years that traditional database storage can no longer process it. Hadoop was therefore introduced as one of the tools of "Big Data" that stores huge amounts of data and analyses them. However, it cannot analyze and present the data in real time.
Big data is a field that treats ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt with
by traditional data-processing application software. Big Data is used to describe a
collection of data that is huge in size and yet growing exponentially with time.
In short such data is so large and complex that none of the traditional data
management tools are able to store it or process it efficiently.
(i) Volume – The name Big Data itself is related to a size which is enormous. Size
of data plays a very crucial role in determining value out of data. Also, whether a
particular data can actually be considered as a Big Data or not, is dependent upon
the volume of data. Hence, 'Volume' is one characteristic which needs to be
considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety. Variety refers to
heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data
considered by most of the applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in
the analysis applications. This variety of unstructured data poses certain issues for
storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet the demands, determines real
potential in the data.
→Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
→Today's market is flooded with an array of Big Data tools. They bring cost efficiency and better time management to data analytics tasks.
There are many big data tools like Hadoop, Spark, Cassandra, Storm, etc., but in this project we use Hadoop and Spark for cluster computation concepts.
1.3.1 Hadoop:
Apache Hadoop offers a scalable, flexible and reliable distributed computing big
data framework for a cluster of systems with storage capacity.
Hadoop follows a Master Slave architecture for the transformation and analysis of
large datasets using Hadoop MapReduce paradigm. The important Hadoop
components that play a vital role in the Hadoop architecture are -
A] Hadoop Distributed File System (HDFS) – Patterned after the UNIX file system
B] Hadoop MapReduce
Fig 2: Hadoop overview
The Hadoop Distributed File System was developed using a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines, in a redundant fashion, to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
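Assuming the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable, so the numbers below are only illustrative), the storage arithmetic of this redundant layout can be sketched as:

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how HDFS splits and replicates a file across the cluster."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # logical splits of the file
    raw_copies = num_blocks * replication                 # physical block copies stored
    total_stored_mb = file_size_mb * replication          # approximate bytes on disk
    return num_blocks, raw_copies, total_stored_mb

# A 1 GB (1024 MB) file becomes 8 blocks, 24 physical copies, 3072 MB on disk.
print(hdfs_storage(1024))  # (8, 24, 3072)
```

This is why losing a single machine does not lose data: every block has two other copies elsewhere in the cluster.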
HDFS supports the rapid transfer of data between compute nodes. At its outset, it
was closely coupled with MapReduce, a programmatic framework for data
processing.
Features of HDFS:
1) It is suitable for distributed storage and processing.
2) Hadoop provides a command interface to interact with HDFS.
3) The built-in servers of namenode and datanode help users to easily check the status of the cluster.
4) It provides streaming access to file system data, along with file permissions and authentication.
5) HDFS follows the master-slave architecture and it has the following elements.
Namenode:
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The system having the namenode acts as the master server and it does the following tasks:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming, closing, and opening files and directories.
Datanode:
The datanode is commodity hardware having the GNU/Linux operating system and the datanode software. For every node in the cluster, there is a datanode. Datanodes perform read-write operations on the file system as per client request, and also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Goals of HDFS:
Fault detection and recovery: Since HDFS includes a large number of commodity
hardware, failure of components is frequent. Therefore HDFS should have
mechanisms for quick and automatic fault detection and recovery.
Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
B] Hadoop MapReduce:
MapReduce is mainly used for parallel processing of large sets of data stored in a Hadoop cluster. It was originally designed by Google to provide parallelism, data distribution and fault tolerance. MR processes data in the form of key-value pairs. A key-value (KV) pair is a mapping between two linked data items - a key and its value.
The key (K) acts as an identifier for the value. An example of a key-value (KV) pair is one where the key is a node id and the value is its properties, including neighbor nodes, predecessor node, etc. The MR API provides features such as batch processing, parallel processing of huge amounts of data, and high availability. MR comes into the picture for processing large sets of data. Programmers write MR applications suited to their business scenarios; they have to understand the MR work flow, and according to that flow, applications are developed and deployed across Hadoop clusters. Hadoop is built on Java APIs and provides MR APIs that deal with parallel computing across nodes.
The MR work flow goes through different phases and the result is stored in HDFS with replication.
The Job tracker takes care of all MR jobs running on the various nodes in the Hadoop cluster. It plays a vital role in scheduling jobs and keeps track of all map and reduce jobs. The actual map and reduce tasks are performed by the Task tracker.
Fig 3: Hadoop MapReduce architecture
The MapReduce architecture consists of two main processing stages: the map stage and the reduce stage. The actual MR processing happens in the task trackers. Between the map and reduce stages an intermediate process takes place, which performs operations like shuffling and sorting of the mapper output data. This intermediate data is stored in the local file system.
Mapper Phase:
In the mapper phase, the input data is split into two components: key and value. The key must be writable and comparable during the processing stage, while the value need only be writable.
When a client submits input data to the Hadoop system, the Job tracker assigns tasks to task trackers. The input data is split into several input splits, which are logical splits. A record reader converts these input splits into key-value (KV) pairs; this is the actual input format for the mapper for further processing of data inside the task tracker. The input format type varies from one type of application to another, so the programmer has to observe the input data and code accordingly.
With the text input format, for example, the key is the byte offset and the value is the entire line. Partitioner and combiner logic is written inside the map coding logic only to perform special data operations. Data localization occurs only in mapper nodes. A combiner is also called a mini reducer: the reducer code is placed in the mapper as a combiner. When the mapper output is a huge amount of data, it requires high network bandwidth; to solve this bandwidth issue, the reducer code is placed in the mapper as a combiner for better performance. The default partition used in this process is the hash partition.
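The mapper and combiner logic described above can be sketched for a hypothetical word-count job (written here in Python in the style of Hadoop Streaming, not the native Java API; the function names are only illustrative):

```python
from collections import defaultdict

def mapper(byte_offset, line):
    """Map phase with TextInputFormat: key = byte offset, value = the entire line.
    Emits one (word, 1) pair per word."""
    for word in line.split():
        yield (word.lower(), 1)

def combiner(pairs):
    """Mini-reducer: pre-aggregates mapper output locally before the shuffle,
    so less data crosses the network."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

pairs = list(mapper(0, "big data needs big tools"))
print(pairs)            # [('big', 1), ('data', 1), ('needs', 1), ('big', 1), ('tools', 1)]
print(combiner(pairs))  # [('big', 2), ('data', 1), ('needs', 1), ('tools', 1)]
```

Note how the combiner shrinks five pairs to four before anything leaves the mapper node; on real data the savings are far larger.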
A partition module in Hadoop plays a very important role in partitioning the data received from different mappers or combiners. The partitioner reduces the pressure that builds on the reducer and thus gives more performance. A customized partition can also be performed on any relevant data using different bases or conditions. Hadoop, like Hive, also has static and dynamic partitions, which play a very important role. The partitioner splits the data into a number of output files via the reducers at the end of the map-reduce phase. The developer designs this partition code according to the business requirement. The partitioner runs between the mapper and the reducer, and is very efficient for query purposes.
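The default hash partition can be sketched as follows (Hadoop's HashPartitioner does the equivalent in Java using the key's hashCode; CRC32 is used here only because Python's built-in string hash is randomized per process):

```python
import zlib

def hash_partition(key, num_reducers):
    """Deterministically route a key to one of num_reducers partitions.
    Every occurrence of the same key lands in the same partition, so a
    single reducer sees all values for that key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

partitions = {w: hash_partition(w, 4) for w in ["big", "data", "tools"]}
```

A custom partitioner replaces this function with business-specific routing logic, e.g. sending each date range or region to its own reducer.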
Reducer Phase:
Shuffled and sorted data is passed as input to the reducer. In this phase, all incoming data is combined, and the aggregated key-value pairs are written into the HDFS system; the record writer writes data from the reducer to HDFS. A reducer is not mandatory for pure searching and mapping purposes.
Reducer logic operates on the sorted mapper data and finally produces outputs such as part-r-00000, etc. Options are provided to set the number of reducers for each job the user wants to run: in the configuration file mapred-site.xml, we can set properties that determine the number of reducers for a particular task.
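A matching reducer for the hypothetical word-count job can be sketched as (again illustrative Python, not the native Java API): it receives a key together with all of its shuffled, sorted values and writes one aggregated record.

```python
def reducer(key, values):
    """Reduce phase: after shuffle/sort, all values for one key arrive together."""
    return (key, sum(values))

# Shuffled input: each key grouped with every value emitted for it by the mappers.
shuffled = [("big", [1, 1]), ("data", [1])]
results = [reducer(k, vs) for k, vs in shuffled]
print(results)  # [('big', 2), ('data', 1)]
```

Each reducer writes its results to its own part-r-NNNNN file in HDFS; raising the configured reducer count simply spreads the keys over more such files.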
Speculative execution plays an important role during job processing. If a mapper is running slowly, the Job tracker launches a duplicate of the same task on another node, and whichever copy finishes first is used, making the job run faster. Job scheduling itself is FIFO (First In, First Out) by default.
1.3.2 Spark:
Spark provides a faster and more general data processing platform. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte.
• Currently provides APIs in Scala, Java, and Python, with support for other
languages (such as R) on the way
• Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon
S3, Hive, HBase, Cassandra, etc.)
• Can run on clusters managed by Hadoop YARN or Apache Mesos, and can
also run standalone
Spark Architecture:
Apache Spark has a well-defined and layered architecture where all the spark
components and layers are loosely coupled and integrated with various extensions
and libraries.
RDDs are collections of data items that are split into partitions and can be stored in memory on worker nodes of the Spark cluster. In terms of datasets, Apache Spark supports two types of RDDs - Hadoop datasets, which are created from files stored on HDFS, and parallelized collections, which are based on existing Scala collections. Spark RDDs support two different types of operations - transformations and actions.
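The distinction matters because transformations are lazy while actions trigger execution. A toy, pure-Python stand-in for an RDD (a conceptual sketch only, not Spark's actual API) makes this visible:

```python
class ToyRDD:
    """A minimal stand-in for an RDD: transformations are lazy, actions execute."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # recorded lineage of transformations

    def map(self, fn):                 # transformation: records the step, runs nothing
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):            # transformation: same, just extends the lineage
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):                 # action: replays the whole lineage over the data
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; collect() triggers the whole pipeline.
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

The recorded lineage is also what allows a lost partition to be re-built: Spark simply replays the transformations over the original input.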
A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on the same horizontal Spark cluster, on separate machines (i.e. in a vertical Spark cluster), or in a mixed machine configuration.
The Spark driver is the central point and the entry point of the Spark shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the Spark context is created.
The Spark driver contains various components - the DAG scheduler, task scheduler, backend scheduler and block manager - responsible for translating Spark user code into actual Spark jobs executed on the cluster. The driver program runs on the master node of the Spark cluster, schedules the job execution, and negotiates with the cluster manager. It translates RDDs into the execution graph and splits the graph into multiple stages.
The driver stores metadata about all the Resilient Distributed Datasets and their partitions. As the cockpit of job and task execution, the driver program converts a user application into smaller execution units known as tasks, which are then executed by the executors - the worker processes that run individual tasks. The driver exposes information about the running Spark application through a web UI at port 4040.
Executor is a distributed agent responsible for the execution of tasks. Executors run
for the entire lifetime of a Spark application and this phenomenon is known as
“Static Allocation of Executors”.
However, users can also opt for dynamic allocations of executors wherein they can
add or remove spark executors dynamically to match with the overall workload.
The cluster manager is an external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. There are three different types of cluster managers a Spark application can leverage for the allocation and deallocation of physical resources such as memory and CPU cores for client Spark jobs.
Hadoop YARN, Apache Mesos or the simple standalone Spark cluster manager can each be launched on-premise; choosing a cluster manager for a Spark application depends on the goals of the application.
Spark Core is the general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and can reference datasets stored in external storage systems. Spark allows developers to write code quickly with the help of a rich set of operators.
Spark SQL:
Spark SQL is a component on top of Spark Core that introduces a new set of data
abstraction called Schema RDD, which provides support for both the structured and
semi-structured data.
Spark Streaming:
This component allows Spark to process real-time streaming data. It provides an API to manipulate data streams that matches the RDD API, making it easy for programmers to move between applications that process stored data and those that manipulate data and give results in real time.
MLlib:
Apache Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms for classification, clustering, collaborative filtering, etc. It also includes a few lower-level primitives. All these functionalities help Spark scale out across a cluster.
GraphX:
Spark also comes with GraphX, a library for manipulating graphs and performing graph computations, built on top of Spark Core just like Spark Streaming and Spark SQL.
CHAPTER 2
LITERATURE REVIEW
According to a recent big data survey of 274 business and IT decision-makers in Europe, there is a clear trend towards making data available for analysis in (near) real-time, and over 70% of respondents indicate the need for real-time processing.
However, most of the existing big data technologies are designed to achieve high
throughput, but not low latency, probably due to the nature of big data, i.e., high
volume, high diversity and high velocity. The survey also shows that only 9% of the
companies have made progress with faster data availability, due to the complexity
and technical challenges of real-time processing. Most analytic updates and re-
scores must be completed by long-run batch jobs, which significantly delays
presenting the analytic results. Therefore, the whole life-cycle of data analytics
systems (e.g., business intelligence systems) requires innovative techniques and
methodologies that can provide real-time or near real-time results. That is, from the
capturing of real-time business data to transformation and delivery of actionable
information, all these stages in the life-cycle of data analytics require value-added
real-time functionalities, including real-time data feeding from operational sources,
ETL optimization, and generating real-time analytical models and dashboards.
Since the advances toward real time are affecting many enterprise applications, an
increasing number of systems have emerged to support real-time data integration
and analytics in the past few years. Some of the technologies tweaked the existing
cloud-based products to lower the latency when processing large-scale data sets;
while some others came along with other products to constitute a real-time processing system. This has brought a new dawn for enterprises to tackle the real-time challenge.
We believe it is necessary to make a survey of the existing real-time processing
systems. In this paper, we make the following contributions: First, the paper reviews
the MapReduce Hadoop implementation [4]. It discusses the reasons why Hadoop
is not suitable for real-time processing. Then, the paper surveys the real-time
architectures and open source technologies for big data with regards to the following
two categories: data integration and data analytics. In the end, the paper compares
the surveyed technologies in terms of architecture, usability, and failure
recoverability, etc., and also discusses the open issues and the trend of real-time
processing systems.
Marz [2] witnessed the shift in data processing from batch-based to real-time-based. A lot of technologies can be used to create a real-time processing system; however, it is complicated and daunting to choose the right tools, incorporate them, and orchestrate them. A generic solution to this problem is the Lambda Architecture proposed by Marz. This architecture defines the most general system as running arbitrary functions on arbitrary data using the following equation: "query = function (all data)" [7]. The premise behind this architecture is that one can run ad-hoc queries against all the data to get results; however, it is very expensive to do so in terms of resources. So the idea is to pre-compute the results as a set of views, and then query the views. Fig 6 describes the lambda architecture, which consists of three layers. First, the batch layer computes views on the collected data, and repeats the process, once it is done, to infinity; its output is always outdated by the time it is available, because new data has been received in the meantime. Second, a parallel speed processing layer closes this gap by constantly processing the most recent data in near real-time fashion.
Any query against the data is answered through the serving layer, i.e., by querying
both the speed and the batch layers’ serving stores in the serving layer, and the results
are merged to answer user queries.
The batch and speed layers both are forgiving and can recover from errors by being
recomputed, rolled back (batch layer), or simply flushed (speed layer). A core
concept of the lambda architecture is that the (historical) input data is stored
unprocessed. This concept provides the following two advantages. Future
algorithms, analytics, or business cases can retroactively be applied to all the data
by simply adding another view on the whole data. Human errors like bugs can be
corrected by re-computing the whole output after a bug fix. Many systems are built
initially to store only processed or aggregated data.
The lambda architecture[4] can avoid this by storing the raw data as the starting point
for the batch processing. However, the lambda architecture itself is only a paradigm.
The technologies with which the different layers are implemented are independent
from the general idea. The speed layer deals only with new data and compensates
for the high latency updates of the batch layer.
It can typically leverage stream processing systems, such as Storm, S4, and Spark,
etc. The batch serving layer can be very simple in contrast. It needs to be horizontally
scalable and supports random reads, but is only updated by batch processes. The
technologies like Hadoop with Cascading, Scalding, Python streaming, Pig, and
Hive, are suitable for the batch layer. The serving layer requires a system that can
perform fast random reads and writes. The data in the serving layer are replaced by
the fresh real-time data, and the views computed by the batch layer. Therefore, the
size of data is relatively small, and the system in serving layer does not need complex
locking mechanisms since the views are replaced in one operation. In-memory data stores like Redis or Memcached are enough for this layer. If users require high availability, low latency, and storage of large volumes of data in the serving store, systems like HBase, Cassandra, ElephantDB, MongoDB, or DynamoDB can be potential options.
Spark can work on RDDs for multiple iterations, which many machine learning algorithms require. An RDD resides in main memory, but can be persisted to disk as requested; if a partition of an RDD is lost, it can be re-built. Spark also supports shared variables of two kinds, broadcast variables and accumulator variables. A shared variable is typically used where a global value is needed, such as a lookup table or a counter.
Spark Streaming extends Spark by adding the ability to perform online processing through a functional interface similar to Spark's, with operators such as map, filter, reduce, etc. Spark Streaming runs streaming computations as a series of short batch jobs on RDDs, and it can automatically parallelize the jobs across the nodes in a cluster. It supports fault recovery for a wide array of operators.
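The "series of short batch jobs" idea can be mimicked in plain Python (a conceptual sketch only, not the actual DStream API): the incoming stream is cut into small batches, and ordinary batch logic is applied to each.

```python
def micro_batches(stream, batch_size):
    """Cut an (unbounded) event stream into small batches, as Spark Streaming
    does by time interval (here by count, for simplicity)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:               # flush the final, possibly partial, batch
        yield batch

# Each micro-batch is then processed with ordinary batch logic (here: a sum).
events = [3, 1, 4, 1, 5, 9, 2, 6]
sums = [sum(b) for b in micro_batches(events, 3)]
print(sums)  # [8, 15, 8]
```

This is what makes fault recovery cheap: a lost micro-batch is just a small RDD that can be recomputed from its lineage.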
Overview: The MapReduce paradigm and its implementation, Hadoop, are widely used in data-intensive areas and receive ever-growing attention. They are a milestone of the big data era. Since MapReduce was originally invented to process large-scale data in batch jobs in a shared-nothing cluster environment, its implementation is very different from traditional processing systems such as DBMSs. Scalability, efficiency and fault tolerance are the main focuses of MapReduce implementations, rather than real-time capability. In the past few years, many technologies have been proposed to optimize MapReduce-based systems.
These technologies narrow the "gap" between batch and real-time data processing. However, we believe that the decision of selecting the best data processing system depends on the types and sources of data, the processing time needed, and the ability to take immediate actions [9]. The MapReduce paradigm is best suited for batch jobs.
For example, benchmark studies comparing the MapReduce paradigm and parallel DBMSs have shown that DBMSs are substantially faster than MapReduce. Beyond Hadoop 1.x, the next generation of Hadoop, YARN, is implemented as an open framework that can integrate with different third-party real-time processing systems, including Storm, Flume, Spark, etc., which is a promising direction for achieving real-time capability. With the help of these third-party components, YARN compensates for the deficiency of real-time capability in Hadoop 1.x. However, at the time of writing this paper, we still have not found any real-world cases of using YARN for real-time analytics. To the best of our knowledge, most of the current real-time enterprise systems use the three-layer lambda architecture or its variants as discussed in Section 4. The main reason that makes the lambda architecture popular is its high flexibility, in which users can combine different technologies catering to their business requirements; e.g., in the context of legacy systems, the lambda architecture can show its advantage over others.
Many real-time processing systems [5] now seek ways to enable in-memory processing to avoid reading from and writing to disk whenever possible, as well as integration with distributed computing technologies. This improves scalability while maintaining low latency.
2.3 Overview of survey:
In this paper, we first studied the shortcomings of Hadoop in regards to the real-time
data processing, then presented one of the most popular real-time architectures, the
three-layer lambda architecture.
We have compared the real-time systems from the perspectives of architectures, use
cases (i.e., integration or query analytics), recoverability from failures, and license
types. We also found that despite their diversity, the real-time systems share a great similarity in that they heavily use main memory and distributed computing technologies.
CHAPTER 3
EXISTING METHODS
Traditional data systems, such as relational databases and data warehouses, have
been the primary way businesses and organizations have stored and analyzed their
data for the past 30 to 40 years. Although other data stores and technologies exist,
the major percentage of business data can be found in these traditional systems.
Traditional systems are designed from the ground up to work with data that is primarily structured. Characteristics of how these traditional systems handle data include:
A design that reads data from disk and loads it into memory to be processed by applications. This is an extremely inefficient architecture when processing large volumes of data: the data is extremely large and the programs are small, so the big component must move to the small component for processing.
The use of Structured Query Language (SQL) for managing and accessing the data.
Relational and warehouse database systems often read data in 8k or 16k block sizes.
These block sizes load data into memory, and then the data are processed by
applications. When processing large volumes of data, reading the data in these block
sizes is extremely inefficient.
In several traditional siloed environments data scientists can spend 80% of their time
looking for the right data and 20% of the time doing analytics. A data-driven
environment must have data scientists spending a lot more time doing analytics.
Every year organizations need to store more and more detailed information for
longer periods of time. Increased regulation in areas such as health and finance are
significantly increasing storage volumes. Expensive shared storage systems often
store this data because of the critical nature of the information. Shared storage arrays
provide features such as striping (for performance) and mirroring (for availability).
Managing the volume and cost of this data growth within these traditional systems
is usually a stress point for IT organizations. Examples of data often stored in
structured form include Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), financial, retail, and customer data.
3.1.1 Clustering Technique:-
Clustering is a data mining technique used to retrieve important and relevant information about data and metadata. We use it to classify different data into different classes: it segments data records into different groups, called classes. The method of identifying similar groups of data in a data set is called clustering; entities in each group are comparatively more similar to entities of that group than to those of other groups. This section covers the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.
Types of clustering algorithms: Since the task of clustering is subjective, the means
that can be used for achieving this goal are plenty. Every methodology follows a
different set of rules for defining the ‘similarity’ among data points. In fact, there are
more than 100 clustering algorithms known. But few of the algorithms are used
popularly; let’s look at them in detail:
Connectivity models: As the name suggests, these models are based on the notion
that the data points closer in data space exhibit more similarity to each other than the
data points lying farther away. These models can follow two approaches. In the first
approach, they start with classifying all data points into separate clusters & then
aggregating them as the distance decreases.
In the second approach, all data points are classified as a single cluster and then
partitioned as the distance increases. Also, the choice of distance function is
subjective. These models are very easy to interpret but lack scalability for handling
big datasets. Examples of these models are hierarchical clustering algorithm and its
variants.
Centroid models: These are iterative clustering algorithms in which the notion of
similarity is derived by the closeness of a data point to the centroid of the clusters.
K-Means clustering algorithm is a popular algorithm that falls into this category. In
these models, the no. of clusters required at the end have to be mentioned
beforehand, which makes it important to have prior knowledge of the dataset. These
models run iteratively to find the local optima.
Distribution models: These clustering models are based on the notion of how
probable is it that all data points in the cluster belong to the same distribution (For
example: Normal, Gaussian). These models often suffer from overfitting. A popular
example of these models is Expectation-maximization algorithm which uses
multivariate normal distributions.
Density Models: These models search the data space for areas of varied density of
data points in the data space. It isolates various different density regions and assign
the data points within these regions in the same cluster. Popular examples of density
models are DBSCAN and OPTICS.
So the better choice is to place the initial centers as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated: as a result of this loop we may notice that the k centers change their location step by step until no more changes are made, or in other words the centers do not move any more. Finally, this algorithm aims at minimizing an objective function, known as the squared error function, given by:
J(V) = Σ (i=1 to c) Σ (j=1 to ci) (||xi − vj||)²

Where,
‘c’ is the number of cluster centers, ‘ci’ is the number of data points in the i-th cluster, and ‘||xi - vj||’ is the Euclidean distance between xi and vj.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate the new cluster centers as the mean of the points assigned to each cluster:

Vi = (1/ci) Σ (j=1 to ci) Xj

where ‘ci’ is the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 3).
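Steps 1)-6) above can be sketched end-to-end in plain Python (an illustrative implementation for small 2-D data sets, not an optimized library version):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means following steps 1-6 above, for points given as (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # step 1: pick k initial centers
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # steps 2-3: assign to nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)             # step 4: recompute centers as means
        ]
        if new_centers == centers:                       # step 6: stop when centers settle
            return centers, clusters
        centers = new_centers                            # step 5: repeat with new centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centers, clusters = kmeans(points, 2)
```

On this toy data the two well-separated groups of three points each are recovered regardless of which points are drawn as initial centers; on overlapping data the random initialization can change the result, which is exactly the disadvantage noted below.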
Disadvantages:
1) Exclusive assignment - if there are two highly overlapping groups of data, then k-means will not be able to resolve that there are two clusters.
2) The learning algorithm provides only a local optimum of the squared error function.
3) Randomly choosing the cluster centers may not lead to a fruitful result.
Fig 7: Showing the non-linear data set where k-means algorithm fails
4) It is applicable only when the mean is defined, i.e. it fails for categorical data.
Advantages:
1) Gives the best results when the data sets are distinct or well separated from each other.
3.1.3 Drawbacks of Traditional Systems:-
The reason traditional systems have a problem with big data is that they were not
designed for it.
4. Problem—Complexity: When you look at any traditional proprietary solution, it
is full of extremely complex silos of system administrators, DBAs, application
server teams, storage teams, and network teams. Often there is one DBA for every
40 to 50 database servers. Anyone running traditional systems knows that complex
systems fail in complex ways.
➔ Imagine that for a database of 1.1 billion people, one would like to compute
the average number of social contacts a person has according to age. In SQL,
such a query could be expressed as :-
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age;
3.2 Data analysis using Hadoop MapReduce:-
1. Map: each worker node applies the map function to the local data, and writes
the output to a temporary storage. A master node ensures that only one copy
of redundant input data is processed.
2. Shuffle: worker nodes redistribute data based on the output keys (produced
by the map function), such that all data belonging to one key is located on the
same worker node.
3. Reduce: worker nodes now process each group of output data, per key, in
parallel.
MapReduce allows for distributed processing of the map and reduction operations.
Maps can be performed in parallel, provided that each mapping operation is
independent of the others; in practice, this is limited by the number of independent
data sources and/or the number of CPUs near each source.
Similarly, a set of 'reducers' can perform the reduction phase, provided that all
outputs of the map operation that share the same key are presented to the same
reducer at the same time, or that the reduction function is associative. While this
process can often appear inefficient compared to algorithms that are more sequential
(because multiple instances of the reduction process must be run), MapReduce can
be applied to significantly larger datasets than a single "commodity" server can
handle – a large server farm can use MapReduce to sort a petabyte of data in only a
few hours.
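The Map, Shuffle, and Reduce phases described above can be sketched as a toy single-process framework (purely illustrative; a real framework distributes these phases across worker nodes):

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Run the three phases in-process: mapper(k1, v1) yields (k2, v2)
    pairs; reducer(k2, [v2, ...]) returns the reduced value for one key."""
    # Map phase: apply the mapper to every input pair independently.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # Shuffle phase: group all values that share the same key k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: each key group is reduced independently.
    return {k2: reducer(k2, vs) for k2, vs in groups.items()}

# Classic word-count example.
def wc_map(_, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

docs = [(1, "big data big ideas"), (2, "big clusters")]
print(map_reduce(docs, wc_map, wc_reduce))
# {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```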
3.2.1 Logical view:-
→The Map and Reduce functions of MapReduce are both defined with respect to
data structured in (key, value) pairs. Map takes one pair of data with a type in
one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2).
→The Map function is applied in parallel to every pair (keyed by K1) in the input
dataset. This produces a list of pairs (keyed by K2) for each call. After that, the
MapReduce framework collects all pairs with the same key (K2) from all lists and
groups them together, creating one group for each key.
→The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3).
→Each Reduce call typically produces either one value v3 or an empty return,
though one call is allowed to return more than one value. The returns of all calls are
collected as the desired result list.
→Thus the MapReduce framework transforms a list of (key, value) pairs into a list
of values. This behavior is different from the typical functional programming map
and reduce combination, which accepts a list of arbitrary values and returns one
single value that combines all the values returned by map.
→It is necessary but not sufficient to have implementations of the map and reduce
abstractions in order to implement MapReduce. Distributed implementations of
MapReduce require a means of connecting the processes performing the Map and
Reduce phases. This may be a distributed file system. Other options are possible,
such as direct streaming from mappers to reducers, or for the mapping processors to
serve up their results to reducers that query them.
Example using Hadoop MapReduce: -
➔ Using MapReduce, the K1 key values could be the integers 1 through 1100,
each representing a batch of 1 million records, the K2 key value could be a
person's age in years, and this computation could be achieved using the
following functions :-
function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million
           social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record (Y, (N, 1))
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record (Y, (N, C)) do
        accumulate in S the sum of N*C
        accumulate in Cnew the sum of C
    repeat
    let A be S/Cnew
    produce one output record (Y, (A, Cnew))
end function
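For illustration, the pseudocode above can be run as ordinary Python over a small made-up sample (the 1100 batches of a million records are scaled down to two tiny batches, and the ages and contact counts are invented):

```python
from collections import defaultdict

def map_batch(batch):
    """Map: for each (age, contacts) record, emit (Y, (N, 1))."""
    for age, contacts in batch:
        yield age, (contacts, 1)

def reduce_age(age, records):
    """Reduce: combine partial (N, C) pairs into (average, count)."""
    s = sum(n * c for n, c in records)      # accumulate S = sum of N*C
    cnt = sum(c for _, c in records)        # accumulate Cnew = sum of C
    return age, (s / cnt, cnt)              # A = S/Cnew

# Two small "batches" standing in for the 1100 batches of a million records.
batches = [
    [(25, 100), (30, 150), (25, 300)],
    [(30, 250), (40, 80)],
]

# Shuffle: group the mapped pairs by age.
grouped = defaultdict(list)
for batch in batches:
    for age, pair in map_batch(batch):
        grouped[age].append(pair)

results = sorted(reduce_age(age, pairs) for age, pairs in grouped.items())
print(results)  # [(25, (200.0, 2)), (30, (200.0, 2)), (40, (80.0, 1))]
```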
3.2.2 Drawbacks of Hadoop:-
1. Small files:-
→Hadoop is not suited to small data. The Hadoop Distributed File System (HDFS)
lacks the ability to efficiently support random reads of small files because of its
high-capacity design.
2. Latency:-
→In Hadoop, MapReduce processes data sets with a parallel, distributed algorithm.
There are two tasks to perform, Map and Reduce, and MapReduce requires a lot of
time to perform these tasks, thereby increasing latency. Data is distributed and
processed over the cluster in MapReduce, which increases the time and reduces
processing speed.
3. Security:-
→Hadoop relies on Kerberos for authentication, which is difficult to manage.
4. No real-time processing:-
→Apache Hadoop is designed for batch processing: it takes a huge amount of data
as input, processes it, and produces the result. Hadoop is not suitable for real-time
data processing.
→Since Hadoop supports batch processing only and does not process streamed data,
overall performance is slower. The MapReduce framework of Hadoop does not
leverage the memory of the Hadoop cluster to the maximum.
5. Uncertainty:-
→Hadoop only ensures that a data job completes; it is unable to guarantee when
the job will complete.
6. Lengthy code:-
→Hadoop has about 1,20,000 lines of code; a larger number of lines produces a
larger number of bugs, and programs take more time to execute.
7. No caching:-
→Hadoop is not efficient for caching. In Hadoop, MapReduce cannot cache the
intermediate data in memory for further requirements, which diminishes the
performance of Hadoop.
NOTE:- As a result of the limitations of Hadoop, the need for Spark and Flink
emerged, which made systems friendlier for working with huge amounts of data.
Spark provides in-memory processing of data, which improves processing speed.
Flink improves overall performance, as it provides a single run-time for streaming
as well as batch processing.
CHAPTER 4
PROPOSED SYSTEM
Twitter is a gold mine of data. Unlike other social platforms, almost every user's
tweets are completely public and pullable. This is a huge plus if you're trying to get
a large amount of data to run analytics on. Twitter data is also specific: Twitter's
API allows you to do complex queries, like pulling every tweet about a certain topic
within the last twenty minutes, or pulling a certain user's non-retweeted tweets.
A simple application of this could be analyzing how your company is received in the
public. You could collect the last 2,000 tweets that mention your company (or any
term you like) and run a sentiment analysis algorithm over it.
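That workflow can be sketched with Tweepy's pagination helper (a hedged sketch assuming the Tweepy 3.x search API; the import is guarded so the snippet stays self-contained, and `api` is presumed to be an already-authenticated client):

```python
try:
    import tweepy  # third-party: pip install tweepy
except ImportError:
    tweepy = None

def recent_mentions(api, term, limit=2000):
    """Pull up to `limit` recent tweets mentioning `term` via the search
    endpoint, returning their text for a downstream sentiment step."""
    return [
        status.text
        for status in tweepy.Cursor(api.search, q=term).items(limit)
    ]
```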
Tools Overview:
We'll be using Python 3.6. Ideally, you should have an IDE to write this
code in. To connect to Twitter's API, we will be using a Python library
called Tweepy, which we'll install in a bit.
Getting Started:
• To use Twitter’s API, we must create a developer account on the Twitter apps
site.
• Log in or make a Twitter account at https://apps.twitter.com/.
• Create a new app
• Fill in the app creation page with a unique name, a website name (use a
placeholder website if you don’t have one), and a project description. Accept
the terms and conditions and proceed to the next page.
• Once your project has been created, click on the "Keys and Access Tokens"
tab. You should now be able to see your consumer secret and consumer key.
You'll also need a pair of access tokens; request those tokens, and the page
will refresh to show them.
Tweepy is an excellently supported tool for accessing the Twitter API. It supports
Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6. There are a couple of different ways to install
Tweepy. The easiest way is using pip.
Using Pip: -Simply type pip install tweepy into your terminal.
Step 3: -Authenticating
First, let's import Tweepy and add our own authentication information (the values
below are masked placeholders; note that a consumer key is needed alongside the
consumer secret):
import tweepy
consumer_key="xXXXXXXXXXXXXXXXXXXXXXXXx"
consumer_secret="qXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXh"
access_token="9XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXi"
access_token_secret="kXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXT"
Now it’s time to create our API object.
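A sketch of that step, assuming the Tweepy 3.x API (the four credential values are the placeholders gathered above; the helper name `make_api` is our own, and the import is guarded so the sketch stays self-contained):

```python
try:
    import tweepy  # third-party: pip install tweepy
except ImportError:
    tweepy = None

def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated Tweepy API object from the four credentials
    gathered on the 'Keys and Access Tokens' tab."""
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit makes Tweepy sleep instead of failing when
    # Twitter's rate limits are hit.
    return tweepy.API(auth, wait_on_rate_limit=True)
```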
→Prerequisites:
Installing Hadoop:
Step 1: Download the Java 8 package from the official website. Save this file in your
home directory.
Command: tar -xvf hadoop-2.7.3.tar.gz
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open .bashrc file. Now, add Hadoop and Java Path as shown below.
For applying all these changes to the current Terminal, execute the source command.
To make sure that Java and Hadoop have been properly installed on your system
and can be accessed through the Terminal, execute the java -version and hadoop
version commands.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside
configuration tag:
Command: vi core-site.xml
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag:
Command: vi hdfs-site.xml
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:
mapred-site.xml contains the configuration settings of the MapReduce application,
such as the number of JVMs that can run in parallel, the size of the mapper and
reducer processes, the CPU cores available for a process, etc.
In some cases, the mapred-site.xml file is not available, so we have to create it
from the mapred-site.xml template.
Command: vi mapred-site.xml
Step10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:
Command: vi yarn-site.xml
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run
Hadoop like Java home path, etc.
Command: vi hadoop-env.sh
Command: cd
Command: cd hadoop-2.7.3
This formats HDFS via the NameNode. This command is executed only the first
time. Formatting the file system means initializing the directory specified by the
dfs.name.dir variable.
Never format an up-and-running Hadoop file system: you will lose all the data
stored in HDFS.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
→The above command is a combination of start-dfs.sh, start-yarn.sh & mr-
jobhistory-daemon.sh
Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree
of all files stored in HDFS and tracks all the files stored across the cluster.
Start DataNode:
Start ResourceManager:
The ResourceManager is the master that arbitrates all the available cluster resources
and thus helps in managing the distributed applications running on the YARN
system. Its job is to manage each NodeManager and each application's
ApplicationMaster.
Start NodeManager:
The NodeManager in each machine framework is the agent which is responsible for
managing containers, monitoring their resource usage and reporting the same to the
ResourceManager.
→After installing Hadoop successfully, download Eclipse from the official website,
add the Hadoop jar files to Eclipse, and run the Hadoop Java code in Eclipse.
Prerequisites:
→Unzip the folder in your home directory using the following command.
→Use the following command to see that you have a .bashrc file: ls -a
→Next, we will edit our .bashrc so we can open a spark notebook in any directory by
using command gedit ~/.bashrc and export the following:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
→Save and exit out of your .bashrc file. Either close the terminal and open a new one
or in your terminal type : source ~/.bashrc
2. Using Twitter, send the extracted data into another host where it needs to be
processed.
3. The solution provided for streaming real-time log data is to extract the Twitter
data.
4. Provide a file which contains the keywords of the specified Twitter data,
according to the need, in the Spark processing logic.
5. The processing logic should be written in Spark-Scala or Spark-Java and stored
in HDFS for data-processing purposes.
6. Use Twitter to send the streaming data into another port.
7. Use Spark Streaming to receive the data from the port and check which data
contains the specified information (according to the code written for the type
of data needed), extract that data, and store it into HDFS or HBase.
8. Categorize the stored data using Tableau visualization.
4.2.1 Data flow in spark streaming: -
Internally, it works as follows. Spark Streaming receives live input data streams and
divides the data into batches, which are then processed by the Spark engine to
generate the final stream of results in batches.
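This batching behaviour can be simulated in plain Python (`batch_size` stands in for Spark Streaming's batch interval, and the keyword filter stands in for the per-batch processing logic of the pipeline; both are illustrative):

```python
def micro_batches(stream, batch_size):
    """Chop a continuous stream of records into fixed-size batches,
    the way Spark Streaming divides live input into batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

def keyword_filter(batch, keywords):
    """Per-batch processing step: keep only tweets mentioning a keyword."""
    return [t for t in batch if any(k.lower() in t.lower() for k in keywords)]

tweets = [
    "Loving #spark streaming",
    "lunch photos",
    "hadoop vs spark debate",
    "weather today",
    "Spark is fast",
]
for batch in micro_batches(tweets, 2):
    print(keyword_filter(batch, ["spark"]))
```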
DStream: -
Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream, which represents a continuous stream of data. Internally, a DStream is
represented as a sequence of RDDs.
1) Data collection: -
It is the major process in Spark streaming. Collecting data or log files from web
servers can be done in many ways. Data can be collected from Twitter, Facebook,
or the cloud, and then it must be stored in a centralized store for processing. We
have used Tweepy for collecting data.
HDFS:
The Hadoop Distributed File System was developed using distributed file system
design and runs on commodity hardware. Unlike other distributed systems, HDFS
is highly fault tolerant and designed using low-cost hardware. HDFS holds very
large amounts of data and provides easy access. To store such huge data, the files
are stored across multiple machines. These files are stored in a redundant fashion to
rescue the system from possible data loss in case of failure. HDFS also makes
applications available for parallel processing.
Hardware Failure: -
Hardware failure is the norm rather than the exception. An HDFS instance may
consist of hundreds or thousands of server machines, each storing part of the file
system's data. The fact that there are a huge number of components and that each
component has a non-trivial probability of failure means that some component of
HDFS is always non-functional. Therefore, detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS.
Streaming Data Access: -
Applications that run on HDFS need streaming access to their data sets. They are
not general-purpose applications that typically run on general-purpose file systems.
HDFS is designed more for batch processing than for interactive use by users. The
emphasis is on high throughput of data access rather than low latency of data
access. POSIX imposes many hard requirements that are not needed for applications
that are targeted for HDFS. POSIX semantics in a few key areas have been traded
to increase data throughput rates.
Large Data Sets: -
Applications that run on HDFS have large data sets. A typical file in HDFS is
gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should
provide high aggregate data bandwidth and scale to hundreds of nodes in a single
cluster. It should support tens of millions of files in a single instance.
Simple Coherency Model: -
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high-throughput data access. A MapReduce
application or a web crawler application fits perfectly with this model. There is a
plan to support appending-writes to files in the future.
Portability Across Heterogeneous Hardware and Software Platforms: -
HDFS has been designed to be easily portable from one platform to another. This
facilitates widespread adoption of HDFS as a platform of choice for a large set of
applications.
2) Data Processing:
Data processing is done in the Spark shell. Data can be processed using
programming languages such as Python, Java, or Scala. Here we have used Python
for processing the data or log files that are stored in the HDFS file system. The
processed data must be stored again in a file system for visualizing it.
What is Python?
Python's simple, easy to learn syntax emphasizes readability and therefore reduces
the cost of program maintenance. Python supports modules and packages, which
encourages program modularity and code reuse. The Python interpreter and the
extensive standard library are available in source or binary form without charge for
all major platforms, and can be freely distributed.
After executing the code in python language in spark shell, the data that is processed
need to be visualized again. The data visualization process is done as following:
3) Data visualization:
Because of the way the human brain processes information, using charts or graphs
to visualize large amounts of complex data is easier than poring over spreadsheets.
Sentiment Analysis:
Sentiment analysis is also referred to as subjectivity analysis, opinion mining, and
appraisal extraction. In this project, TextBlob is used to analyze and visualize
tweets.
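A minimal sketch of scoring a tweet with TextBlob (the import is guarded since TextBlob is a third-party package; the thresholding into positive/negative/neutral is a common convention, not the only choice):

```python
try:
    from textblob import TextBlob  # third-party: pip install textblob
except ImportError:
    TextBlob = None

def classify(tweet):
    """Return 'positive', 'negative' or 'neutral' for a tweet using
    TextBlob's polarity score, which ranges from -1.0 to +1.0."""
    polarity = TextBlob(tweet).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"
```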
Why Sentiment Analysis?
1. Data Storage: Hadoop stores data on disk, whereas Spark stores data in-memory.
3. Lines of code: Hadoop 2.0 has about 1,20,000 lines of code, whereas Apache
Spark has only about 20,000 lines of code.
5. Streaming data: With Hadoop MapReduce one can process batches of stored
data, whereas Spark can process as well as modify real-time data with Spark
Streaming.
10. Security: Apache Hadoop MapReduce is more secure because of Kerberos,
whereas Spark is a little less secure in comparison because it supports only
authentication through a shared secret password.
CHAPTER 5
Results from Hadoop Word Count:
CHAPTER 6
CONCLUSION
In this project, we showed how Apache Spark can analyze Twitter data in real time.
In our experimental setup, Spark analyzed the tweets in a short period, less than a
second, which proves that Spark is a good tool for processing streaming data in
real time. If the data is static and it is possible to wait for the end of batch
processing, MapReduce is enough; but to analyze streaming data, it is necessary to
use Spark. As shown in our study, Spark was able to analyze the data in a few
seconds. Spark is a high-quality tool for in-memory processing that allows
processing streaming data in real time on large amounts of data. Apache Spark is
much more advanced than MapReduce: it supports several requirements such as
real-time, batch, and streaming processing.
Hence, the differences between Apache Spark and Hadoop MapReduce show that
Apache Spark is a much more advanced cluster-computing engine than MapReduce.
Spark can handle any type of requirement (batch, interactive, iterative, streaming,
graph), while MapReduce is limited to batch processing. Spark is one of the favorite
choices of data scientists. Thus, Apache Spark is growing very quickly and
replacing MapReduce.
CHAPTER 7
FUTURE WORK
The framework Apache Flink surpasses Apache Spark. Further research should
focus on optimizing Hadoop and Spark by tuning their default parameter
configuration settings; we suggest applying such tuning to improve performance.
The key vision for Apache Flink is to overcome and reduce the complexity faced
by other distributed data-driven engines. This is achieved by integrating query
optimization and concepts from database systems with efficient parallel in-memory
and out-of-core algorithms, on top of the MapReduce framework.
As Apache Flink is mainly based on the streaming model, it iterates over data using
a streaming architecture. The concept of an iterative algorithm is tightly bound into
the Flink query optimizer. Apache Flink's pipelined architecture allows processing
streaming data faster, with lower latency, than micro-batch architectures such as
Spark's. This optimization issue can therefore be addressed by Apache Flink, which
is much faster than Spark: Flink increases job performance by processing only the
parts of the data that have changed. We plan to address this issue in the future using
Apache Flink.
CHAPTER 8
REFERENCES
[2] Dean, J., & Ghemawat, S. (2010). MapReduce: a flexible data processing tool.
Communications of the ACM, 53(1), 72-77.
[3] Patel, A. B., Birla, M., & Nair, U. (2012, December). Addressing big data
problem using Hadoop and MapReduce. In 2012 Nirma University International
Conference on Engineering (NUiCONE) (pp. 1-5). IEEE.
[4] Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark:
lightning-fast big data analysis. O'Reilly Media, Inc.
[14] Hadoop, https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/SingleCluster.html
[16] https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
[18] https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/