Spark 4-2 Documentation
INTRODUCTION
Every minute of every day, data has been expanding so rapidly over the past few years that traditional database storage can no longer process it. Hadoop was therefore introduced as one of the tools of "Big Data" that stores huge amounts of data and analyses them. However, it cannot analyze and present the data in real time.
Big data is a field that treats ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt with
by traditional data-processing application software. Big Data is used to describe a
collection of data that is huge in size and yet growing exponentially with time.
In short such data is so large and complex that none of the traditional data
management tools are able to store it or process it efficiently.
(i) Volume – The name Big Data itself is related to a size which is enormous. Size
of data plays a very crucial role in determining value out of data. Also, whether a
particular data can actually be considered as a Big Data or not, is dependent upon
the volume of data. Hence, 'Volume' is one characteristic which needs to be
considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety. Variety refers to
heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data
considered by most of the applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in
the analysis applications. This variety of unstructured data poses certain issues for
storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet the demands, determines real
potential in the data.
→Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
→Today's market is flooded with an array of Big Data tools. They bring cost efficiency and better time management to data analytics tasks.
There are many big data tools like Hadoop, Spark, Cassandra, Storm, etc., but in this project we use Hadoop and Spark for cluster computation concepts.
1.3.1 Hadoop:
Apache Hadoop offers a scalable, flexible and reliable distributed computing big
data framework for a cluster of systems with storage capacity.
Hadoop follows a Master Slave architecture for the transformation and analysis of
large datasets using Hadoop MapReduce paradigm. The important Hadoop
components that play a vital role in the Hadoop architecture are -
A] Hadoop Distributed File System (HDFS) – Patterned after the UNIX file system
B] Hadoop MapReduce
Fig 2: Hadoop overview
The Hadoop Distributed File System was developed using a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines, in a redundant fashion, to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
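Assuming the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable, so the numbers below are only illustrative), the storage arithmetic of this redundant layout can be sketched as:

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how HDFS splits and replicates a file across the cluster."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # logical splits of the file
    raw_copies = num_blocks * replication                 # physical block copies stored
    total_stored_mb = file_size_mb * replication          # approximate bytes on disk
    return num_blocks, raw_copies, total_stored_mb

# A 1 GB (1024 MB) file becomes 8 blocks, 24 physical copies, 3072 MB on disk.
print(hdfs_storage(1024))  # (8, 24, 3072)
```

This is why losing a single machine does not lose data: every block has two other copies elsewhere in the cluster.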
HDFS supports the rapid transfer of data between compute nodes. At its outset, it
was closely coupled with MapReduce, a programmatic framework for data
processing.
Features of HDFS:
1) It is suitable for distributed storage and processing.
2) Hadoop provides a command interface to interact with HDFS.
3) The built-in servers of namenode and datanode help users to easily check the status of the cluster.
4) It provides streaming access to file system data, along with file permissions and authentication.
5) HDFS follows the master-slave architecture and it has the following elements.
Namenode:
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The system having the namenode acts as the master server and it does the following tasks:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming, closing, and opening files and directories.
Datanode:
The datanode is commodity hardware having the GNU/Linux operating system and the datanode software. For every node in the cluster, there is a datanode. Datanodes perform read-write operations on the file system as per client request, and also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Goals of HDFS:
Fault detection and recovery: Since HDFS includes a large number of commodity
hardware, failure of components is frequent. Therefore HDFS should have
mechanisms for quick and automatic fault detection and recovery.
Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
B] Hadoop MapReduce:
MapReduce is mainly used for parallel processing of large sets of data stored in a Hadoop cluster. It was originally designed by Google to provide parallelism, data distribution and fault tolerance. MR processes data in the form of key-value pairs. A key-value (KV) pair is a mapping between two linked data items - a key and its value.
The key (K) acts as an identifier for the value. An example of a key-value (KV) pair is one where the key is a node id and the value is its properties, including neighbor nodes, predecessor node, etc. The MR API provides features such as batch processing, parallel processing of huge amounts of data, and high availability. MR comes into the picture for processing large sets of data. Programmers write MR applications suited to their business scenarios; they have to understand the MR work flow, and according to that flow, applications are developed and deployed across Hadoop clusters. Hadoop is built on Java APIs and provides MR APIs that deal with parallel computing across nodes.
The MR work flow goes through different phases and the result is stored in HDFS with replication.
The Job tracker takes care of all MR jobs running on the various nodes in the Hadoop cluster. It plays a vital role in scheduling jobs and keeps track of all map and reduce jobs. The actual map and reduce tasks are performed by the Task tracker.
Fig 3: Hadoop MapReduce architecture
The MapReduce architecture consists of two main processing stages: the map stage and the reduce stage. The actual MR processing happens in the task trackers. Between the map and reduce stages an intermediate process takes place, which performs operations like shuffling and sorting of the mapper output data. This intermediate data is stored in the local file system.
Mapper Phase:
In the mapper phase, the input data is split into two components: key and value. The key must be writable and comparable during the processing stage, while the value need only be writable.
When a client submits input data to the Hadoop system, the Job tracker assigns tasks to task trackers. The input data is split into several input splits, which are logical splits. A record reader converts these input splits into key-value (KV) pairs; this is the actual input format for the mapper for further processing of data inside the task tracker. The input format type varies from one type of application to another, so the programmer has to observe the input data and code accordingly.
With the text input format, for example, the key is the byte offset and the value is the entire line. Partitioner and combiner logic is written inside the map coding logic only to perform special data operations. Data localization occurs only in mapper nodes. A combiner is also called a mini reducer: the reducer code is placed in the mapper as a combiner. When the mapper output is a huge amount of data, it requires high network bandwidth; to solve this bandwidth issue, the reducer code is placed in the mapper as a combiner for better performance. The default partition used in this process is the hash partition.
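The mapper and combiner logic described above can be sketched for a hypothetical word-count job (written here in Python in the style of Hadoop Streaming, not the native Java API; the function names are only illustrative):

```python
from collections import defaultdict

def mapper(byte_offset, line):
    """Map phase with TextInputFormat: key = byte offset, value = the entire line.
    Emits one (word, 1) pair per word."""
    for word in line.split():
        yield (word.lower(), 1)

def combiner(pairs):
    """Mini-reducer: pre-aggregates mapper output locally before the shuffle,
    so less data crosses the network."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

pairs = list(mapper(0, "big data needs big tools"))
print(pairs)            # [('big', 1), ('data', 1), ('needs', 1), ('big', 1), ('tools', 1)]
print(combiner(pairs))  # [('big', 2), ('data', 1), ('needs', 1), ('tools', 1)]
```

Note how the combiner shrinks five pairs to four before anything leaves the mapper node; on real data the savings are far larger.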
A partition module in Hadoop plays a very important role in partitioning the data received from different mappers or combiners. The partitioner reduces the pressure that builds on the reducer and thus gives more performance. A customized partition can also be performed on any relevant data using different bases or conditions. Hadoop, like Hive, also has static and dynamic partitions, which play a very important role. The partitioner splits the data into a number of output files via the reducers at the end of the map-reduce phase. The developer designs this partition code according to the business requirement. The partitioner runs between the mapper and the reducer, and is very efficient for query purposes.
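The default hash partition can be sketched as follows (Hadoop's HashPartitioner does the equivalent in Java using the key's hashCode; CRC32 is used here only because Python's built-in string hash is randomized per process):

```python
import zlib

def hash_partition(key, num_reducers):
    """Deterministically route a key to one of num_reducers partitions.
    Every occurrence of the same key lands in the same partition, so a
    single reducer sees all values for that key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

partitions = {w: hash_partition(w, 4) for w in ["big", "data", "tools"]}
```

A custom partitioner replaces this function with business-specific routing logic, e.g. sending each date range or region to its own reducer.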
Reducer Phase:
Shuffled and sorted data is passed as input to the reducer. In this phase, all incoming data is combined, and the aggregated key-value pairs are written into the HDFS system; the record writer writes data from the reducer to HDFS. A reducer is not mandatory for pure searching and mapping purposes.
Reducer logic operates on the sorted mapper data and finally produces outputs such as part-r-00000, etc. Options are provided to set the number of reducers for each job the user wants to run: in the configuration file mapred-site.xml, we can set properties that determine the number of reducers for a particular task.
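A matching reducer for the hypothetical word-count job can be sketched as (again illustrative Python, not the native Java API): it receives a key together with all of its shuffled, sorted values and writes one aggregated record.

```python
def reducer(key, values):
    """Reduce phase: after shuffle/sort, all values for one key arrive together."""
    return (key, sum(values))

# Shuffled input: each key grouped with every value emitted for it by the mappers.
shuffled = [("big", [1, 1]), ("data", [1])]
results = [reducer(k, vs) for k, vs in shuffled]
print(results)  # [('big', 2), ('data', 1)]
```

Each reducer writes its results to its own part-r-NNNNN file in HDFS; raising the configured reducer count simply spreads the keys over more such files.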
Speculative execution plays an important role during job processing. If a mapper is running slowly, the Job tracker launches a duplicate of the same task on another node, and whichever copy finishes first is used, making the job run faster. Job scheduling itself is FIFO (First In, First Out) by default.
1.3.2 Spark:
Spark provides a faster and more general data processing platform. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte.
• Currently provides APIs in Scala, Java, and Python, with support for other
languages (such as R) on the way
• Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon
S3, Hive, HBase, Cassandra, etc.)
• Can run on clusters managed by Hadoop YARN or Apache Mesos, and can
also run standalone
Spark Architecture:
Apache Spark has a well-defined and layered architecture where all the spark
components and layers are loosely coupled and integrated with various extensions
and libraries.
RDDs are collections of data items that are split into partitions and can be stored in memory on worker nodes of the Spark cluster. In terms of datasets, Apache Spark supports two types of RDDs - Hadoop datasets, which are created from files stored on HDFS, and parallelized collections, which are based on existing Scala collections. Spark RDDs support two different types of operations - transformations and actions.
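The distinction matters because transformations are lazy while actions trigger execution. A toy, pure-Python stand-in for an RDD (a conceptual sketch only, not Spark's actual API) makes this visible:

```python
class ToyRDD:
    """A minimal stand-in for an RDD: transformations are lazy, actions execute."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # recorded lineage of transformations

    def map(self, fn):                 # transformation: records the step, runs nothing
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):            # transformation: same, just extends the lineage
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):                 # action: replays the whole lineage over the data
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; collect() triggers the whole pipeline.
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

The recorded lineage is also what allows a lost partition to be re-built: Spark simply replays the transformations over the original input.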
A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on the same horizontal Spark cluster, on separate machines (i.e. in a vertical Spark cluster), or in a mixed machine configuration.
The Spark driver is the central point and the entry point of the Spark shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the Spark context is created.
The Spark driver contains various components - the DAG scheduler, task scheduler, backend scheduler and block manager - responsible for translating Spark user code into actual Spark jobs executed on the cluster. The driver program runs on the master node of the Spark cluster, schedules the job execution, and negotiates with the cluster manager. It translates RDDs into the execution graph and splits the graph into multiple stages.
The driver stores metadata about all the Resilient Distributed Datasets and their partitions. As the cockpit of job and task execution, the driver program converts a user application into smaller execution units known as tasks, which are then executed by the executors - the worker processes that run individual tasks. The driver exposes information about the running Spark application through a web UI at port 4040.
Executor is a distributed agent responsible for the execution of tasks. Executors run
for the entire lifetime of a Spark application and this phenomenon is known as
“Static Allocation of Executors”.
However, users can also opt for dynamic allocations of executors wherein they can
add or remove spark executors dynamically to match with the overall workload.
The cluster manager is an external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. There are three different types of cluster managers a Spark application can leverage for the allocation and deallocation of physical resources such as memory and CPU cores for client Spark jobs.
Hadoop YARN, Apache Mesos or the simple standalone Spark cluster manager can each be launched on-premise; choosing a cluster manager for a Spark application depends on the goals of the application.
Spark Core is the general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and can reference datasets stored in external storage systems. Spark allows developers to write code quickly with the help of a rich set of operators.
Spark SQL:
Spark SQL is a component on top of Spark Core that introduces a new set of data
abstraction called Schema RDD, which provides support for both the structured and
semi-structured data.
Spark Streaming:
This component allows Spark to process real-time streaming data. It provides an API to manipulate data streams that matches the RDD API, making it easy for programmers to move between applications that process stored data and those that manipulate data and give results in real time.
MLlib:
Apache Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms for classification, clustering, collaborative filtering, etc. It also includes a few lower-level primitives. All these functionalities help Spark scale out across a cluster.
GraphX:
Spark also comes with GraphX, a library for manipulating graphs and performing graph computations, built on top of Spark Core just like Spark Streaming and Spark SQL.
CHAPTER 2
LITERATURE REVIEW
According to a recent big data survey of 274 business and IT decision-makers in Europe, there is a clear trend towards making data available for analysis in (near) real-time, and over 70% of respondents indicate the need for real-time processing.
However, most of the existing big data technologies are designed to achieve high
throughput, but not low latency, probably due to the nature of big data, i.e., high
volume, high diversity and high velocity. The survey also shows that only 9% of the
companies have made progress with faster data availability, due to the complexity
and technical challenges of real-time processing. Most analytic updates and re-
scores must be completed by long-run batch jobs, which significantly delays
presenting the analytic results. Therefore, the whole life-cycle of data analytics
systems (e.g., business intelligence systems) requires innovative techniques and
methodologies that can provide real-time or near real-time results. That is, from the
capturing of real-time business data to transformation and delivery of actionable
information, all these stages in the life-cycle of data analytics require value-added
real-time functionalities, including real-time data feeding from operational sources,
ETL optimization, and generating real-time analytical models and dashboards.
Since the advances toward real time are affecting many enterprise applications, an
increasing number of systems have emerged to support real-time data integration
and analytics in the past few years. Some of the technologies tweaked the existing
cloud-based products to lower the latency when processing large-scale data sets;
while some others came along with other products to constitute a real-time processing system. This has brought a new dawn for enterprises to tackle the real-time challenge.
We believe it is necessary to make a survey of the existing real-time processing
systems. In this paper, we make the following contributions: First, the paper reviews
the MapReduce Hadoop implementation [4]. It discusses the reasons why Hadoop
is not suitable for real-time processing. Then, the paper surveys the real-time
architectures and open source technologies for big data with regards to the following
two categories: data integration and data analytics. In the end, the paper compares
the surveyed technologies in terms of architecture, usability, and failure
recoverability, etc., and also discusses the open issues and the trend of real-time
processing systems.
Marz [2] witnessed the shift in data processing from batch-based to real-time-based. A lot of technologies can be used to create a real-time processing system; however, it is complicated and daunting to choose the right tools, incorporate them, and orchestrate them. A generic solution to this problem is the Lambda Architecture proposed by Marz. This architecture defines the most general system as running arbitrary functions on arbitrary data using the following equation: "query = function (all data)" [7]. The premise behind this architecture is that one can run ad-hoc queries against all the data to get results; however, it is very expensive to do so in terms of resources. So the idea is to pre-compute the results as a set of views, and then query the views. Fig 6 describes the lambda architecture, which consists of three layers. First, the batch layer computes views on the collected data, and repeats the process, once it is done, to infinity; its output is always outdated by the time it is available, because new data has been received in the meantime. Second, a parallel speed processing layer closes this gap by constantly processing the most recent data in near real-time fashion.
Any query against the data is answered through the serving layer, i.e., by querying
both the speed and the batch layers’ serving stores in the serving layer, and the results
are merged to answer user queries.
The batch and speed layers both are forgiving and can recover from errors by being
recomputed, rolled back (batch layer), or simply flushed (speed layer). A core
concept of the lambda architecture is that the (historical) input data is stored
unprocessed. This concept provides the following two advantages. Future
algorithms, analytics, or business cases can retroactively be applied to all the data
by simply adding another view on the whole data. Human errors like bugs can be
corrected by re-computing the whole output after a bug fix. Many systems are built
initially to store only processed or aggregated data.
The lambda architecture[4] can avoid this by storing the raw data as the starting point
for the batch processing. However, the lambda architecture itself is only a paradigm.
The technologies with which the different layers are implemented are independent
from the general idea. The speed layer deals only with new data and compensates
for the high latency updates of the batch layer.
It can typically leverage stream processing systems, such as Storm, S4, and Spark,
etc. The batch serving layer can be very simple in contrast. It needs to be horizontally
scalable and supports random reads, but is only updated by batch processes. The
technologies like Hadoop with Cascading, Scalding, Python streaming, Pig, and
Hive, are suitable for the batch layer. The serving layer requires a system that can
perform fast random reads and writes. The data in the serving layer are replaced by
the fresh real-time data, and the views computed by the batch layer. Therefore, the
size of data is relatively small, and the system in serving layer does not need complex
locking mechanisms since the views are replaced in one operation. In-memory data stores like Redis or Memcached are enough for this layer. If users require high availability, low latency, and storage of large volumes of data in the serving store, systems like HBase, Cassandra, ElephantDB, MongoDB, or DynamoDB can be potential options.
Spark can work on RDDs for multiple iterations, which many machine learning algorithms require. An RDD resides in main memory, but can be persisted to disk as requested; if a partition of an RDD is lost, it can be re-built. Spark also supports shared variables of two kinds, broadcast variables and accumulator variables. A shared variable is typically used where a global value is needed, such as a lookup table or a counter.
Spark Streaming extends Spark by adding the ability to perform online processing through a functional interface similar to Spark's, with operators such as map, filter, reduce, etc. Spark Streaming runs streaming computations as a series of short batch jobs on RDDs, and it can automatically parallelize the jobs across the nodes in a cluster. It supports fault recovery for a wide array of operators.
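The "series of short batch jobs" idea can be mimicked in plain Python (a conceptual sketch only, not the actual DStream API): the incoming stream is cut into small batches, and ordinary batch logic is applied to each.

```python
def micro_batches(stream, batch_size):
    """Cut an (unbounded) event stream into small batches, as Spark Streaming
    does by time interval (here by count, for simplicity)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:               # flush the final, possibly partial, batch
        yield batch

# Each micro-batch is then processed with ordinary batch logic (here: a sum).
events = [3, 1, 4, 1, 5, 9, 2, 6]
sums = [sum(b) for b in micro_batches(events, 3)]
print(sums)  # [8, 15, 8]
```

This is what makes fault recovery cheap: a lost micro-batch is just a small RDD that can be recomputed from its lineage.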
Overview: The MapReduce paradigm and its implementation, Hadoop, are widely used in data-intensive areas and receive ever-growing attention. They are a milestone of the big data era. Since MapReduce was originally invented to process large-scale data in batch jobs in a shared-nothing cluster environment, its implementation is very different from traditional processing systems such as DBMSs. Scalability, efficiency and fault tolerance are the main focuses of MapReduce implementations, rather than real-time capability. In the past few years, many technologies have been proposed to optimize MapReduce-based systems.
These technologies narrow the "gap" between batch and real-time data processing. However, we believe that the decision of selecting the best data processing system depends on the types and sources of data, the processing time needed, and the ability to take immediate actions [9]. The MapReduce paradigm is best suited for batch jobs.
For example, benchmark studies comparing the MapReduce paradigm and parallel DBMSs have shown that DBMSs are substantially faster than MapReduce. Beyond Hadoop 1.x, the next generation of Hadoop, YARN, is implemented as an open framework that can integrate with different third-party real-time processing systems, including Storm, Flume, Spark, etc., which is a promising direction for achieving real-time capability. With the help of these third-party components, YARN compensates for the deficiency of real-time capability in Hadoop 1.x. However, at the time of writing this paper, we still have not found any real-world cases of using YARN for real-time analytics. To the best of our knowledge, most of the current real-time enterprise systems use the three-layer lambda architecture or its variants as discussed in Section 4. The main reason that makes the lambda architecture popular is its high flexibility, in which users can combine different technologies catering to their business requirements; e.g., in the context of legacy systems, the lambda architecture can show its advantage over others.
Many real-time processing systems [5] now seek ways to enable in-memory processing to avoid reading from and writing to disk whenever possible, as well as integration with distributed computing technologies. This improves scalability while maintaining low latency.
2.3 Overview of survey:
In this paper, we first studied the shortcomings of Hadoop in regards to the real-time
data processing, then presented one of the most popular real-time architectures, the
three-layer lambda architecture.
We have compared the real-time systems from the perspectives of architectures, use
cases (i.e., integration or query analytics), recoverability from failures, and license
types. We also found that despite their diversity, the real-time systems share a great similarity in that they heavily use main memory and distributed computing technologies.
CHAPTER 3
EXISTING METHODS
Traditional data systems, such as relational databases and data warehouses, have
been the primary way businesses and organizations have stored and analyzed their
data for the past 30 to 40 years. Although other data stores and technologies exist,
the major percentage of business data can be found in these traditional systems.
Traditional systems are designed from the ground up to work with data that is primarily structured. Characteristics of how these traditional systems handle data include:
A design that reads data from disk and loads it into memory to be processed by applications. This is an extremely inefficient architecture when processing large volumes of data: the data is extremely large and the programs are small, so the big component must move to the small component for processing.
The use of Structured Query Language (SQL) for managing and accessing the data.
Relational and warehouse database systems often read data in 8k or 16k block sizes.
These block sizes load data into memory, and then the data are processed by
applications. When processing large volumes of data, reading the data in these block
sizes is extremely inefficient.
In several traditional siloed environments data scientists can spend 80% of their time
looking for the right data and 20% of the time doing analytics. A data-driven
environment must have data scientists spending a lot more time doing analytics.
Every year organizations need to store more and more detailed information for
longer periods of time. Increased regulation in areas such as health and finance are
significantly increasing storage volumes. Expensive shared storage systems often
store this data because of the critical nature of the information. Shared storage arrays
provide features such as striping (for performance) and mirroring (for availability).
Managing the volume and cost of this data growth within these traditional systems
is usually a stress point for IT organizations. Examples of data often stored in
structured form include Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), financial, retail, and customer data.
3.1.1 Clustering Technique:-
Clustering is a data mining technique used to retrieve important and relevant information about data and metadata. We use it to classify different data into different classes: it segments data records into different groups, called classes. The method of identifying similar groups of data in a data set is called clustering; entities in each group are comparatively more similar to entities of that group than to those of other groups. This section covers the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.
Types of clustering algorithms: Since the task of clustering is subjective, the means
that can be used for achieving this goal are plenty. Every methodology follows a
different set of rules for defining the ‘similarity’ among data points. In fact, there are
more than 100 clustering algorithms known. But few of the algorithms are used
popularly; let’s look at them in detail:
Connectivity models: As the name suggests, these models are based on the notion
that the data points closer in data space exhibit more similarity to each other than the
data points lying farther away. These models can follow two approaches. In the first
approach, they start with classifying all data points into separate clusters & then
aggregating them as the distance decreases.
In the second approach, all data points are classified as a single cluster and then
partitioned as the distance increases. Also, the choice of distance function is
subjective. These models are very easy to interpret but lack scalability for handling
big datasets. Examples of these models are hierarchical clustering algorithm and its
variants.
Centroid models: These are iterative clustering algorithms in which the notion of
similarity is derived by the closeness of a data point to the centroid of the clusters.
K-Means clustering algorithm is a popular algorithm that falls into this category. In
these models, the no. of clusters required at the end have to be mentioned
beforehand, which makes it important to have prior knowledge of the dataset. These
models run iteratively to find the local optima.
Distribution models: These clustering models are based on the notion of how
probable is it that all data points in the cluster belong to the same distribution (For
example: Normal, Gaussian). These models often suffer from overfitting. A popular
example of these models is Expectation-maximization algorithm which uses
multivariate normal distributions.
Density Models: These models search the data space for areas of varied density of
data points in the data space. It isolates various different density regions and assign
the data points within these regions in the same cluster. Popular examples of density
models are DBSCAN and OPTICS.
So the better choice is to place the initial centers as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated: as a result of this loop we may notice that the k centers change their location step by step until no more changes are made, or in other words the centers do not move any more. Finally, this algorithm aims at minimizing an objective function, known as the squared error function, given by:
J(V) = Σ (i=1 to c) Σ (j=1 to ci) (||xi − vj||)²

Where,
‘c’ is the number of cluster centers, ‘ci’ is the number of data points in the i-th cluster, and ‘||xi - vj||’ is the Euclidean distance between xi and vj.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate the new cluster centers as the mean of the points assigned to each cluster:

Vi = (1/ci) Σ (j=1 to ci) Xj

where ‘ci’ is the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 3).
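Steps 1)-6) above can be sketched end-to-end in plain Python (an illustrative implementation for small 2-D data sets, not an optimized library version):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means following steps 1-6 above, for points given as (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # step 1: pick k initial centers
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # steps 2-3: assign to nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)             # step 4: recompute centers as means
        ]
        if new_centers == centers:                       # step 6: stop when centers settle
            return centers, clusters
        centers = new_centers                            # step 5: repeat with new centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centers, clusters = kmeans(points, 2)
```

On this toy data the two well-separated groups of three points each are recovered regardless of which points are drawn as initial centers; on overlapping data the random initialization can change the result, which is exactly the disadvantage noted below.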
Disadvantages:
1) Exclusive assignment - if there are two highly overlapping groups of data, then k-means will not be able to resolve that there are two clusters.
2) The learning algorithm provides only a local optimum of the squared error function.
3) Randomly choosing the cluster centers may not lead to a fruitful result.
Fig 7: Showing the non-linear data set where k-means algorithm fails
4) It is applicable only when the mean is defined, i.e. it fails for categorical data.
Advantages:
1) Gives the best results when the data sets are distinct or well separated from each other.
3.1.3 Drawbacks of Traditional Systems:-
The reason traditional systems have a problem with big data is that they were not
designed for it.
4. Problem—Complexity: When you look at any traditional proprietary solution, it
is full of extremely complex silos of system administrators, DBAs, application
server teams, storage teams, and network teams. Often there is one DBA for every
40 to 50 database servers. Anyone running traditional systems knows that complex
systems fail in complex ways.
➔ Imagine that for a database of 1.1 billion people, one would like to compute
the average number of social contacts a person has according to age. In SQL,
such a query could be expressed as :-
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age;
3.2 Data analysis using Hadoop MapReduce:-
1. Map: each worker node applies the map function to the local data, and writes
the output to a temporary storage. A master node ensures that only one copy
of redundant input data is processed.
2. Shuffle: worker nodes redistribute data based on the output keys (produced
by the map function), such that all data belonging to one key is located on the
same worker node.
3. Reduce: worker nodes now process each group of output data, per key, in
parallel.
MapReduce allows for distributed processing of the map and reduction operations.
Maps can be performed in parallel, provided that each mapping operation is
independent of the others; in practice, this is limited by the number of independent
data sources and/or the number of CPUs near each source.
Similarly, a set of 'reducers' can perform the reduction phase, provided that all
outputs of the map operation that share the same key are presented to the same
reducer at the same time, or that the reduction function is associative. While this
process can often appear inefficient compared to algorithms that are more sequential
(because multiple instances of the reduction process must be run), MapReduce can
be applied to significantly larger datasets than a single "commodity" server can
handle – a large server farm can use MapReduce to sort a petabyte of data in only a
few hours.
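The Map, Shuffle, and Reduce phases described above can be sketched as a toy single-process framework (purely illustrative; a real framework distributes these phases across worker nodes):

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Run the three phases in-process: mapper(k1, v1) yields (k2, v2)
    pairs; reducer(k2, [v2, ...]) returns the reduced value for one key."""
    # Map phase: apply the mapper to every input pair independently.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # Shuffle phase: group all values that share the same key k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: each key group is reduced independently.
    return {k2: reducer(k2, vs) for k2, vs in groups.items()}

# Classic word-count example.
def wc_map(_, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

docs = [(1, "big data big ideas"), (2, "big clusters")]
print(map_reduce(docs, wc_map, wc_reduce))
# {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```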
3.2.1 Logical view:-
→The Map and Reduce functions of MapReduce are both defined with respect to
data structured in (key, value) pairs. Map takes one pair of data with a type in
one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2).
→The Map function is applied in parallel to every pair (keyed by K1) in the input
dataset. This produces a list of pairs (keyed by K2) for each call. After that, the
MapReduce framework collects all pairs with the same key (K2) from all lists and
groups them together, creating one group for each key.
→The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3).
→Each Reduce call typically produces either one value v3 or an empty return,
though one call is allowed to return more than one value. The returns of all calls are
collected as the desired result list.
→Thus the MapReduce framework transforms a list of (key, value) pairs into a list
of values. This behavior is different from the typical functional programming map
and reduce combination, which accepts a list of arbitrary values and returns one
single value that combines all the values returned by map.
→It is necessary but not sufficient to have implementations of the map and reduce
abstractions in order to implement MapReduce. Distributed implementations of
MapReduce require a means of connecting the processes performing the Map and
Reduce phases. This may be a distributed file system. Other options are possible,
such as direct streaming from mappers to reducers, or for the mapping processors to
serve up their results to reducers that query them.
Example using Hadoop MapReduce: -
➔ Using MapReduce, the K1 key values could be the integers 1 through 1100,
each representing a batch of 1 million records, the K2 key value could be a
person's age in years, and this computation could be achieved using the
following functions :-
function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million
           social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record (Y, (N, 1))
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record (Y, (N, C)) do
        accumulate in S the sum of N*C
        accumulate in Cnew the sum of C
    repeat
    let A be S/Cnew
    produce one output record (Y, (A, Cnew))
end function
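For illustration, the pseudocode above can be run as ordinary Python over a small made-up sample (the 1100 batches of a million records are scaled down to two tiny batches, and the ages and contact counts are invented):

```python
from collections import defaultdict

def map_batch(batch):
    """Map: for each (age, contacts) record, emit (Y, (N, 1))."""
    for age, contacts in batch:
        yield age, (contacts, 1)

def reduce_age(age, records):
    """Reduce: combine partial (N, C) pairs into (average, count)."""
    s = sum(n * c for n, c in records)      # accumulate S = sum of N*C
    cnt = sum(c for _, c in records)        # accumulate Cnew = sum of C
    return age, (s / cnt, cnt)              # A = S/Cnew

# Two small "batches" standing in for the 1100 batches of a million records.
batches = [
    [(25, 100), (30, 150), (25, 300)],
    [(30, 250), (40, 80)],
]

# Shuffle: group the mapped pairs by age.
grouped = defaultdict(list)
for batch in batches:
    for age, pair in map_batch(batch):
        grouped[age].append(pair)

results = sorted(reduce_age(age, pairs) for age, pairs in grouped.items())
print(results)  # [(25, (200.0, 2)), (30, (200.0, 2)), (40, (80.0, 1))]
```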
3.2.2 Drawbacks of Hadoop:-
1. Small files:-
→Hadoop is not suited to small data. The Hadoop Distributed File System (HDFS)
lacks the ability to efficiently support random reads of small files because of its
high-capacity design.
2. Latency:-
→In Hadoop, MapReduce processes data sets with a parallel, distributed algorithm.
There are two tasks to perform, Map and Reduce, and MapReduce requires a lot of
time to perform these tasks, thereby increasing latency. Data is distributed and
processed over the cluster in MapReduce, which increases the time and reduces
processing speed.
3. Security:-
→Hadoop relies on Kerberos for authentication, which is difficult to manage.
4. No real-time processing:-
→Apache Hadoop is designed for batch processing: it takes a huge amount of data
as input, processes it, and produces the result. Hadoop is not suitable for real-time
data processing.
→Since Hadoop supports batch processing only and does not process streamed data,
overall performance is slower. The MapReduce framework of Hadoop does not
leverage the memory of the Hadoop cluster to the maximum.
5. Uncertainty:-
→Hadoop only ensures that a data job completes; it is unable to guarantee when
the job will complete.
6. Lengthy code:-
→Hadoop has about 1,20,000 lines of code; a larger number of lines produces a
larger number of bugs, and programs take more time to execute.
7. No caching:-
→Hadoop is not efficient for caching. In Hadoop, MapReduce cannot cache the
intermediate data in memory for further requirements, which diminishes the
performance of Hadoop.
NOTE:- As a result of the limitations of Hadoop, the need for Spark and Flink
emerged, which made systems friendlier for working with huge amounts of data.
Spark provides in-memory processing of data, which improves processing speed.
Flink improves overall performance, as it provides a single run-time for streaming
as well as batch processing.
CHAPTER 4
PROPOSED SYSTEM
Twitter is a gold mine of data. Unlike other social platforms, almost every user's
tweets are completely public and pullable. This is a huge plus if you're trying to get
a large amount of data to run analytics on. Twitter data is also specific: Twitter's
API allows you to do complex queries, like pulling every tweet about a certain topic
within the last twenty minutes, or pulling a certain user's non-retweeted tweets.
A simple application of this could be analyzing how your company is received in the
public. You could collect the last 2,000 tweets that mention your company (or any
term you like) and run a sentiment analysis algorithm over it.
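That workflow can be sketched with Tweepy's pagination helper (a hedged sketch assuming the Tweepy 3.x search API; the import is guarded so the snippet stays self-contained, and `api` is presumed to be an already-authenticated client):

```python
try:
    import tweepy  # third-party: pip install tweepy
except ImportError:
    tweepy = None

def recent_mentions(api, term, limit=2000):
    """Pull up to `limit` recent tweets mentioning `term` via the search
    endpoint, returning their text for a downstream sentiment step."""
    return [
        status.text
        for status in tweepy.Cursor(api.search, q=term).items(limit)
    ]
```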
Tools Overview:
We'll be using Python 3.6. Ideally, you should have an IDE to write this
code in. To connect to Twitter's API, we will be using a Python library
called Tweepy, which we'll install in a bit.
Getting Started:
• To use Twitter’s API, we must create a developer account on the Twitter apps
site.
• Log in or make a Twitter account at https://apps.twitter.com/.
• Create a new app
• Fill in the app creation page with a unique name, a website name (use a
placeholder website if you don’t have one), and a project description. Accept
the terms and conditions and proceed to the next page.
• Once your project has been created, click on the "Keys and Access Tokens"
tab. You should now be able to see your consumer secret and consumer key.
You'll also need a pair of access tokens; request those tokens, and the page
will refresh to show them.
Tweepy is an excellently supported tool for accessing the Twitter API. It supports
Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6. There are a couple of different ways to install
Tweepy. The easiest way is using pip.
Using Pip: -Simply type pip install tweepy into your terminal.
Step 3: -Authenticating
First, let's import Tweepy and add our own authentication information (the values
below are masked placeholders; note that a consumer key is needed alongside the
consumer secret):
import tweepy
consumer_key="xXXXXXXXXXXXXXXXXXXXXXXXx"
consumer_secret="qXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXh"
access_token="9XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXi"
access_token_secret="kXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXT"
Now it’s time to create our API object.
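A sketch of that step, assuming the Tweepy 3.x API (the four credential values are the placeholders gathered above; the helper name `make_api` is our own, and the import is guarded so the sketch stays self-contained):

```python
try:
    import tweepy  # third-party: pip install tweepy
except ImportError:
    tweepy = None

def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated Tweepy API object from the four credentials
    gathered on the 'Keys and Access Tokens' tab."""
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit makes Tweepy sleep instead of failing when
    # Twitter's rate limits are hit.
    return tweepy.API(auth, wait_on_rate_limit=True)
```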
→Prerequisites:
Installing Hadoop:
Step 1: Download the Java 8 package from the official website. Save this file in your
home directory.
Command: tar -xvf hadoop-2.7.3.tar.gz
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open .bashrc file. Now, add Hadoop and Java Path as shown below.
For applying all these changes to the current Terminal, execute the source command.
To make sure that Java and Hadoop have been properly installed on your system
and can be accessed through the Terminal, execute the java -version and hadoop
version commands.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside
configuration tag:
Command: vi core-site.xml
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag:
Command: vi hdfs-site.xml
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:
mapred-site.xml contains the configuration settings of the MapReduce application,
such as the number of JVMs that can run in parallel, the size of the mapper and
reducer processes, the CPU cores available for a process, etc.
In some cases, the mapred-site.xml file is not available, so we have to create it
from the mapred-site.xml template.
Command: vi mapred-site.xml
Step10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:
Command: vi yarn-site.xml
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run
Hadoop like Java home path, etc.
Command: vi hadoop-env.sh
Command: cd
Command: cd hadoop-2.7.3
This formats HDFS via the NameNode. This command is executed only the first
time. Formatting the file system means initializing the directory specified by the
dfs.name.dir variable.
Never format an up-and-running Hadoop file system: you will lose all the data
stored in HDFS.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
→The above command is a combination of start-dfs.sh, start-yarn.sh & mr-
jobhistory-daemon.sh
Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree
of all files stored in HDFS and tracks all the files stored across the cluster.
Start DataNode:
Start ResourceManager:
The ResourceManager is the master that arbitrates all the available cluster resources
and thus helps in managing the distributed applications running on the YARN
system. Its job is to manage each NodeManager and each application's
ApplicationMaster.
Start NodeManager:
The NodeManager in each machine framework is the agent which is responsible for
managing containers, monitoring their resource usage and reporting the same to the
ResourceManager.
→After installing Hadoop successfully, download Eclipse from the official website,
add the Hadoop jar files to Eclipse, and run the Hadoop Java code in Eclipse.
Prerequisites:
→Unzip the folder in your home directory using the following command.
→Use the following command to see that you have a .bashrc file: ls -a
→Next, we will edit our .bashrc so we can open a spark notebook in any directory by
using command gedit ~/.bashrc and export the following:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
→Save and exit out of your .bashrc file. Either close the terminal and open a new one
or in your terminal type : source ~/.bashrc
2. Using Twitter, send the extracted data into another host where it needs to be
processed.
3. The solution provided for streaming real-time log data is to extract the Twitter
data.
4. Provide a file which contains the keywords of the specified Twitter data,
according to the need, in the Spark processing logic.
5. The processing logic should be written in Spark-Scala or Spark-Java and stored
in HDFS for data-processing purposes.
6. Use Twitter to send the streaming data into another port.
7. Use Spark Streaming to receive the data from the port and check which data
contains the specified information (according to the code written for the type
of data needed), extract that data, and store it into HDFS or HBase.
8. Categorize the stored data using Tableau visualization.
4.2.1 Data flow in spark streaming: -
Internally, it works as follows. Spark Streaming receives live input data streams and
divides the data into batches, which are then processed by the Spark engine to
generate the final stream of results in batches.
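This batching behaviour can be simulated in plain Python (`batch_size` stands in for Spark Streaming's batch interval, and the keyword filter stands in for the per-batch processing logic of the pipeline; both are illustrative):

```python
def micro_batches(stream, batch_size):
    """Chop a continuous stream of records into fixed-size batches,
    the way Spark Streaming divides live input into batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

def keyword_filter(batch, keywords):
    """Per-batch processing step: keep only tweets mentioning a keyword."""
    return [t for t in batch if any(k.lower() in t.lower() for k in keywords)]

tweets = [
    "Loving #spark streaming",
    "lunch photos",
    "hadoop vs spark debate",
    "weather today",
    "Spark is fast",
]
for batch in micro_batches(tweets, 2):
    print(keyword_filter(batch, ["spark"]))
```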
DStream: -
Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream, which represents a continuous stream of data. Internally, a DStream is
represented as a sequence of RDDs.
1) Data collection: -
It is the major process in Spark streaming. Collecting data or log files from web
servers can be done in many ways. Data can be collected from Twitter, Facebook,
or the cloud, and then it must be stored in a centralized store for processing. We
have used Tweepy for collecting data.
HDFS:
The Hadoop Distributed File System was developed using distributed file system
design and runs on commodity hardware. Unlike other distributed systems, HDFS
is highly fault tolerant and designed using low-cost hardware. HDFS holds very
large amounts of data and provides easy access. To store such huge data, the files
are stored across multiple machines. These files are stored in a redundant fashion to
rescue the system from possible data loss in case of failure. HDFS also makes
applications available for parallel processing.
Hardware Failure: -
Hardware failure is the norm rather than the exception. An HDFS instance may
consist of hundreds or thousands of server machines, each storing part of the file
system's data. The fact that there are a huge number of components and that each
component has a non-trivial probability of failure means that some component of
HDFS is always non-functional. Therefore, detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS.
Streaming Data Access: -
Applications that run on HDFS need streaming access to their data sets. They are
not general-purpose applications that typically run on general-purpose file systems.
HDFS is designed more for batch processing than for interactive use by users. The
emphasis is on high throughput of data access rather than low latency of data
access. POSIX imposes many hard requirements that are not needed for applications
that are targeted for HDFS. POSIX semantics in a few key areas have been traded
to increase data throughput rates.
Large Data Sets: -
Applications that run on HDFS have large data sets. A typical file in HDFS is
gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should
provide high aggregate data bandwidth and scale to hundreds of nodes in a single
cluster. It should support tens of millions of files in a single instance.
Simple Coherency Model: -
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high-throughput data access. A MapReduce
application or a web crawler application fits perfectly with this model. There is a
plan to support appending-writes to files in the future.
Portability Across Heterogeneous Hardware and Software Platforms: -
HDFS has been designed to be easily portable from one platform to another. This
facilitates widespread adoption of HDFS as a platform of choice for a large set of
applications.
2) Data Processing:
Data processing is done in the Spark shell. Data can be processed using
programming languages such as Python, Java, or Scala. Here we have used Python
for processing the data or log files that are stored in the HDFS file system. The
processed data must be stored again in a file system for visualizing it.
What is Python?
Python's simple, easy to learn syntax emphasizes readability and therefore reduces
the cost of program maintenance. Python supports modules and packages, which
encourages program modularity and code reuse. The Python interpreter and the
extensive standard library are available in source or binary form without charge for
all major platforms, and can be freely distributed.
After executing the code in python language in spark shell, the data that is processed
need to be visualized again. The data visualization process is done as following:
3) Data visualization:
Because of the way the human brain processes information, using charts or graphs
to visualize large amounts of complex data is easier than poring over spreadsheets.
Sentiment Analysis:
Sentiment analysis is also referred to as subjectivity analysis, opinion mining, and
appraisal extraction. In this project, TextBlob is used to analyze and visualize
tweets.
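A minimal sketch of scoring a tweet with TextBlob (the import is guarded since TextBlob is a third-party package; the thresholding into positive/negative/neutral is a common convention, not the only choice):

```python
try:
    from textblob import TextBlob  # third-party: pip install textblob
except ImportError:
    TextBlob = None

def classify(tweet):
    """Return 'positive', 'negative' or 'neutral' for a tweet using
    TextBlob's polarity score, which ranges from -1.0 to +1.0."""
    polarity = TextBlob(tweet).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"
```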
Why Sentiment Analysis?
1. Data Storage: Hadoop stores data on disk, whereas Spark stores data in-memory.
3. Lines of code: Hadoop 2.0 has about 1,20,000 lines of code, whereas Apache
Spark has only about 20,000 lines of code.
5. Streaming data: With Hadoop MapReduce one can process batches of stored
data, whereas Spark can process as well as modify real-time data with Spark
Streaming.
10. Security: Apache Hadoop MapReduce is more secure because of Kerberos,
whereas Spark is a little less secure in comparison because it supports only
authentication through a shared secret password.
CHAPTER 5
Results from Hadoop Word Count:
CHAPTER 6
CONCLUSION
In this project, we showed how Apache Spark can analyze Twitter data in real time.
In our experimental setup, Spark analyzed the tweets in a short period, less than a
second, which proves that Spark is a good tool for processing streaming data in
real time. If the data is static and it is possible to wait for the end of batch
processing, MapReduce is enough; but to analyze streaming data, it is necessary to
use Spark. As shown in our study, Spark was able to analyze the data in a few
seconds. Spark is a high-quality tool for in-memory processing that allows
processing streaming data in real time on large amounts of data. Apache Spark is
much more advanced than MapReduce: it supports several requirements such as
real-time, batch, and streaming processing.
Hence, the differences between Apache Spark and Hadoop MapReduce show that
Apache Spark is a much more advanced cluster-computing engine than MapReduce.
Spark can handle any type of requirement (batch, interactive, iterative, streaming,
graph), while MapReduce is limited to batch processing. Spark is one of the favorite
choices of data scientists. Thus, Apache Spark is growing very quickly and
replacing MapReduce.
CHAPTER 7
FUTURE WORK
The framework Apache Flink surpasses Apache Spark. Further research should
focus on optimizing Hadoop and Spark by tuning their default parameter
configuration settings; we suggest applying such tuning to improve performance.
The key vision for Apache Flink is to overcome and reduce the complexity faced
by other distributed data-driven engines. This is achieved by integrating query
optimization and concepts from database systems with efficient parallel in-memory
and out-of-core algorithms, on top of the MapReduce framework.
As Apache Flink is mainly based on the streaming model, it iterates over data using
a streaming architecture. The concept of an iterative algorithm is tightly bound into
the Flink query optimizer. Apache Flink's pipelined architecture allows processing
streaming data faster, with lower latency, than micro-batch architectures such as
Spark's. This optimization issue can therefore be addressed by Apache Flink, which
is much faster than Spark: Flink increases job performance by processing only the
parts of the data that have changed. We plan to address this issue in the future using
Apache Flink.
CHAPTER 8
REFERENCES
[2] Dean, J., & Ghemawat, S. (2010). MapReduce: a flexible data processing tool.
Communications of the ACM, 53(1), 72-77.
[3] Patel, A. B., Birla, M., & Nair, U. (2012, December). Addressing big data
problem using Hadoop and MapReduce. In 2012 Nirma University International
Conference on Engineering (NUiCONE) (pp. 1-5). IEEE.
[4] Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark:
lightning-fast big data analysis. O'Reilly Media, Inc.
[14] Hadoop, https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/SingleCluster.html
[16] https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
[18] https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/