
Big Data Analytics (BDA)

GTU #3170722

Unit-5

Spark
 Outline
• Introduction to Data Analysis with Spark
• Features of Apache Spark
• Components of Spark
• Downloading Spark and Getting Started
• RDD Transformations
• RDD Actions
• Programming with RDDs
• Machine Learning with MLlib
• Spark Applications
Basics of Spark
 Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
 It was built on top of Hadoop MapReduce, and it extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
 One of the main features Spark offers for speed is the ability to run computations in memory,
but the system is also more efficient than MapReduce for complex applications running on
disk.
 Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming.
 Apart from supporting all these workloads in the same engine, it reduces the management
burden of maintaining separate tools.
 Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL,
and rich built-in libraries. It also integrates closely with other Big Data tools.
 Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
Features of Spark
 Speed
 Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster
when running on disk.
 This is possible by reducing the number of read/write operations to disk.
 It stores the intermediate processing data in memory.
 Supports multiple languages
 Spark provides built-in APIs in Java, Scala, or Python.
 Therefore, you can write applications in different languages.
 Spark comes with 80 high-level operators for interactive querying.
 Advanced Analytics
 Spark supports not only ‘map’ and ‘reduce’ operations.
 It also supports SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.
Components of Spark
Apache Spark consists of six components. These are described below.

Cluster Managers

• Spark is designed to scale efficiently from one to thousands of compute nodes.
• To achieve this with maximum flexibility, Spark can run over a variety of cluster managers, including Apache
Mesos, Hadoop YARN, and a simple cluster manager included in Spark itself, known as the Standalone
Scheduler.
• Spark's support for these cluster managers allows your applications to run on any of them.
Components of Spark– Cont.
 Spark Core
 The Spark Core is the heart of Spark and performs the core functionality.
 It holds the components for task scheduling, fault recovery, interacting with storage systems and memory
management.

 Spark SQL
 Spark SQL is built on top of Spark Core. It provides support for structured data.
 It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL
called HQL (Hive Query Language).
 It supports JDBC and ODBC connections that establish a relation between Java objects and existing
databases, data warehouses and business intelligence tools.
 It also supports various sources of data like Hive tables, Parquet, and JSON.
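
A minimal PySpark sketch of these ideas (an illustration, not part of the original slides): it assumes a local JSON file named people.json with name and age fields.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load semi-structured JSON data into a DataFrame (people.json is a hypothetical input).
df = spark.read.json("people.json")

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("people")

# Run a standard SQL query against the view.
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()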
Components of Spark– Cont.
 Spark Streaming
 Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming
data.
 It uses Spark Core's fast scheduling capability to perform streaming analytics.
 It accepts data in mini-batches and performs RDD transformations on that data.
 Its design ensures that the applications written for streaming data can be reused to analyze batches of
historical data with little modification.
 The log files generated by web servers can be considered as a real-time example of a data stream.
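
As a hedged illustration of mini-batch processing with the classic DStream API, the sketch below assumes text lines arriving on a local TCP socket at port 9999 (for example, one opened with nc -lk 9999).

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")   # at least two threads: one for the receiver
ssc = StreamingContext(sc, batchDuration=5)           # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)

# The familiar RDD-style transformations are applied to every mini-batch.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()   # print the first few results of each batch

ssc.start()
ssc.awaitTermination()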

 MLlib (Machine Learning Library)


 MLlib is a machine learning library that contains various machine learning algorithms.
 These include correlations and hypothesis testing, classification and regression, clustering, and principal
component analysis.
 It is nine times faster than the disk-based implementation used by Apache Mahout.
Components of Spark– Cont.
 GraphX
 GraphX is a library used to manipulate graphs and perform graph-parallel computations.
 It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
 To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices, and
aggregateMessages.

 SparkR
 R is a programming language commonly used by data scientists due to its simplicity and its capacity for
running sophisticated algorithms.
 However, R's biggest drawback is that it computes on a single node, so on its own it cannot process
massive amounts of data. SparkR addresses this by letting R programs use Spark's distributed engine.
Components of Spark– Cont.
 Please visit the link below for installing Spark on Windows:

http://www.labsom.ct.utfpr.edu.br/mrosa/computacao-nuvem-seguranca-redes/material-extra/how-to-install-apache-spark-on-windows-10.pdf

 Please visit the link below for installing Spark on Linux:

https://www.tutorialspoint.com/apache-spark/apache-spark-installation.htm

 To download Spark, please visit the link below:

https://spark.apache.org/downloads.html
RDD Operations
 RDDs provide two types of operations:
 Transformation
 Action

 Transformation
 In Spark, the role of transformation is to create a new dataset from an existing one.
 Transformations are considered lazy, as they are only computed when an action requires a result to be
returned to the driver program.

 Action
 In Spark, the role of action is to return a value to the driver program after running a computation on the
dataset.
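
A small illustrative sketch of the difference, assuming an existing SparkContext named sc:

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: nothing is computed yet; Spark only records the lineage.
squares = nums.map(lambda x: x * x)

# Actions: trigger the actual computation and return values to the driver program.
print(squares.collect())   # [1, 4, 9, 16, 25]
print(squares.count())     # 5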
RDD Transformations
 The following are some of the common transformations supported by Spark:

map(func): Returns a new RDD by applying the func function to each data element of the source RDD.

filter(func): Returns a new RDD by selecting those data elements for which the applied func function returns true.

flatMap(func): Similar to map; the difference is that each input item can be mapped to zero or more output items (the applied func function should return a Seq).

union(otherRdd): Returns a new RDD that contains the union of the elements in the source RDD and the otherRdd argument.

distinct([numPartitions]): Returns a new RDD that contains only the distinct elements of the source RDD.

groupByKey([numPartitions]): When called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable<V>) pairs. By default, the level of parallelism in the output RDD depends on the number of partitions of the source RDD; you can pass an optional numPartitions argument to set a different number of partitions.

reduceByKey(func, [numPartitions]): When called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce func function, which must be of type (V, V) => V. As with groupByKey, the number of reduce partitions is configurable through an optional numPartitions second argument.

sortByKey([ascending], [numPartitions]): When called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending argument. The number of partitions for the output RDD is configurable through an optional numPartitions second argument.

join(otherRdd, [numPartitions]): When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key. It also supports left outer join, right outer join, and full outer join. The number of partitions for the output RDD is configurable through an optional numPartitions second argument.
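
A short PySpark sketch exercising a few of these transformations (a minimal illustration, assuming an existing SparkContext named sc; the sample data is made up):

words = sc.parallelize(["spark", "hadoop", "spark", "hive"])

pairs   = words.map(lambda w: (w, 1))            # (K, V) pairs
counts  = pairs.reduceByKey(lambda a, b: a + b)  # aggregate values per key
ordered = counts.sortByKey(ascending=True)       # sort by key
longer  = words.filter(lambda w: len(w) > 4)     # keep words longer than 4 characters
letters = words.flatMap(lambda w: list(w))       # zero or more output items per input item

# All of the above are lazy; an action such as collect() triggers execution.
print(ordered.collect())   # [('hadoop', 1), ('hive', 1), ('spark', 2)]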
RDD Actions
 The following are some of the common actions supported by Spark:

reduce(func): Aggregates the elements of an RDD using the given func function (which takes two arguments and returns one). To ensure correct parallelism at compute time, the reduce function func has to be commutative and associative.

collect(): Returns all the elements of an RDD as an array to the driver.

count(): Returns the total number of elements in an RDD.

first(): Returns the first element of an RDD.

take(n): Returns an array containing the first n elements of an RDD.

foreach(func): Executes the func function on each element of an RDD.

saveAsTextFile(path): Writes the elements of an RDD as a text file in a given directory (with the location specified through the path argument) in the local filesystem, HDFS, or any other Hadoop-supported filesystem.

countByKey(): Only available on RDDs of type (K, V); returns a hashmap of (K, Int) pairs, where K is a key of the source RDD and its value is the count for that key.

takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either their natural order or a custom comparator.

saveAsSequenceFile(path): Writes the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS, or any other Hadoop-supported filesystem (Java and Scala).
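
A short PySpark sketch of several of these actions (a minimal illustration, assuming an existing SparkContext named sc; output_dir is a hypothetical path that must not already exist):

rdd = sc.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.reduce(lambda a, b: a + b))   # 23 (sum of all elements)
print(rdd.count())                      # 6
print(rdd.first())                      # 3
print(rdd.take(3))                      # [3, 1, 4]
print(rdd.takeOrdered(3))               # [1, 1, 3]

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.countByKey())               # counts per key: {'a': 2, 'b': 1}

# Writes one text file per partition under the given output directory.
rdd.saveAsTextFile("output_dir")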
Programming with RDDs
 Please visit the links below:
 https://subscription.packtpub.com/book/data/9781788994613/1/ch01lvl1sec04/rdd-programming

 https://www.javatpoint.com/apache-spark-word-count-example

 Also, I have put the PDF file ProgrammingwithRDDsandDataframes.pdf in the ppt folder.
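
For convenience, here is a minimal word-count sketch along the lines of the linked example (the input path input.txt is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)

sc.stop()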


Machine Learning with MLlib
 Please visit the link below:
 https://spark.rstudio.com/guides/mlib.html

 Also, I have put the PDF file Machine Learning with MLlib.pdf in the ppt folder.
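
The linked guide uses R (sparklyr); as a rough Python equivalent, the sketch below uses the DataFrame-based pyspark.ml API with a tiny made-up dataset.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# Each row: a label (0.0/1.0) and a feature vector (illustrative values only).
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (0.0, Vectors.dense([0.1, 1.2])),
    (1.0, Vectors.dense([1.9, 0.8])),
], ["label", "features"])

# Fit a logistic regression model and inspect its parameters.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients, model.intercept)

spark.stop()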
Spark Architecture
 Apache Spark has a well-defined multi-layer architecture in which all Spark components and
layers are loosely coupled.
 The architecture is further integrated with various extensions and libraries.
 The Apache Spark architecture is based on two main abstractions:
1. Resilient Distributed Dataset (RDD)
2. Directed Acyclic Graph (DAG)
Resilient Distributed Dataset(RDD)
 RDDs are the building blocks of any Spark application.
 RDD stands for:
 Resilient: Fault-tolerant and capable of rebuilding data on failure
 Distributed: Data distributed among multiple nodes in a cluster
 Dataset: A collection of partitioned data with values
 It is a layer of abstracted data over the distributed collection.
 It is immutable in nature and follows lazy transformations.
 Once you create an RDD, it becomes immutable: an object whose state cannot be modified after
it is created, but which can be transformed into new RDDs.
 RDDs are highly resilient, i.e., they are able to recover quickly from any issues, as the same data
chunks are replicated across multiple executor nodes.
 Even if one executor node fails, another will still process the data.
Resilient Distributed Dataset(RDD) – Cont.
 In a distributed environment, each dataset in an RDD is divided into logical partitions, which may be
computed on different nodes of the cluster.
 Due to this, you can perform transformations or actions on the complete data in parallel.
 Also, you don’t have to worry about the distribution, because Spark takes care of that.
Two ways to create RDDs
 There are two ways to create RDDs: by parallelizing an existing collection in your driver program,
or by referencing a dataset in an external storage system, such as a shared file system, HDFS,
HBase, etc. (both ways are sketched after this list).
 With RDDs, you can perform two types of operations:
 Transformations: Operations that are applied to create a new RDD.
 Actions: Operations applied on an RDD that instruct Apache Spark to apply the computation and pass
the result back to the driver.
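
Both ways, followed by a transformation and an action, in a minimal sketch (assuming an existing SparkContext named sc; the HDFS path is hypothetical):

# 1. Parallelize an existing collection in the driver program.
rdd_from_list = sc.parallelize([10, 20, 30, 40])

# 2. Reference a dataset in external storage (local filesystem, HDFS, etc.).
rdd_from_file = sc.textFile("hdfs:///user/demo/data.txt")

# Transformation (lazy) followed by an action (returns the result to the driver).
print(rdd_from_list.map(lambda x: x * 2).collect())   # [20, 40, 60, 80]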
Working of Spark Architecture – Master Node
 The master node has the driver program, which drives your application.
 The code you write behaves as the driver program, or, if you are using the interactive shell, the shell acts
as the driver program.
 The driver program creates a Spark Context, which is the gateway to all Spark functionality.
 It is similar to a database connection: any command you execute in your database goes through the
database connection; likewise, anything you do on Spark goes through the Spark Context.
 The Spark Context works with the cluster manager to manage various jobs.
 The driver program and Spark Context take care of the
job execution within the cluster.
 A job is split into multiple tasks, which are distributed
over the worker nodes.
 Anytime an RDD is created in Spark context, it can be
distributed across various nodes and can be cached
there.
Working of Spark Architecture – Worker Nodes
 Worker nodes are the slave nodes whose job is to execute the tasks.
 These tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned
to the Spark Context.
 The Spark Context takes the job, breaks it into tasks, and distributes them to the worker nodes.
 These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to
the main Spark Context.
 If you increase the number of workers, you can
divide jobs into more partitions and execute them
in parallel over multiple systems, which is much faster.
 As the number of workers increases, so does the
available memory, and you can cache data so jobs
execute faster.
Interactive Spark with PySpark
 PySpark is the Python API for Spark. Spark is an open-source cluster computing system used for
big data solutions; it is a lightning-fast technology designed for fast computation.
 PySpark supports Python on Apache Spark through the Py4J library, with the help of which
Python can be easily integrated with Apache Spark.
 PySpark plays an essential role when you need to work with or analyze vast datasets.
 This makes PySpark a tool in very high demand among data engineers.
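
A minimal PySpark sketch: start a local session and run a small computation.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("HelloPySpark")
         .getOrCreate())

# Use the underlying SparkContext to create and reduce an RDD.
rdd = spark.sparkContext.parallelize(range(1, 6))
print(rdd.sum())   # 15

spark.stop()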
PySpark
 Real-time Computation
 PySpark provides real-time computation on large amounts of data
because it focuses on in-memory processing, offering low latency.
 Supports Multiple Languages
 The Spark framework is compatible with various programming languages
such as Scala, Java, Python, and R. This compatibility makes it a preferable
framework for processing huge datasets.
 Caching and Disk Persistence
 The PySpark framework provides powerful caching and good disk persistence
(see the sketch after this list).
 Swift Processing
 PySpark allows us to achieve high data processing speed, which is
about 100 times faster in memory and 10 times faster on disk.
 Works Well with RDDs
 The Python programming language is dynamically typed, which helps when
working with RDDs. We will learn more about RDDs using Python in a
later tutorial.
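
A minimal caching/persistence sketch, assuming an existing SparkContext named sc; the log file path is hypothetical.

from pyspark import StorageLevel

logs = sc.textFile("access.log")                    # hypothetical input file
errors = logs.filter(lambda line: "ERROR" in line)

errors.cache()                                      # keep the filtered RDD in memory
# errors.persist(StorageLevel.MEMORY_AND_DISK)      # or allow spilling to disk

print(errors.count())   # the first action computes and caches the RDD
print(errors.count())   # later actions reuse the cached data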
Why PySpark?
 A large amount of data is generated offline and online.
 These data contain hidden patterns, unknown correlations, market trends, customer
preferences, and other useful business information.
 It is necessary to extract valuable information from the raw data.

 We require a more efficient tool to perform different types of operations on big data.
 Scalable and flexible tools are needed to crack big data and gain benefit from it.
Application of PySpark
 Entertainment Industry
 Commercial Sector
 Healthcare
 Trades and E-commerce
 Tourism Industry
