
Spark RDD

Dr. S.K. Sudarsanam

VIT Business School


VIT University
“Data is the new science. Big Data holds the answers.”
– Pat Gelsinger, CEO of VMware
This session covers

a) Introduction to SPARK
b) Need for SPARK
c) Key Features of SPARK
d) Use Cases of SPARK
SPARK
• Started as a research project in UC
Berkeley's AMPLab in 2009
• Matei Zaharia is a Romanian-
Canadian computer scientist and the
creator of Apache Spark
• Open-sourced in 2010 and later donated
to the Apache Software Foundation
• Apache Spark became a top-level
Apache project in 2014
• Zaharia is a co-founder and CTO of
Databricks
• Databricks set a new world record in
large scale sorting using Spark
SPARK
• Apache Spark is an open-source, distributed, general-purpose cluster-computing framework.
• Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Why SPARK
• Speed – Speed is the main reason for Spark's popularity: it can process data up to 100 times faster than Hadoop MapReduce when working in memory, and about 10 times faster when running on disk, because it reduces the number of disk read/write operations and keeps intermediate results in memory.
• Cost – It is cost-effective because it needs fewer resources to do the same work.
• Compatibility – Spark is compatible with the Hadoop resource manager and runs alongside Hadoop just like MapReduce. Other resource managers such as YARN and Mesos are also compatible with Spark.
Why SPARK
• Real-time Processing – Another reason for Spark's popularity is that it supports real-time (stream) processing in addition to batch processing. It remains in high demand because of its in-memory processing feature.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also ships with 80 high-level operators for interactive querying (see the word-count sketch below).
• Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
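To illustrate how concise these high-level operators are, here is a minimal Scala sketch of a word count. The application name and the input file "input.txt" are placeholder assumptions, not part of the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; "input.txt" is a hypothetical file.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")       // read lines
      .flatMap(line => line.split("\\s+"))      // split each line into words
      .map(word => (word, 1))                   // pair each word with a count of 1
      .reduceByKey(_ + _)                       // sum the counts per word

    counts.take(10).foreach(println)            // action: triggers execution
    sc.stop()
  }
}
```

Four operators express the whole job; the equivalent MapReduce program would need considerably more boilerplate.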
Spark on Hadoop

[Diagram: three ways of running Spark with Hadoop – Stand-alone (Spark directly on top of HDFS), Hadoop YARN (Spark on YARN over HDFS, alongside MapReduce), and Spark in MapReduce (Spark launched inside MapReduce over HDFS).]


Spark Deployment
• Standalone − In a Standalone deployment, Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − In a Hadoop YARN deployment, Spark simply runs on YARN without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to a standalone deployment. With SIMR, users can start Spark and use its shell without any administrative access. (A configuration sketch showing how the mode is selected follows below.)
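In practice the deployment mode is usually chosen through the master URL. The following Scala sketch is an illustration only: the standalone host name and port are placeholders, and running against YARN would additionally require a Hadoop configuration on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// The value passed to .master() selects where the application runs.
// The host name and port below are placeholders, not a real cluster.
val spark = SparkSession.builder()
  .appName("DeploymentDemo")
  .master("local[*]")                     // run locally in a single JVM
  // .master("spark://master-host:7077")  // Spark Standalone cluster
  // .master("yarn")                      // Hadoop YARN (needs Hadoop config available)
  .getOrCreate()

println(s"Running with master: ${spark.sparkContext.master}")
spark.stop()
```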
Spark Eco System

[Diagram: Spark SQL, Spark Streaming, MLlib (Machine Learning), and GraphX sit on top of the Apache Spark Core API, with language bindings for R, Scala, SQL, Python, and Java.]


Apache SPARK Eco-System
Apache Spark Core
• All the functionality provided by Apache Spark is built on top of Spark Core. It delivers speed by providing in-memory computation capability.
• Spark Core is thus the foundation of parallel and distributed processing of huge datasets.
The key features of Apache Spark Core are:
• It is in charge of essential I/O functionalities.
• It plays a significant role in programming and monitoring the Spark cluster.
• Task dispatching.
• Fault recovery.
Apache SPARK Eco-System
Apache Spark Core
• It overcomes the drawbacks of MapReduce by using in-memory computation.
• Spark Core is built around a special collection called the RDD (Resilient Distributed Dataset). The RDD is one of Spark's key abstractions: it partitions the data across all the nodes in a cluster and holds the partitions in the memory pool of the cluster as a single logical unit. Two kinds of operations are performed on RDDs: transformations and actions.
• Transformation: a function that produces a new RDD from existing RDDs. Transformations are lazy; they only describe the new RDD.
• Action: transformations only create RDDs from one another; when we want to work with the actual dataset, we use an action, which triggers the computation (see the sketch below).
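As a minimal Scala sketch (assuming a local SparkContext; the data and names are illustrative only), the following shows transformations being declared lazily and actions triggering the actual computation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RDDDemo").setMaster("local[*]"))

val numbers = sc.parallelize(1 to 10)    // create an RDD from a local collection

// Transformations: produce new RDDs lazily; nothing is computed yet.
val squares = numbers.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// Actions: work with the actual dataset and trigger the computation.
println(evens.count())                   // 5
println(evens.collect().toList)          // List(4, 16, 36, 64, 100)

sc.stop()
```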
Spark SQL
• The Spark SQL component is a distributed framework for structured data processing. With Spark SQL, Spark gets more information about the structure of the data and of the computation, and it can use this information to perform extra optimization.
• It uses the same execution engine while computing a result, regardless of the API or language used to express the computation.
• Spark SQL is used to access structured and semi-structured data. It also enables powerful, interactive, analytical applications across both streaming and historical data.
• Spark SQL is the Spark module for structured data processing; it therefore also acts as a distributed SQL query engine.
Spark SQL
Features of Spark SQL include:
• Cost-based optimizer.
• Mid-query fault tolerance: the Spark engine scales to thousands of nodes and multi-hour queries and can recover from failures mid-query.
• Full compatibility with existing Hive data.
• DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
• Provision to query structured data inside Spark programs, using either SQL or the familiar DataFrame API (see the sketch below).
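A minimal Scala sketch of the two equivalent ways of querying. The file name "people.json" and its fields (name, age) are hypothetical placeholders, not part of the slides.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLDemo")
  .master("local[*]")
  .getOrCreate()

// "people.json" is a hypothetical file with one JSON object per line,
// e.g. {"name": "Ann", "age": 34}.
val people = spark.read.json("people.json")

// The same query expressed through the DataFrame API ...
people.filter(people("age") > 30).select("name").show()

// ... and through SQL against a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```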
Spark SQL
• At the heart of Spark SQL lies the Catalyst optimizer. Catalyst uses advanced programming language features that allow you to build an extensible query optimizer. This new extensible optimizer, called Catalyst, emerged to implement Spark SQL and is based on functional programming constructs in Scala.
Catalyst supports both rule-based and cost-based optimization. In rule-based optimization, the optimizer uses a set of rules to determine how to execute the query, while cost-based optimization finds the most suitable way to carry out a SQL statement: multiple plans are generated using the rules and then their costs are computed and compared.
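The plans Catalyst produces can be inspected with explain(). A minimal Scala sketch (local session, made-up data) that prints the parsed and analyzed logical plans, the optimized logical plan, and the chosen physical plan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CatalystDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = (1 to 100).toDF("id")

// explain(true) prints the parsed and analyzed logical plans, the
// optimized logical plan produced by Catalyst, and the physical plan.
df.filter($"id" > 50).select($"id" * 2).explain(true)

spark.stop()
```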
