
What is Spark?

Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle real-time generated data.
Spark was built on top of Hadoop MapReduce. It is optimized to run in memory, whereas alternative approaches such as Hadoop's MapReduce write data to and from computer hard drives. As a result, Spark processes data much faster than those alternatives.

History of Apache Spark


Spark was initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009 and was open sourced in 2010 under a BSD license.
In 2013, the project was donated to the Apache Software Foundation. In 2014, Spark became a Top-Level Apache Project.


Features of Apache Spark


•Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

•Easy to Use - It lets you write applications in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.

•Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

•Lightweight - It is a light unified analytics engine used for large-scale data processing.

•Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

Uses of Spark
•Data integration: The data generated by different systems is rarely consistent enough to combine for analysis. To fetch consistent data from these systems, we can use processes such as Extract, Transform, and Load (ETL). Spark is used to reduce the cost and time required for this ETL process.

•Stream processing: It is always difficult to handle real-time generated data such as log files. Spark can process streams of data and reject potentially fraudulent operations.

•Machine learning: Machine learning approaches become more feasible and increasingly accurate as the volume of data grows. Because Spark can store data in memory and run repeated queries on it quickly, working with machine learning algorithms becomes easy.

•Interactive analytics: Spark is able to generate responses rapidly. So, instead of running only pre-defined queries, we can handle the data interactively.
Spark Architecture
Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.
The Spark architecture depends upon two abstractions:

•Resilient Distributed Dataset (RDD)

•Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)


Resilient Distributed Datasets are groups of data items that can be stored in memory on worker nodes. Here,

•Resilient: Restore the data on failure.

•Distributed: Data is distributed among different nodes.

•Dataset: Group of data.


We will learn about RDD later in detail.


Directed Acyclic Graph (DAG)


A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations performed on the data. Each node is an RDD partition, and each edge is a transformation applied on top of the data. Here, "directed" means that each edge points from one computation to the next, and "acyclic" means that the sequence of computations never loops back on itself.
Let's understand the Spark architecture.
Driver Program
The Driver Program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark applications, which run as independent sets of processes on a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:

•It acquires executors on nodes in the cluster.

•Then, it sends your application code to the executors. Here, the


application code can be defined by JAR or Python files passed to the
SparkContext.

•At last, the SparkContext sends tasks to the executors to run.
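
As a minimal sketch (not part of the original tutorial), this is roughly how a standalone driver program might create its SparkContext; the application name and the local[*] master URL are illustrative assumptions for a local run:

1. import org.apache.spark.{SparkConf, SparkContext}
2. object DriverSketch {
3.   def main(args: Array[String]): Unit = {
4.     // Hypothetical app name and master URL; on a real cluster the master
5.     // would point at YARN, Mesos, Kubernetes or a standalone master.
6.     val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
7.     val sc = new SparkContext(conf)
8.     // The driver uses the SparkContext to create RDDs and submit jobs.
9.     println(sc.parallelize(1 to 10).sum())
10.     sc.stop()
11.   }
12. }

In the spark-shell used throughout this tutorial this step is not needed, because the shell already creates the SparkContext and exposes it as sc.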

Cluster Manager
•The role of the cluster manager is to allocate resources across applications. Spark can run on a large number of clusters.

•There are various types of cluster managers such as Hadoop YARN, Apache Mesos and the Standalone Scheduler.

•Here, the Standalone Scheduler is Spark's own cluster manager, which makes it easy to install Spark on an empty set of machines.

Worker Node
•The worker node is a slave node.

•Its role is to run the application code in the cluster.

Executor
•An executor is a process launched for an application on a worker node.

•It runs tasks and keeps data in memory or disk storage across them.

•It reads and writes data to external sources.

•Every application has its own executors.

Task
•A unit of work that will be sent to one executor.

Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
•The Spark Core is the heart of Spark and performs the core functionality.

•It holds the components for task scheduling, fault recovery, interacting
with storage systems and memory management.

Spark SQL
•Spark SQL is built on top of Spark Core. It provides support for structured data.

•It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).

•It supports JDBC and ODBC connections that establish a relation between
Java objects and existing databases, data warehouses and business
intelligence tools.

•It also supports various sources of data like Hive tables, Parquet, and
JSON.
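
As a brief sketch (assuming a Spark 2.x spark-shell, where a SparkSession is predefined as spark, and a hypothetical JSON file people.json that has a name field), querying structured data could look like this:

1. scala> val df = spark.read.json("people.json")
2. scala> df.createOrReplaceTempView("people")
3. scala> spark.sql("SELECT name FROM people").show()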

Spark Streaming
•Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.

•It uses Spark Core's fast scheduling capability to perform streaming


analytics.

•It accepts data in mini-batches and performs RDD transformations on


that data.
•Its design ensures that the applications written for streaming data can
be reused to analyze batches of historical data with little modification.

•The log files generated by web servers can be considered as a real-time


example of a data stream.
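
A minimal sketch (not from the original tutorial) of a streaming word count over mini-batches; the localhost socket on port 9999 is an illustrative assumption:

1. scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
2. scala> val ssc = new StreamingContext(sc, Seconds(10))   // 10-second mini-batches
3. scala> val lines = ssc.socketTextStream("localhost", 9999)
4. scala> lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
5. scala> ssc.start()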

MLlib
•The MLlib is a Machine Learning library that contains various machine
learning algorithms.

•These include correlations and hypothesis testing, classification and


regression, clustering, and principal component analysis.

•It is nine times faster than the disk-based implementation used by


Apache Mahout.
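
As an illustrative sketch, one of MLlib's simpler statistics utilities can be tried directly in the shell; the two small number series below are made-up values:

1. scala> import org.apache.spark.mllib.stat.Statistics
2. scala> val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
3. scala> val y = sc.parallelize(Seq(2.0, 4.0, 6.0, 8.0))
4. scala> Statistics.corr(x, y, "pearson")   // Pearson correlation of the two series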

GraphX
•GraphX is a library used to manipulate graphs and perform graph-parallel computations.

•It makes it easy to create a directed graph with arbitrary properties attached to each vertex and edge.

•To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices, and aggregateMessages.
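
A small sketch of building a property graph in the shell; the vertex names and the edge label are arbitrary example values:

1. scala> import org.apache.spark.graphx._
2. scala> val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob")))
3. scala> val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
4. scala> val graph = Graph(vertices, edges)
5. scala> graph.numEdges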

What is RDD?
The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements partitioned across the nodes of the cluster so that we can execute various parallel operations on it.
There are two ways to create RDDs:

•Parallelizing an existing collection in the driver program

•Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is copied to form a distributed dataset that can be operated on in parallel.

1. val info = Array(1, 2, 3, 4)


2. val distinfo = sc.parallelize(info)
Now, we can operate on the distributed dataset (distinfo) in parallel, for example distinfo.reduce((a, b) => a + b).


External Datasets
In Spark, distributed datasets can be created from any type of storage source supported by Hadoop, such as HDFS, Cassandra, HBase and even our local file system. Spark provides support for text files, SequenceFiles, and other types of Hadoop InputFormat.
SparkContext's textFile method can be used to create an RDD from a text file. This method takes a URI for the file (either a local path on the machine or an hdfs:// URI) and reads the data of the file.
Once created, the dataset can be used with dataset operations; for example, we can add up the sizes of all the lines using the map and reduce operations, as in the sketch below.
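
For example, as a minimal sketch (data.txt is a placeholder file name; any readable text file works):

1. scala> val data = sc.textFile("data.txt")
2. scala> data.map(s => s.length).reduce((a, b) => a + b)   // sums the lengths of all the lines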

RDD Operations
RDDs provide two types of operations:

•Transformation

•Action

Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program.
Let's see some of the frequently used RDD Transformations.

•map(func) - It returns a new distributed dataset formed by passing each element of the source through a function func.

•filter(func) - It returns a new dataset formed by selecting those elements of the source on which func returns true.

•flatMap(func) - It is similar to map, but each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item.

•mapPartitions(func) - It is similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

•mapPartitionsWithIndex(func) - It is similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

•sample(withReplacement, fraction, seed) - It samples a fraction fraction of the data, with or without replacement, using a given random number generator seed.

•union(otherDataset) - It returns a new dataset that contains the union of the elements in the source dataset and the argument.

•intersection(otherDataset) - It returns a new RDD that contains the intersection of elements in the source dataset and the argument.

•distinct([numPartitions]) - It returns a new dataset that contains the distinct elements of the source dataset.

•groupByKey([numPartitions]) - When called on a dataset of (K, V) pairs, it returns a dataset of (K, Iterable<V>) pairs.

•reduceByKey(func, [numPartitions]) - When called on a dataset of (K, V) pairs, it returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

•aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) - When called on a dataset of (K, V) pairs, it returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value.

•sortByKey([ascending], [numPartitions]) - It returns a dataset of key-value pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

•join(otherDataset, [numPartitions]) - When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

•cogroup(otherDataset, [numPartitions]) - When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

•cartesian(otherDataset) - When called on datasets of types T and U, it returns a dataset of (T, U) pairs (all pairs of elements).

•pipe(command, [envVars]) - It pipes each partition of the RDD through a shell command, e.g. a Perl or bash script.

•coalesce(numPartitions) - It decreases the number of partitions in the RDD to numPartitions.

•repartition(numPartitions) - It reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them.

•repartitionAndSortWithinPartitions(partitioner) - It repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.
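
To make the lazy behaviour concrete, here is a short shell sketch (the numbers are arbitrary) that chains two transformations; nothing is computed until the collect action in the last line:

1. scala> val nums = sc.parallelize(1 to 10)
2. scala> val result = nums.filter(_ % 2 == 0).map(_ * 10)   // transformations only, still lazy
3. scala> result.collect()   // the action triggers the actual computation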

Action
In Spark, the role of action is to return a value to the driver program after
running a computation on the dataset.


Let's see some of the frequently used RDD Actions.

•reduce(func) - It aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

•collect() - It returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

•count() - It returns the number of elements in the dataset.

•first() - It returns the first element of the dataset (similar to take(1)).

•take(n) - It returns an array with the first n elements of the dataset.

•takeSample(withReplacement, num, [seed]) - It returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

•takeOrdered(n, [ordering]) - It returns the first n elements of the RDD using either their natural order or a custom comparator.

•saveAsTextFile(path) - It is used to write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

•saveAsSequenceFile(path) (Java and Scala) - It is used to write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system.

•saveAsObjectFile(path) (Java and Scala) - It is used to write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

•countByKey() - It is only available on RDDs of type (K, V). It returns a hashmap of (K, Int) pairs with the count of each key.

•foreach(func) - It runs a function func on each element of the dataset for side effects such as updating an Accumulator or interacting with external storage systems.
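
A short sketch of a few of these actions in the shell, using an arbitrary key-value dataset:

1. scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
2. scala> pairs.count()        // number of elements: 3
3. scala> pairs.countByKey()   // Map(a -> 2, b -> 1)
4. scala> pairs.reduceByKey(_ + _).collect()   // reduceByKey is a transformation; collect returns Array((a,4), (b,2))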
RDD Persistence
Spark provides a convenient way to work on a dataset by persisting it in memory across operations. When an RDD is persisted, each node stores any partitions of it that it computes in memory and reuses them in other tasks on that dataset.
We can use either the persist() or the cache() method to mark an RDD to be persisted. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Different storage levels are available for storing persisted RDDs. Use these levels by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method uses the default storage level, which is StorageLevel.MEMORY_ONLY.
The following are the available storage levels (a short usage sketch follows the list):


•MEMORY_ONLY - It stores the RDD as deserialized Java objects in the JVM. This is the default level. If the RDD doesn't fit in memory, some partitions will not be cached and will be recomputed each time they're needed.

•MEMORY_AND_DISK - It stores the RDD as deserialized Java objects in the JVM. If the RDD doesn't fit in memory, the partitions that don't fit are stored on disk and read from there when they're needed.

•MEMORY_ONLY_SER (Java and Scala) - It stores the RDD as serialized Java objects (i.e. one byte array per partition). This is generally more space-efficient than deserialized objects.

•MEMORY_AND_DISK_SER (Java and Scala) - It is similar to MEMORY_ONLY_SER, but spills partitions that don't fit in memory to disk instead of recomputing them.

•DISK_ONLY - It stores the RDD partitions only on disk.

•MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - These are the same as the levels above, but replicate each partition on two cluster nodes.

•OFF_HEAP (experimental) - It is similar to MEMORY_ONLY_SER, but stores the data in off-heap memory. Off-heap memory must be enabled.
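
As a usage sketch (sparkdata.txt is just a placeholder file name here), persisting an RDD with an explicit storage level looks like this:

1. scala> import org.apache.spark.storage.StorageLevel
2. scala> val lines = sc.textFile("sparkdata.txt")
3. scala> lines.persist(StorageLevel.MEMORY_AND_DISK)
4. scala> lines.count()   // the first action computes and caches the partitions
5. scala> lines.count()   // later actions reuse the persisted partitions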

RDD Shared Variables


In Spark, when any function is passed to a transformation operation, it is executed on a remote cluster node. The function works on separate copies of all the variables used in it. These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program.

Broadcast variable
Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Spark uses efficient broadcast algorithms to distribute broadcast variables in order to reduce communication cost.
The execution of spark actions passes through several stages, separated by
distributed "shuffle" operations. Spark automatically broadcasts the common
data required by tasks within each stage. The data broadcasted this way is
cached in serialized form and deserialized before running each task.
To create a broadcast variable (let's say, v), call SparkContext.broadcast(v). Let's understand this with an example.


1. scala> val v = sc.broadcast(Array(1, 2, 3))


2. scala> v.value
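
The broadcast value can then be read inside tasks through its value field; for example, as a small sketch:

1. scala> sc.parallelize(Array(10, 20, 30)).map(x => x + v.value.sum).collect()   // adds 6, the sum of the broadcast array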
Accumulator
Accumulators are variables that are used to perform associative and commutative operations such as counters or sums. Spark provides support for accumulators of numeric types, and we can also add support for new types.
To create a numeric accumulator, call SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of Long or Double type.

1. scala> val a=sc.longAccumulator("Accumulator")


2. scala> sc.parallelize(Array(2,5)).foreach(x=>a.add(x))
3. scala> a.value

Spark Map function


In Spark, the Map passes each element of the source through a function and
forms a new distributed dataset.

Example of Map function


In this example, we add a constant value 10 to each element.

•To open the spark in Scala mode, follow the below command
1. $ spark-shell

•Create an RDD using parallelized collection.

1. scala> val data = sc.parallelize(List(10,20,30))


•Now, we can read the generated result by using the following command.

1. scala> data.collect

•Apply the map function and pass the expression required to perform.

1. scala> val mapfunc = data.map(x => x+10)


•Now, we can read the generated result by using the following command.
1. scala> mapfunc.collect

Here, we got the desired output.

Spark Filter Function


In Spark, the Filter function returns a new dataset formed by selecting those
elements of the source on which the function returns true. So, it retrieves only
the elements that satisfy the given condition.

Example of Filter function


In this example, we filter the given data and retrieve all the values except 35.

•To open the spark in Scala mode, follow the below command.

1. $ spark-shell
•Create an RDD using parallelized collection.

1. scala> val data = sc.parallelize(List(10,20,35,40))


•Now, we can read the generated result by using the following command.

1. scala> data.collect

•Apply filter function and pass the expression required to perform.

1. scala> val filterfunc = data.filter(x => x!=35)


•Now, we can read the generated result by using the following command.

1. scala> filterfunc.collect
Here, we got the desired output.

Spark Count Function


In Spark, the Count function returns the number of elements present in the
dataset.

Example of Count function


In this example, we count the number of elements that exist in the dataset.

•Create an RDD using parallelized collection.

1. scala> val data = sc.parallelize(List(1,2,3,4,5))


•Now, we can read the generated result by using the following command.

1. scala> data.collect
•Apply count() function to count number of elements.

1. scala> val countfunc = data.count()

Here, we got the desired output.

Spark Distinct Function


In Spark, the Distinct function returns the distinct elements from the provided
dataset.

Example of Distinct function


In this example, we ignore the duplicate elements and retrieve only the distinct elements.

•To open the spark in Scala mode, follow the below command.

1. $ spark-shell
•Create an RDD using parallelized collection.

1. scala> val data = sc.parallelize(List(10,20,20,40))


•Now, we can read the generated result by using the following command.

1. scala> data.collect

•Apply distinct() function to ignore duplicate elements.

1. scala> val distinctfunc = data.distinct()


•Now, we can read the generated result by using the following command.

1. scala> distinctfunc.collect
Here, we got the desired output.

Spark Union Function


In Spark, Union function returns a new dataset that contains the combination of
elements present in the different datasets.

Example of Union function


In this example, we combine the elements of two datasets.

•To open the spark in Scala mode, follow the below command.

1. $ spark-shell
•Create an RDD using parallelized collection.

1. scala> val data1 = sc.parallelize(List(1,2))


•Now, we can read the generated result by using the following command.

1. scala> data1.collect

•Create another RDD using parallelized collection.

1. scala> val data2 = sc.parallelize(List(3,4,5))


•Now, we can read the generated result by using the following command.

1. scala> data2.collect
•Apply union() function to return the union of the elements.

1. scala> val unionfunc = data1.union(data2)


•Now, we can read the generated result by using the following command.

1. scala> unionfunc.collect

Here, we got the desired output.

Spark Intersection Function


In Spark, the Intersection function returns a new dataset that contains the intersection of the elements present in the different datasets. So, it returns only the rows that are common to both datasets. This function behaves just like the INTERSECT query in SQL.
Example of Intersection function
In this example, we intersect the elements of two datasets.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell

•Create an RDD using the parallelized collection.

1. scala> val data1 = sc.parallelize(List(1,2,3))


•Now, we can read the generated result by using the following command.

1. scala> data1.collect

•Create another RDD using parallelized collection.

1. scala> val data2 = sc.parallelize(List(3,4,5))


•Now, we can read the generated result by using the following command.

1. scala> data2.collect

•Apply intersection() function to return the intersection of the elements.

1. scala> val intersectfunc = data1.intersection(data2)


•Now, we can read the generated result by using the following command.

1. scala> intersectfunc.collect

Here, we got the desired output.


Spark Cartesian Function
In Spark, the Cartesian function generates a Cartesian product of two datasets
and returns all the possible combination of pairs. Here, each element of one
dataset is paired with each element of another dataset.

Example of Cartesian function


In this example, we generate a Cartesian product of two datasets.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell

•Create an RDD using the parallelized collection.

1. scala> val data1 = sc.parallelize(List(1,2,3))


•Now, we can read the generated result by using the following command.

1. scala> data1.collect
•Create another RDD using the parallelized collection.

1. scala> val data2 = sc.parallelize(List(3,4,5))


•Now, we can read the generated result by using the following command.

1. scala> data2.collect

•Apply cartesian() function to return the Cartesian product of the


elements.

1. scala> val cartesianfunc = data1.cartesian(data2)


•Now, we can read the generated result by using the following command.

1. scala> cartesianfunc.collect
Here, we got the desired output.

Spark sortByKey Function


In Spark, the sortByKey function sorts a dataset by its keys. It receives key-value pairs (K, V) as input, sorts the elements in ascending or descending key order, and generates an ordered dataset.

Example of sortByKey Function


In this example, we arrange the elements of dataset in ascending and
descending order.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell
•Create an RDD using the parallelized collection.

1. scala> val data = sc.parallelize(Seq(("C",3),("A",1),("D",4),("B",2),("E",5)))


Now, we can read the generated result by using the following command.

1. scala> data.collect

For ascending,


•Apply the sortByKey() function to sort the elements in ascending order.


1. scala> val sortfunc = data.sortByKey()
•Now, we can read the generated result by using the following command.

1. scala> sortfunc.collect

Here, we got the desired output.


For descending,

•Apply the sortByKey() function and pass false as the parameter to sort in descending order.

1. scala> val sortfunc = data.sortByKey(false)


•Now, we can read the generated result by using the following command.

1. scala> sortfunc.collect

Here, we got the desired output.


Spark groupByKey Function
In Spark, the groupByKey function is a frequently used transformation operation that performs shuffling of data. It receives key-value pairs (K, V) as input, groups the values based on the key, and generates a dataset of (K, Iterable<V>) pairs as output.

Example of groupByKey Function


In this example, we group the values based on the key.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell

•Create an RDD using the parallelized collection.

1. scala> val data = sc.parallelize(Seq(("C",3),("A",1),("B",4),("A",2),("B",5)))


Now, we can read the generated result by using the following command.

1. scala> data.collect
•Apply groupByKey() function to group the values.

1. scala> val groupfunc = data.groupByKey()


•Now, we can read the generated result by using the following command.

1. scala> groupfunc.collect

Here, we got the desired output.

Spark reduceByKey Function


In Spark, the reduceByKey function is a frequently used transformation
operation that performs aggregation of data. It receives key-value pairs (K, V)
as an input, aggregates the values based on the key and generates a dataset
of (K, V) pairs as an output.

Example of reduceByKey Function


In this example, we aggregate the values on the basis of key.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell

•Create an RDD using the parallelized collection.

1. scala> val data = sc.parallelize(Array(("C",3),("A",1),("B",4),("A",2),("B",5)))


Now, we can read the generated result by using the following command.

1. scala> data.collect
•Apply reduceByKey() function to aggregate the values.

1. scala> val reducefunc = data.reduceByKey((value, x) => (value + x))


•Now, we can read the generated result by using the following command.

1. scala> reducefunc.collect

Here, we got the desired output.

Spark cogroup Function
In Spark, the cogroup function operates on different datasets, let's say (K, V) and (K, W), and returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also known as groupWith.

Example of cogroup Function


In this example, we perform the groupWith operation.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell

•Create an RDD using the parallelized collection.

1. scala> val data1 = sc.parallelize(Seq(("A",1),("B",2),("C",3)))


Now, we can read the generated result by using the following command.

1. scala> data1.collect
•Create another RDD using the parallelized collection.

1. scala> val data2 = sc.parallelize(Seq(("B",4),("E",5)))


Now, we can read the generated result by using the following command.


1. scala> data2.collect

•Apply cogroup() function to group the values.

1. scala> val cogroupfunc = data1.cogroup(data2)


•Now, we can read the generated result by using the following command.

1. scala> cogroupfunc.collect
Here, we got the desired output.

Spark First Function


In Spark, the First function always returns the first element of the dataset. It is
similar to take(1).

Example of First function


In this example, we retrieve the first element of the dataset.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell
•Create an RDD using the parallelized collection.

1. scala> val data = sc.parallelize(List(10,20,30,40,50))


•Now, we can read the generated result by using the following command.

1. scala> data.collect

•Apply first() function to retrieve the first element of the dataset.

1. scala> val firstfunc = data.first()


Here, we got the desired output.

Spark Take Function


In Spark, the take function returns an array of elements. It receives an integer value (let's say, n) as a parameter and returns an array of the first n elements of the dataset.

Example of Take function


In this example, we return the first n elements of an existing dataset.

•To open the Spark in Scala mode, follow the below command.

1. $ spark-shell
•Create an RDD using the parallelized collection.

1. scala> val data = sc.parallelize(List(10,20,30,40,50))


•Now, we can read the generated result by using the following command.

1. scala> data.collect

•Apply take() function to return an array of elements.

1. scala> val takefunc = data.take(3)


Here, we got the desired output.
Spark Word Count Example
In the Spark word count example, we find out the frequency of each word that exists in a particular file. Here, we use the Scala language to perform Spark operations.

Steps to execute Spark word count example


In this example, we find and display the number of occurrences of each word.

•Create a text file in your local machine and write some text into it.

1. $ nano sparkdata.txt

•Check the text written in the sparkdata.txt file.

1. $ cat sparkdata.txt
•Create a directory in HDFS where the text file will be kept.

1. $ hdfs dfs -mkdir /spark


•Upload the sparkdata.txt file on HDFS in the specific directory.

1. $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark


•Now, follow the below command to open the spark in Scala mode.

1. $ spark-shell
•Let's create an RDD by using the following command.

1. scala> val data=sc.textFile("sparkdata.txt")


Here, pass any file name that contains the data.

•Now, we can read the generated result by using the following command.

1. scala> data.collect;
•Here, we split the existing data in the form of individual words by using
the following command.

1. scala> val splitdata = data.flatMap(line => line.split(" "));


•Now, we can read the generated result by using the following command.

1. scala> splitdata.collect;

•Now, perform the map operation.

1. scala> val mapdata = splitdata.map(word => (word,1));


Here, we are assigning a value 1 to each word.


•Now, we can read the generated result by using the following command.

1. scala> mapdata.collect;
•Now, perform the reduce operation

1. scala> val reducedata = mapdata.reduceByKey(_+_);


Here, we are summarizing the generated data.

•Now, we can read the generated result by using the following command.

1. scala> reducedata.collect;

Here, we got the desired output.


Spark Char Count Example
In the Spark char count example, we find out the frequency of each character that exists in a particular file. Here, we use the Scala language to perform Spark operations.

Steps to execute Spark char count example


In this example, we find and display the number of occurrences of each
character.

•Create a text file in your local machine and write some text into it.

1. $ nano sparkdata.txt

•Check the text written in the sparkdata.txt file.

1. $ cat sparkdata.txt
•Create a directory in HDFS where the text file will be kept.

1. $ hdfs dfs -mkdir /spark


•Upload the sparkdata.txt file on HDFS in the specific directory.

1. $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark


•Now, follow the below command to open the spark in Scala mode.

1. $ spark-shell
•Let's create an RDD by using the following command.

1. scala> val data=sc.textFile("sparkdata.txt");


Here, pass any file name that contains the data.

•Now, we can read the generated result by using the following command.

1. scala> data.collect;
•Here, we split the existing data in the form of individual words by using
the following command.

1. scala> val splitdata = data.flatMap(line => line.split(""));


•Now, we can read the generated result by using the following command.

1. scala> splitdata.collect;

•Now, perform the map operation.

1. scala> val mapdata = splitdata.map(word => (word,1));


Here, we are assigning a value 1 to each character.


•Now, we can read the generated result by using the following command.

1. scala> mapdata.collect;
•Now, perform the reduce operation

1. scala> val reducedata = mapdata.reduceByKey(_+_);


Here, we are summarizing the generated data.

•Now, we can read the generated result by using the following command.

1. scala> reducedata.collect;

Here, we got the desired output.
