
Spark Basics

Jesús Montes
jesus.montes@upm.es

Oct. 2022
Spark Basics
Apache Spark
● Defined by its creators as a “fast, general purpose cluster computing
system”.
● Centered on a data structure called resilient distributed dataset (RDD)
● Based on principles similar to MapReduce, but allowing a more flexible
dataflow structure.
● Uses in-memory computation to optimize performance.
● Nowadays, it is the most popular MapReduce successor.
● Its current stable version is 3.3.0 (released June 2022).

Big Data Technologies 2


Apache Spark
Spark’s authors claim it is fast...

● Extends the MapReduce model


● Provides tools for interactive data exploration
● Takes advantage of in-memory distributed computation

… and general purpose

● Different types of workloads: batch, interactive, iterative, streaming…


● Multiple APIs: Scala, Java, Python, SQL, R
● Compatible with Hadoop

Big Data Technologies 3


Spark Environment and Components
Spark Environment:

● Spark can run as a standalone, local application or in a cluster, on top of
YARN, Kubernetes, Mesos...
● Spark can read and write data from many storage technologies
○ Local filesystem
○ HDFS
○ Hive
○ HBase
○ Cassandra
○ ...

Spark Components:

● Spark Core: The base component
● Spark SQL: High-level structured-data processing
● MLlib: Machine learning
● GraphX: Graph processing
● Spark Streaming: Stream processing

http://spark.apache.org/docs/latest/

Spark Basics 4
Spark Stack

Spark Basics 5
Spark Core
It contains the basic functionality of Spark

● Task scheduling
● Memory management
● Fault recovery
● Storage-layer interaction
● Cluster resource manager interaction (YARN/Kubernetes/Mesos)
● Resilient Distributed Datasets (RDDs) API

Spark Core makes it possible to create and manage Spark Applications.

Spark Basics 6
Spark Application
All Spark applications have the same components

● Driver: This is the central component that controls the application
execution. It contains a Spark Context that controls execution parameters.
● Cluster manager: The component in charge of organizing the application
execution, either directly (standalone) or through an external application
manager (YARN/Kubernetes/Mesos).
● Executors: They perform the data access and computation on the worker
nodes.

Spark Basics 7
Spark Application
[Diagram: the Spark Driver (which holds the Spark Context) communicates with
the Cluster Manager, which launches Executors on the Worker Nodes; each
Executor runs Tasks.]
Spark Basics 8
Spark Context
A Spark Context represents a connection to a computing cluster:

● Master node URL


● Application name and other application parameters
● Driver resources (jar and/or py files)

Using a Spark Context, the application can create Spark objects, like RDDs,
accumulators and broadcast variables.

A Spark Context is usually represented by an object of the SparkContext class,
and it can be configured using the SparkConf class.
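
As an illustration (a minimal sketch for a standalone Scala application rather
than the shell, where sc already exists; the application name and master URL
below are just placeholders), a Spark Context can be configured and created
like this:

import org.apache.spark.{SparkConf, SparkContext}

// Configure the application (name and master URL are placeholder values)
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[2]")   // run locally with 2 threads

// Create the Spark Context: the connection to the (local or remote) cluster
val sc = new SparkContext(conf)

// ... create RDDs, accumulators and broadcast variables through sc ...

sc.stop()   // release the resources when the application finishes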

Spark Basics 9
Resilient Distributed Dataset (RDD)
An RDD is a distributed collection of items.

● RDDs can be created...
○ … from local application objects (parallelize),
○ … from Hadoop InputFormats (such as HDFS files),
○ … or by transforming other RDDs.
● RDDs have...
○ … actions, which return values,...
○ … and transformations, which return pointers to new RDDs.
● RDDs...
○ … are immutable,
○ … are lazy-evaluated,
○ … are fault-tolerant,
○ … and provide persistence and partitioning control.

In principle, an RDD can contain any type of object. In practice, only objects
that can be serialized can be used. Otherwise the RDD could not be distributed
throughout the cluster.

Spark Basics 10
Resilient Distributed Dataset (RDD)
● RDDs are managed by RDD objects.
● An RDD object contains references to Partition objects.
● Each Partition object corresponds to a subset of RDD data.
● Partitions are assigned to nodes in the cluster.
● By default, each partition will be stored in memory.

[Diagram: each Partition of the RDD is assigned to a cluster node and kept in a
memory partition of that node.]
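
A small sketch of partitioning control in the Spark Shell (the partition counts
below are arbitrary examples):

// Create an RDD explicitly split into 4 partitions
val rdd = sc.parallelize(1 to 100, 4)
rdd.getNumPartitions        // Int = 4

// Redistribute the data into 8 partitions (causes a shuffle)
val rdd8 = rdd.repartition(8)
rdd8.getNumPartitions       // Int = 8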

Spark Basics 11
Resilient Distributed Dataset (RDD)

Spark Basics 12
Transformations and Actions
Transformations:

● RDDs are immutable, so the only way to transform their content is by
generating a new RDD.
● Transformations are not executed when declared. Instead, they are
registered by the Spark Driver and organized into an execution plan.
● Examples of transformations: filter the values of an RDD, group values by
some criteria.

Actions:

● Actions compute a result based on the contents of an RDD. The result can
be returned to the Driver or stored (in a local file, HDFS, etc.).
● When an action is requested, the Driver performs the execution plan,
computing the target RDD contents and applying the indicated action.
● Examples of actions: output the RDD contents (totally or partially), count
the elements in an RDD.
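
For example (a sketch to run in the Spark Shell), a filter transformation is
only registered in the execution plan; the computation happens when the count
action is requested:

val numbers = sc.parallelize(1 to 1000)

// Transformation: nothing is computed yet, only the plan is recorded
val evens = numbers.filter(_ % 2 == 0)

// Action: the Driver now executes the plan and returns the result
evens.count()               // Long = 500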

Spark Basics 13
Transformations and Actions
[Diagram: an RDD is produced by a creation operation or by transforming another
RDD; the chain of these steps forms the RDD's lineage, and applying an action
to the RDD produces a result.]

Spark Basics 14
The Spark Shell
Spark can be used interactively through the Spark Shell

● It is a typical REPL (read-eval-print loop) application.
● It is launched with the spark-shell script located inside the bin directory of
the Spark installation.
● The default Spark Shell is, in fact, a standard Scala interpreter with several
tweaks (pre-loaded libraries, etc.).
● It contains a Spark Context already initialized (in the sc object).
● Spark also provides shell versions for Python (pyspark), R (sparkR) and SQL
(spark-sql).

Spark Basics 15
Spark APIs
● Scala
○ The most complete one (after all, Spark is written in Scala)
○ The default language of the Spark Shell
● Java and Python
○ Almost as complete as the Scala API
○ Some minor limitations/difficulties, but perfectly functional
● R
○ Only available at a DataFrame level (no direct access to RDDs)
○ Limited, but functional for basic/medium-complexity operations
○ Greatly improved in Spark 2.x, with a promising future

http://spark.apache.org/docs/latest/

Spark Basics 16
Let’s try Spark
1. First, start the Spark Shell:
# spark-shell
This will start the Scala interpreter and initialize a Spark Context. After a
short while, you will see a scala prompt (scala>).
2. Type the example code below in the shell. Make sure to run one line at a
time, and check the results of each one of them.

val nums = Array((1 to 50):_*)
val oneRDD = sc.parallelize(nums)
oneRDD.count

val otherRDD = oneRDD.map(_ * 2)
val result = otherRDD.collect
for (v <- result) println(v)

val sum = oneRDD.reduce(_ + _)
print(sum)

Spark Basics 17
Let’s try PySpark
1. First, start the PySpark Shell:
# pyspark
This will start the Python interpreter and initialize a Spark Context. After a
short while, you will see a Python prompt (>>>).
2. Type the example code below in the shell. Make sure to run one line at a
time, and check the results of each one of them.

nums = list(range(1,51))
oneRDD = sc.parallelize(nums)
oneRDD.count()

otherRDD = oneRDD.map(lambda x: x * 2)
result = otherRDD.collect()
for v in result:
    print(v)

sum = oneRDD.reduce(lambda x,y: x+y)
print(sum)

Spark Basics 18
Creating/Loading/Saving RDDs
● Parallelize an existing collection
val myRDD1 = sc.parallelize(Array('a','b','c'))
● Load from file
val myRDD2 = sc.textFile("file:///tmp/98-0.txt")        (only in local mode)
val myRDD3 = sc.textFile("hdfs:///tmp/book/98-0.txt")   (only if HDFS is available)
● Save to file
oneRDD.saveAsTextFile("/tmp/oneRDD")
otherRDD.saveAsObjectFile("/tmp/otherRDD")
● Different sources, destinations and data formats are supported.

Spark Basics 19
RDD transformations
● Most RDD transformations are based on passing functions as parameters
of RDD method calls.
● The result of these method calls is a new RDD object, containing the
results of the requested transformation.
● Scala’s powerful anonymous function syntax makes it very easy to write
concise code for Spark.
● Most RDD transformations are based on basic principles of functional
programming and the MapReduce model.
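
For example (a sketch using the sc object of the Spark Shell), the same map
transformation can be written with an explicit anonymous function or with
Scala's placeholder shorthand:

val numbers = sc.parallelize(1 to 10)

// Explicit anonymous function
val doubled1 = numbers.map((x: Int) => x * 2)

// Equivalent, more concise placeholder syntax
val doubled2 = numbers.map(_ * 2)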

Spark Basics 20
RDD transformations
The most basic RDD transformation is map:

map[U](f: (T) ⇒ U): RDD[U]

It operates over an RDD of objects of class T. Its only argument is a function


that takes an argument of class T and produces an output object of class U.
The result is a new RDD of objects of class U, obtained by applying the
function to each element of the input RDD.

Example: myRDD.map( (x: Int) => (x * 2) )

Spark Basics 21
RDD transformations
● flatMap[U](f: (T) ⇒ TraversableOnce[U]): RDD[U]
Similar to a regular map, but the parameter function can return zero or
more elements.
● filter(f: (T) ⇒ Boolean): RDD[T]
Return a new RDD containing only the elements that satisfy a predicate.
● distinct(): RDD[T]
Return a new RDD containing the distinct elements in this RDD.
● sample(withReplacement: Boolean, fraction: Double): RDD[T]
Returns a sampled subset of this RDD.
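
A minimal sketch of these transformations in the Spark Shell (the sampled
subset is random, so its contents will vary):

val words = sc.parallelize(Seq("to", "be", "or", "not", "to", "be"))

words.flatMap(_.toList).collect()     // all the characters of all the words
words.filter(_.length > 2).collect()  // Array(not)
words.distinct().collect()            // to, be, or, not (in some order)
words.sample(false, 0.5).collect()    // a random subset of roughly half the words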

Spark Basics 22
RDD transformations
● sortBy[K](f: (T) ⇒ K, ascending: Boolean = true): RDD[T]
Return this RDD sorted by the given key function.
● union(other: RDD[T]): RDD[T]
Return the union of this RDD and another one.
● intersection(other: RDD[T]): RDD[T]
Return the intersection of this RDD and another one.
● subtract(other: RDD[T]): RDD[T]
Return an RDD with the elements from this that are not in other.
● ...
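
For instance (a sketch in the Spark Shell; the order of the collected elements
may vary):

val a = sc.parallelize(Seq(3, 1, 4, 1, 5))
val b = sc.parallelize(Seq(5, 9, 2, 6))

a.sortBy(x => x).collect()        // Array(1, 1, 3, 4, 5)
a.union(b).collect()              // all elements of both RDDs (duplicates kept)
a.intersection(b).collect()       // Array(5)
a.subtract(b).collect()           // elements of a that are not in b: 3, 1, 4, 1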

Spark Basics 23
Pair RDDs
● In Scala, tuples are part of the language’s basic syntax, and RDDs of tuples
can be created in Spark.
val myKVArray = Array((1, 'a'), (2, 'b'), (3, 'c'))
val myPairRDD = sc.parallelize(myKVArray)
● Spark can use the tuples inside the RDD as key-value pairs.
● The Spark API provides a specific set of transformations and actions for
working with key-value RDDs.
● RDDs of tuples can be used in the Java and Python APIs. These APIs
contain the appropriate extensions to allow usage similar to the Scala API.

Spark Basics 24
Pair RDD transformations
● keys: RDD[K]
● values: RDD[V]
● groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
Group the values for each key in the RDD into a single sequence.
● reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative and commutative
reduce function.
● join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Return an RDD containing all pairs of elements with matching keys.
● ...
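
A brief sketch with a small pair RDD (run in the Spark Shell; the data is made
up for the example):

val sales = sc.parallelize(Seq(("es", 10), ("fr", 20), ("es", 5)))
val names = sc.parallelize(Seq(("es", "Spain"), ("fr", "France")))

sales.keys.collect()                  // es, fr, es
sales.reduceByKey(_ + _).collect()    // (es,15), (fr,20)
sales.groupByKey().collect()          // (es, [10, 5]), (fr, [20])
sales.join(names).collect()           // (es,(10,Spain)), (es,(5,Spain)), (fr,(20,France))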

Spark Basics 25
RDD actions
Basic RDD actions:

● take(num: Int): Array[T]
Take the first num elements of the RDD.
● collect(): Array[T]
Return an array that contains all of the elements in this RDD.
● count(): Long
Return the number of elements in the RDD.
● reduce(f: (T, T) ⇒ T): T
Reduces the elements of this RDD using the specified commutative and
associative binary operator.
● foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all elements of this RDD.
● ...

Pair RDD actions:

● collectAsMap(): Map[K, V]
Return the key-value pairs in this RDD to the master as a Map.
● lookup(key: K): Seq[V]
Return the list of values in the RDD for key key.
● ...
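
A short sketch of these actions in the Spark Shell (note that in cluster mode
the output of foreach goes to the executors, not to the Driver):

val rdd = sc.parallelize(Seq(4, 8, 15, 16, 23, 42))

rdd.take(3)                 // Array(4, 8, 15)
rdd.count()                 // Long = 6
rdd.reduce(_ + _)           // Int = 108
rdd.foreach(println)        // prints each element (on the executors)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.collectAsMap()        // Map(a -> 1, b -> 2)
pairs.lookup("a")           // Seq(1)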
Spark Basics 26
Word Count with Spark RDDs
Start the Spark Shell and type the following code:

val bookRDD = sc.textFile("hdfs:///tmp/book/98-0.txt")


val wordsRDD = bookRDD.flatMap(_.split(' ')).filter(_ != "")

val wordPairRDD = wordsRDD.map((_,1))


var countedWordsRDD = wordPairRDD.reduceByKey(_+_)

countedWordsRDD = countedWordsRDD.sortBy(_._2, ascending=false)


countedWordsRDD.saveAsTextFile("hdfs:///tmp/spark_wc_output")

Now take a look at the contents of the output directory.

Spark Basics 27
Word Count with Spark RDDs in Python
Start PySpark and type the following code:

bookRDD = sc.textFile("file:///tmp/book/98-0.txt")
wordsRDD = bookRDD.flatMap(lambda x: x.split(' ')) \
.filter(lambda x: x != "")

wordPairRDD = wordsRDD.map(lambda x: (x,1))


countedWordsRDD = wordPairRDD.reduceByKey(lambda x,y : x+y)

countedWordsRDD = countedWordsRDD \
.sortBy(lambda x: x[1], ascending=False)
countedWordsRDD.saveAsTextFile("file:///tmp/pyspark_wc_output")

Spark Basics 28
