Professional Documents
Culture Documents
Jesús Montes
jesus.montes@upm.es
Oct. 2022
Spark Basics
Apache Spark
● Defined by its creators as a “fast, general purpose cluster computing
system”.
● Centered on a data structure called resilient distributed dataset (RDD)
● Based on principles similar to MapReduce, but allowing a more flexible
dataflow structure.
● Uses in-memory computation to optimize performance.
● Nowadays, is the most popular MapReduce successor.
● Its current stable version is 3.3.0 (released Jun 2022)
● Spark can run as a standalone, local ● Spark Core: The base component
application or in a cluster, on top of ● Spark SQL: High-level structured-data
Yarn, Kubernetes, Mesos... processing
● Spark can read and write data from ● MLlib: Machine learning
many storage technologies ● GraphX: Graph processing
○ Local filesystem ● Spark Streaming: Stream processing
○ HDFS
○ Hive
○ HBase
○ Cassandra
○ ...
http://spark.apache.org/docs/latest/
Spark Basics 4
Spark Stack
Spark Basics 5
Spark Core
It contains the basic functionality of Spark
● Task scheduling
● Memory management
● Fault recovery
● Storage-layer interaction
● Cluster resource manager interaction (YARN/Kubernetes/Mesos)
● Resilient Distributed Datasets (RDDs) API
The Spark Core makes possible to create and manage Spark Applications.
Spark Basics 6
Spark Application
All Spark applications have the same components
Spark Basics 7
Spark Application
Spark Driver
Spark Context
Cluster Manager
Spark Basics 8
Spark Context
A Spark Context represents a connection to a computing cluster:
Using a Spark Context, the application can create Spark objects, like RDDs,
accumulators and broadcast variables.
Spark Basics 9
Resilient Distributed Dataset (RDD)
An RDD is a distributed collection of items.
Spark Basics 10
Resilient Distributed Dataset (RDD)
● RDDs are managed by RDD Cluster node
objects.
Partition
● An RDD object contains Memory partition
references to Partition
objects.
● Each Partition object Cluster node
corresponds to a subset of RDD Partition
Memory partition
data.
● Partitions are assigned to
nodes in the cluster.
Cluster node
● By default, each partition
Partition
will be stored in memory. Memory partition
Spark Basics 11
Resilient Distributed Dataset (RDD)
Spark Basics 12
Transformations and Actions
Transformations: Actions:
● RDDs are immutable, so the only way to ● Actions compute a result based on the
transform its content if by generating a contents of an RDD. The result can be
new RDD. returned to the Driver or stored (in a
● Transformations are not executed when local file, HDFS, etc.).
declared. Instead, they are registered by ● When an action is requested, the Driver
the Spark Driver and organized into an performs the execution plan, computing
execution plan. the target RDD contents and applying
● Examples of transformations: filter the the indicated action.
values of an RDD, group values by some ● Examples of actions: output the RDD
criteria. contents (totally or partially), count the
elements in an RDD.
Spark Basics 13
Transformations and Actions
Creation Transformation
RDD Action
Lineage
Result
Spark Basics 14
The Spark Shell
Spark can be used interactively through the Spark Shell
Spark Basics 15
Spark APIs
● Scala
○ The most complete one (after all, Spark is written in Scala)
○ The default language of the Spark Shell
● Java and Python
○ Almost as complete as the Scala API
○ Some minor limitations/difficulties, but perfectly functional
● R
○ Only available at a DataFrame level (no direct access to RDDs)
○ Limited, but functional for basic/medium-complexity operations
○ Greatly improved in Spark 2.x, with a promising future
http://spark.apache.org/docs/latest/
Spark Basics 16
Let’s try Spark
1. First, start the Spark Shell val nums = Array((1 to 50):_*)
# spark-shell val oneRDD = sc.parallelize(nums)
This will start the Scala interpreter and oneRDD.count
initialize a Spark Context. After a short
while, you will see a scala prompt val otherRDD = oneRDD.map(_ * 2)
(scala>) val result = otherRDD.collect
2. Type the example code on the right in for (v <- result) println(v)
the shell. Make sure to run one line at a
time, and check the results of each one val sum = oneRDD.reduce(_ + _)
of them. print(sum)
Spark Basics 17
Let’s try PySpark
1. First, start the PySpark Shell nums = list(range(1,51))
# pyspark oneRDD = sc.parallelize(nums)
This will start the Python interpreter and oneRDD.count()
initialize a Spark Context. After a short
while, you will see a python prompt otherRDD = oneRDD.map(lambda x: x *
(>>>) 2)
2. Type the example code on the right in result = otherRDD.collect()
the shell. Make sure to run one line at a for v in result:
time, and check the results of each one print(v)
of them.
sum = oneRDD.reduce(lambda x,y: x+y)
print(sum)
Spark Basics 18
Creating/Loading/Saving RDDs
● Parallelize an existing collection
val myRDD1 = sc.parallelize(Array('a','b','c'))
● Load from file Only in
local mode
val myRDD2 = sc.textFile("file:///tmp/98-0.txt")
val myRDD3 = sc.textFile("hdfs:///tmp/book/98-0.txt")
● Save to file Only if HDFS
available
oneRDD.saveAsTextFile("/tmp/oneRDD")
otherRDD.saveAsObjectFile("/tmp/otherRDD")
● Different sources, destinations and data formats are supported.
Spark Basics 19
RDD transformations
● Most RDD transformations are based on passing functions as parameters
of RDD method calls.
● The result of these method calls is a new RDD object, containing the
results of the requested transformation.
● Scala’s powerful anonymous function syntax makes very easy to write
concise code for Spark.
● Most RDD transformations are based on basic principles of functional
programming and the MapReduce model.
Spark Basics 20
RDD transformations
The most basic RDD transformation is map:
Spark Basics 21
RDD transformations
● flatMap[U](f: (T) ⇒ TraversableOnce[U]): RDD[U]
Similar to a regular map, but the parameter function can return zero or
more elements.
● filter(f: (T) ⇒ Boolean): RDD[T]
Return a new RDD containing only the elements that satisfy a predicate.
● distinct(): RDD[T]
Return a new RDD containing the distinct elements in this RDD.
● sample(withReplcment: Boolean, fraction: Double): RDD[T]
Returns a sampled subset of this RDD.
Spark Basics 22
RDD transformations
● sortBy[K](f: (T) ⇒ K, ascending: Boolean = true): RDD[T]
Return this RDD sorted by the given key function.
● union(other: RDD[T]): RDD[T]
Return the union of this RDD and another one.
● intersection(other: RDD[T]): RDD[T]
Return the intersection of this RDD and another one.
● subtract(other: RDD[T]): RDD[T]
Return an RDD with the elements from this that are not in other.
● ...
Spark Basics 23
Pair RDDs
● In Scala, tuples are part of the language basic syntax, and RDDs of tuples
can be created in Spark.
val myKVArray = Array((1, 'a'), (2, 'b'), (3, 'c'))
val myPairRDD = sc.parallelize(myKVArray)
● Spark can use the tuples inside the RDD as key-value pairs.
● The Spark API provides a specific set of transformations and actions for
working with key-value RDDs.
● RDDs of tuples can be used in the Java and Python APIs. These APIs
contain the appropriate extensions to allow usage similar to the Scala API.
Spark Basics 24
Pair RDD transformations
● keys: RDD[K]
● values: RDD[V]
● groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
Group the values for each key in the RDD into a single sequence.
● reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative and commutative
reduce function.
● join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Return an RDD containing all pairs of elements with matching keys.
● ...
Spark Basics 25
RDD actions
Basic RDD actions: ● foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all elements of
● take(num: Int): Array[T] this RDD.
Take the first num elements of the RDD. ● ...
● collect(): Array[T]
Return an array that contains all of the Pair RDD actions:
elements in this RDD.
● count(): Long ● collectAsMap(): Map[K, V]
Return the number of elements in the RDD. Return the key-value pairs in this RDD to
● reduce(f: (T, T) ⇒ T): T the master as a Map.
Reduces the elements of this RDD using the ● lookup(key: K): Seq[V]
specified commutative and associative Return the list of values in the RDD for
binary operator. key key.
● ...
Spark Basics 26
Word Count with Spark RDDs
Start the Spark Shell and type the following code:
Spark Basics 27
Word Count with Spark RDDs in Python
Start PySpark and type the following code:
bookRDD = sc.textFile("file:///tmp/book/98-0.txt")
wordsRDD = bookRDD.flatMap(lambda x: x.split(' ')) \
.filter(lambda x: x != "")
countedWordsRDD = countedWordsRDD \
.sortBy(lambda x: x[1], ascending=False)
countedWordsRDD.saveAsTextFile("file:///tmp/pyspark_wc_output")
Spark Basics 28