
Architecture and Components of Spark

▸ the API libraries form the top layer; all of them can be used within a single Spark application
▸ the bottom layer is the cluster managers that Spark works with for resource management
▸ Spark Core: distributes workloads, monitors applications, schedules tasks, manages memory, handles fault recovery, interacts
with HDFS, houses the APIs that define RDDs
▸ Spark API Libraries: built on top of Spark Core; they inherit all Spark Core features such as fault tolerance...
    ▹ Spark SQL: structured data processing
    ▹ Spark Streaming: processing of live data streams
    ▹ Spark MLlib: common machine learning functionality
    ▹ GraphX: library for manipulating graphs and performing graph parallel computations
    ▹ SparkR: provides lightweight frontend to use Apache Spark from R
▸ Spark Cluster Manager: for resource allocation - Spark can connect to pluggable resource managers
    ▹ Standalone: simple cluster manager included within Spark that makes it easy to set up a cluster
    ▹ Apache Mesos: general cluster manager that can run Hadoop MapReduce and service applications
    ▹ Hadoop YARN: cluster manager of Hadoop 2
▸ Spark Runtime Architecture
    ▹ master-slave architecture; master = driver, slave = executor
    ▹ the driver and executors each run in their own Java processes
    ▹ a Spark application is launched using a cluster manager
 
▸ SparkContext
    ▹ main entry point to everything Spark
    ▹ defined in the driver program
    ▹ tells Spark how and where to access cluster
    ▹ connects to cluster manager
    ▹ coordinates Spark processes running on different nodes
    ▹ used to create RDDs and shared variables on a cluster
▸ Driver
    ▹ this is where the main() method of the Spark program runs
    ▹ converts the user program into tasks (the smallest unit of work); tasks are bundled into "stages"
    ▹ schedules tasks on executors
▸ Executors
    ▹ run for the entire lifetime of an application
    ▹ register themselves with the driver, which allows the driver to schedule tasks
    ▹ worker processes that run individual tasks and return results to the driver
    ▹ provide in-memory storage for RDDs as well as disk storage
▸ Workflow:
User app → driver program contacts the cluster manager for resources → cluster manager launches executors → driver
splits the program into tasks and sends them to executors → executors perform the tasks and return results to the driver →
executors are terminated and resources released
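This workflow maps onto a minimal PySpark driver sketch; the app name and the local[2] master URL below are illustrative placeholders:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("demo-app").setMaster("local[2]")   # placeholder app name and master URL
sc = SparkContext(conf=conf)              # SparkContext connects to the cluster manager
rdd = sc.parallelize([1, 2, 3, 4])        # the driver creates an RDD
print(rdd.count())                        # action: executors run tasks and return results to the driver
sc.stop()                                 # executors terminate and resources are released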

Resilient Distributed Datasets


▸ APIs of Spark
    ▹ RDD: core data API
    ▹ DataFrame API: uses schema to describe data
    ▹ DataSet API: combines RDD and DataFrame API
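A minimal sketch contrasting the RDD and DataFrame APIs (the Dataset API is available only in Scala and Java); the column names and sample rows here are illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("api-demo").getOrCreate()
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])   # RDD: no schema attached
df = spark.createDataFrame(rdd, ["name", "age"])                     # DataFrame: a schema describes the data
df.filter(df.age > 30).show()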
▸ RDD:
    ▹ abstraction for working with data in Spark
    ▹ "distributed collection of elements"
    ▹ Spark distributes RDDs across nodes in a cluster
    ▹ fault tolerant and can be rebuilt
    ▹ read-only collection of elements, immutable once constructed
▸ Creating RDDs (3 methods)
    ▹ parallelize/distribute a collection (list or set)
    ▹ load an external dataset or external storage system
    ▹ transform an existing RDD into new RDD
lines = sc.parallelize(["I","am","learning","Spark"])   # 1) parallelize an in-memory collection
lines = sc.textFile("README.md")                        # 2) load an external dataset
newlines = lines.filter(lambda s: "Spark" in s)         # 3) transform an existing RDD into a new RDD
▸ Operations on RDDs
    ▹ Transformations
        ⦿ creates a new dataset from an existing one and returns a new RDD (because the existing RDD is immutable!)
        ⦿ lazy evaluation, not computed immediately
        ⦿ works element wise
        ⦿ transformations are recorded into DAG
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningRDD)
    ▹ Actions
        ⦿ RDDs are evaluated when action is called
        ⦿ entire RDD is computed from scratch
print "Number of problems in logfile:" + badlinesRDD.count()
badlinesRDD.take(10)
▸ Lineage graph of RDD (DAG)
    ▹ when more transformations are called and new RDDs are derived, lineage graph is updated and keeps track of
relationship between RDDs
    ▹ if a task fails, Spark uses the lineage in the DAG to recompute the lost RDD partitions on demand, making RDDs fault tolerant
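The lineage recorded in the DAG can be inspected with toDebugString(); a small sketch continuing the log example above (output formatting varies by Spark version):
print(badlinesRDD.toDebugString())   # prints the chain of parent RDDs this RDD was derived from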
    
▸ Spark Program Lifecycle
    ▹ create an RDD from external data or an existing collection
    ▹ lazily transform into new RDD
    ▹ cache some RDD for repeated use
    ▹ perform actions on RDD
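A minimal sketch of this lifecycle (the file name data.txt is a placeholder):
rdd = sc.textFile("data.txt")                      # 1) create an RDD from external data
words = rdd.flatMap(lambda line: line.split())     # 2) lazily transform into a new RDD
words.cache()                                      # 3) cache the RDD for repeated use
print(words.count())                               # 4) action: triggers the actual computation
print(words.take(5))                               # reuses the cached RDD instead of re-reading the file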

RDD Operations
▸ Spark offers >80 high-level operations beyond MapReduce
    ▹ ex. transformations: map, filter, distinct, flatMap
    ▹ ex. actions: collect, count, take, reduce
▸ map: apply function to each element in RDD and return new RDD
RDD {4,5,10,10}, using Scala syntax:
rdd.map(x => x+1)
Result: {5,6,11,11}
▸ filter
rdd.filter(x => x < 10)
Result: {4,5}
▸ distinct
rdd.distinct()
Result: {4,5,10}
▸ flatMap: similar to map but each element can map to a sequence rather than a single element; apply the function, then flatten the result
rdd.flatMap(x => List(x-1, x+1))
Result: {3,5,4,6,9,11,9,11}

▸ collect: return all elements
rdd.collect()
Result: {4,5,10,10}
▸ count
rdd.count()
Result: 4
▸ take(num)
rdd.take(2)
Result: {4,5}
▸ reduce(func): combine elements of the RDD together in parallel
rdd.reduce((x,y) => x+y)
Result: 29
▸ Sample Program
data = range(1, 30)                                  # values 1..29
xrangeRDD = sc.parallelize(data, 4)                  # distribute the data across 4 partitions
subRDD = xrangeRDD.map(lambda x: x - 1)
filteredRDD = xrangeRDD.filter(lambda x: x < 10)
subRDD.collect()
subRDD.count()

    ▹ partitioning is not necessarily into equal parts; it is based on the workers' capacity


result: subRDD.collect() → [0, 1, 2, ..., 28]; subRDD.count() → 29
result: filteredRDD.collect() → [1, 2, ..., 9]
  

Key Value Pair RDDs


▸ ex. Bag-of-words model
input = sc.textFile("data.txt")
lines = input.flatMap(lambda line: line.split())
pairs = lines.map(lambda word: (word,1))
counts = pairs.reduceByKey(lambda a,b: a+b)
or by using groupByKey:
input = sc.textFile("data.txt")
lines = input.flatMap(lambda line: line.split())
pairs = lines.map(lambda word: (word,1))
groups = pairs.groupByKey()
counts = groups.map(lambda kv: (kv[0], sum(kv[1])))
▹ but the second approach is inefficient: pairs are not combined locally before the shuffle, so every (word, 1) pair is
transferred across the network
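A small sketch of the difference on toy data (both return the same counts; output order may vary):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # [('a', 2), ('b', 1)]: values combined within each partition before the shuffle
print(pairs.groupByKey().mapValues(sum).collect())       # same result, but every (key, 1) pair is shuffled first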

Quiz

1) Which of the following are Spark API libraries?
Correct answer: Spark SQL, Spark MLlib
Explanation: Although ETL and deep learning can be performed using Spark, there are no libraries of these names. ETL can be performed using Spark SQL and deep learning can be performed using Spark MLlib.

2) Spark Core is the base engine of Spark. Choose the correct functions of Spark Core:
Correct answer: Task scheduling and memory management, Fault recovery
Explanation: Spark Core does not provide a storage system but is able to interact with other storage systems.

3) Spark has its own cluster manager.
Correct answer: True
Explanation: Spark includes its own standalone cluster manager and also supports working with YARN and Mesos.

4) Which of the following are true about Spark driver and executor processes?
Correct answer: The driver program is the process where the main() method of the Spark program runs; a Spark program consists of a driver program and executors
Explanation: The driver schedules tasks on executors and not vice versa.

5) Executors provide in-memory storage for RDDs.
Correct answer: True
Explanation: Executors provide in-memory storage for RDDs, as well as disk storage, so that running tasks can persist intermediate results to memory or disk.

6) SparkContext is the main entry point to Spark programs.
Correct answer: True

7) What is a task?
Correct answer: The smallest unit of work sent to one executor
Explanation: A task is the smallest unit of work sent to one executor; tasks are bundled into "stages". The driver splits the user program into tasks and stages.

8) The driver returns results to executors.
Correct answer: False
Explanation: The driver schedules tasks on executors. Results from these tasks are delivered back to the driver.

9) A Spark program consists of a driver program, a user program and a cluster manager.
Correct answer: False
Explanation: A cluster manager is not part of a Spark program. A cluster manager is only used to launch a Spark program.

10) When do executors terminate in a Spark program?
Correct answer: When the driver's main() method exits, When SparkContext.stop() is called
Explanation: When the driver's main() method exits or SparkContext.stop() is called, executors are terminated and the cluster manager releases the resources.

11) Executor processes start when a task starts and terminate upon task completion.
Correct answer: False
Explanation: Executor processes are Java processes, launched once at the beginning of the application, and typically run for the entire lifetime of an application. The lifetime of a task does not determine the lifetime of an executor process.

Quiz

1) reduceByKey is preferred to groupByKey.
Correct answer: True
Explanation: groupByKey causes shuffling of large amounts of data and hence is not preferred. reduceByKey, on the other hand, reduces by key first and then shuffles data to worker nodes where further reducing happens.

2) Actions are lazily evaluated.
Correct answer: False
Explanation: Actions are evaluated immediately and are responsible for getting results from Spark data operations.

3) Which of the following are operators in Spark?
Correct answer: map, reduce
Explanation: "print" is not an operator, meaning it is neither a transformation nor an action. "print" is offered by the native Java, Python or Scala API.

4) The DAG is lazily evaluated.
Correct answer: True
Explanation: A DAG keeps track of the lineage of an RDD and is only evaluated when an action is called.

5) reduceByKey can be called on pair RDDs only.
Correct answer: True
Explanation: reduceByKey aggregates elements by key, where each element is a key-value pair.

6) If an RDD wordsRDD contains {'pencil', 'paper', 'computer', 'mouse'}, what is the result of wordsRDD.map(lambda x : x + 's')?
Correct answer: a new RDD, an RDD containing {'pencils', 'papers', 'computers', 'mouses'}
Explanation: wordsRDD is transformed into a new RDD where each element of wordsRDD is appended with the letter 's'.

7) A function passed to a Spark operator is executed on each element of the RDD.
Correct answer: True
Explanation: When a Spark operator takes a function as a parameter, operations are invoked on RDDs by executing the function on each element of the RDD.

8) Operations on RDDs are grouped into transformations, collections and actions.
Correct answer: False
Explanation: Operations on RDDs are either transformations or actions.

9) RDDs can be created by which of the following approaches?
Correct answer: By using a Spark API like textFile(), By transforming another RDD
Explanation: RDDs can be created by loading from external storage like a file system. textFile() on the SparkContext converts the contents of a file into an RDD. Transformation operations on an RDD also result in an RDD. But action operations on an RDD do not yield an RDD.

10) If an RDD wordsRDD contains {'pencil', 'paper', 'computer', 'mouse'}, what is the result of wordsRDD.map(lambda x : len(x)).reduce(lambda x,y: x+y)? Hint: len() is a function that returns the length of a string.
Correct answer: the value 24
Explanation: The map function returns an RDD containing the length of each string, which is {6,5,8,5}. The reduce function is chained to the map function and adds all the lengths to return the value of 24.

11) Which of the following statements are true about RDDs?
Correct answer: All of the above
Explanation: The RDD is the primary data API allowing data to be processed in Spark. RDDs are distributed collections of elements that can be reconstructed on failure and hence are fault tolerant. RDDs are immutable, as transformations on RDDs result in new RDDs with the original RDD staying untouched.

12) Transformations on RDDs result in:
Correct answer: a new RDD, an update of the DAG
Explanation: Transformations of RDDs are lazily evaluated and result in a new RDD, without modifying the original RDD. The DAG is updated to reflect the transformation, and it also serves as the lineage graph used to recover at times of failure.

13) The filter() operator in Spark is an action and returns the filtered results immediately.
Correct answer: False
Explanation: filter() is a transformation and is lazily evaluated.

14) Which of the following statements are true:
Correct answer: All of the above
Explanation: count, collect and take are all actions returning results to the driver or persistent storage.

15) Calling collect() on a large RDD can run into memory errors.
Correct answer: True
Explanation: If an RDD is very large such that it does not fit into the driver's memory, you will get a memory error. collect() will attempt to copy every single element in the RDD onto the single driver, and then run out of memory and crash.