
• Cluster computing frameworks like MapReduce have been widely adopted for large-scale data analytics, letting users write parallel computations without having to worry about work distribution and fault tolerance.
• Although MapReduce provides numerous abstractions for accessing a cluster’s computational resources, it lacks abstractions for leveraging distributed memory (RAM).
[Figure: MapReduce data flow: Input (in HDFS) → mapper → Map results (in local file system) → reducer → Reduce results (in HDFS)]
• This makes MapReduce inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations, such as many iterative machine learning and graph algorithms (e.g., PageRank, K-means clustering, and logistic regression), or simply running multiple ad-hoc queries on the same subset of the data.
• The only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system.
• This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
[Figure: chained MapReduce jobs: Input (in HDFS) → MR → result (in HDFS) → MR → result (in HDFS) → MR → result (in HDFS)]
• In 2012, a group at the AMPLab, UC Berkeley, proposed a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications and was implemented in Apache Spark.
• RDDs are fault-tolerant, parallel data structures that let users explicitly
persist intermediate results in memory, control their partitioning to
optimize data placement, and manipulate them using a rich set of
operators.

Original paper: Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." In 9th USENIX NSDI, 2012.
• In the master node, you have the driver program, which drives your application. The code you are writing behaves as a driver program.
• Inside the driver program, the first thing you do is create a SparkContext (see the sketch below).
• The SparkContext is like a gateway to all the Spark functionalities.
• This SparkContext works with the cluster manager to manage various jobs.
• A job is split into multiple tasks, which are distributed over the worker nodes.
• Anytime an RDD is created in the SparkContext, it can be distributed across various nodes and can be cached there.
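A minimal sketch of creating a SparkContext in PySpark (the application name and master URL are illustrative assumptions, not taken from the slides):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my_app").setMaster("local[*]")  # or a cluster manager URL
sc = SparkContext(conf=conf)  # used as `sc` in the examples that follow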
There are two ways to create RDDs:
1. Parallelizing an existing collection in your driver
program,
2. Referencing a dataset in an external storage
system, such as local file system, HDFS, HBase,
Cassandra, Amazon S3, etc.

1. Parallelizing an existing collection in your driver program,

data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)
• Once created, the distributed dataset (dist_data) can be operated on in parallel. For example, we can call dist_data.reduce(lambda a, b: a + b) to add up the elements of the list.
• The data is divided into n partitions and Spark will run one task for each partition of the cluster.
• Spark tries to set the number of partitions automatically based on your cluster, typically 2-4 partitions for each CPU in the cluster.
• However, you can also set it manually by passing it as a second parameter to parallelize (e.g., sc.parallelize(data, 10)).

2. Referencing a dataset in an external storage system, such as
local file system, HDFS, HBase, Cassandra, Amazon S3, etc.
From local file system
dist_data = sc.textFile("data.txt")

• If using a path on the local file system, the file must also
be accessible at the same path on worker nodes. Either
copy the file to all workers or use a network-mounted
shared file system.

2. Referencing a dataset in an external storage system, such as
local file system, HDFS, HBase, Cassandra, Amazon S3, etc.

From HDFS
dist_data = sc.textFile("hdfs://host:port/file")

# example from our pre-configured VM
dist_data = sc.textFile("hdfs://localhost:9000/purchases/purchases.txt")

* You can see HDFS “host:port” in “/home/bigdata/hadoop-2.8.5/etc/hadoop/core-site.xml”


<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

Write back an RDD to HDFS

dist_data.saveAsTextFile("hdfs://host:port/directory")
RDDs support two types of operations:
1. Transformations, which create a new RDD from an existing
one,
2. Actions, which return a value to the driver program after
running a computation on the dataset.
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

• 1st line: creates an RDD, lines, from the local file system. It is not loaded into memory; lines is merely a pointer to the file.
• 2nd line: defines a new RDD, lineLengths, as the result of a map transformation. Due to lazy evaluation, lineLengths is not immediately computed.
• 3rd line: runs reduce, which is an action; only now does Spark actually perform the computation.
1. map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.

[Figure: map() applied to an existing collection and to external data (a file data.txt containing the lines "One Two" and "Three"), followed by collect()]
Note:
• In "external data", each element refers to one line of the file.
• collect() is an RDD action that fetches the entire RDD to a single machine (the driver node).
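A minimal sketch of map() on both kinds of RDDs (assumes a SparkContext sc and a local data.txt containing the two lines above):

# map on an existing collection
nums = sc.parallelize([1, 2, 3])
print(nums.map(lambda x: x * 2).collect())              # [2, 4, 6]

# map on external data: each element is one line of the file
lines = sc.textFile("data.txt")
print(lines.map(lambda line: line.split()).collect())   # [['One', 'Two'], ['Three']]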

1. map(func) with a custom function
• This is useful when you want a more complex map function that cannot be written as a lambda expression; a named function and the equivalent lambda are sketched below.
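A sketch of a named function and the equivalent lambda (the function and data are illustrative):

def first_word_upper(line):
    # more complex logic is easier to read and test in a named function
    words = line.split()
    return words[0].upper() if words else ""

lines = sc.parallelize(["One Two", "Three"])
print(lines.map(first_word_upper).collect())   # ['ONE', 'THREE']

# equivalent, but harder to read as the logic grows:
print(lines.map(lambda l: l.split()[0].upper() if l.split() else "").collect())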

2. flatMap(func)
Similar to the map transformation, but followed by flattening the result.

[Figure: flatMap() applied to data.txt containing the lines "One Two" and "Three"]
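A minimal sketch contrasting map() and flatMap() on the same lines (assumes a SparkContext sc):

lines = sc.parallelize(["One Two", "Three"])
print(lines.map(lambda l: l.split()).collect())       # [['One', 'Two'], ['Three']]
print(lines.flatMap(lambda l: l.split()).collect())   # ['One', 'Two', 'Three']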

2. flatMap(func)
Similar to the map transformation, but followed by flattening the result.

Note that flattening here is different from flattening in NumPy; see the comparison sketched below.
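A sketch of the difference (illustrative data; assumes a SparkContext sc): Spark's flatMap flattens only one level of the iterable returned by func, whereas numpy.ndarray.flatten collapses all dimensions of an array.

import numpy as np

rdd = sc.parallelize([[1, [2, 3]], [4]])
print(rdd.flatMap(lambda x: x).collect())      # [1, [2, 3], 4]  (inner list is kept)
print(np.array([[1, 2], [3, 4]]).flatten())    # [1 2 3 4]       (fully flattened)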

3. filter(func)
Return a new RDD formed by selecting those elements of the
source on which func returns true.
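A minimal sketch of filter() (assumes a SparkContext sc):

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.filter(lambda x: x % 2 == 0).collect())   # [2, 4]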

4. reduceByKey(func, [numPartitions])
On an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. The number of reduce tasks is configurable through an optional second argument.
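A minimal sketch of reduceByKey() (assumes a SparkContext sc):

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
print(pairs.reduceByKey(lambda v1, v2: v1 + v2).collect())   # [('a', 3), ('b', 1)] (order may vary)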

5. sortByKey([ascending], [numPartitions])
On an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs sorted by key, in ascending or descending order as specified by the boolean argument.
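A minimal sketch of sortByKey() (assumes a SparkContext sc):

pairs = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])
print(pairs.sortByKey().collect())        # [('a', 1), ('b', 2), ('c', 3)]
print(pairs.sortByKey(False).collect())   # [('c', 3), ('b', 2), ('a', 1)]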

➢ For other transformations (join, union, intersection, etc), you can take a
look at Apache Spark documentation

1. reduce(func)
Aggregate the elements of the RDD using a function func
(which takes two arguments and returns one).
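A minimal sketch of reduce() (assumes a SparkContext sc):

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.reduce(lambda a, b: a + b))   # 15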

2. collect()
Fetch the entire RDD to a single machine (the driver node). It is only suitable for relatively small RDDs.

3. take(n)
Similar to collect, but returns only the first n elements.
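A minimal sketch of collect() and take() (assumes a SparkContext sc):

nums = sc.parallelize(range(10))
print(nums.collect())   # [0, 1, 2, ..., 9]  (entire RDD is brought to the driver)
print(nums.take(3))     # [0, 1, 2]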

➢ For other actions (count, saveAsTextFile, etc), you can take a look at
Apache Spark documentation
Suppose a web service is experiencing errors and an operator wants to
search terabytes of logs in HDFS to find the cause.
Code example (in PySpark)
lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
hdfs_errors = errors.filter(lambda line: "HDFS" in line)
hdfs_errors.count()
Corresponding lineage (computation graph): lines → errors → hdfs_errors → count()

One of the core concepts of Spark is representing the computation as a DAG (directed acyclic graph).

Instead of computing transformations immediately, Spark only stores how to obtain the data from the transformations (i.e., its lineage). The computation is performed only when an action is applied; in the example above, the action is count().
• One of the most important capabilities of Spark RDDs is persisting intermediate results in memory across operations.
• You can mark an RDD to be persisted, e.g., errors.persist(). The first time it is computed in an action, it will be kept in memory on the nodes.
• Using our "error log search" example, we can persist the errors RDD in memory, allowing future actions to be much faster (often by more than 10x). For example:
mysql_error = errors.filter(lambda line: "MySQL" in line).count()
error_time = errors.filter(lambda line: "MySQL" in line).map(getTime_funct)

• Consider 1 TB of Wikipedia page view logs (2 years of data). Three queries are performed to find the total views of: (i) all pages, (ii) pages with titles exactly matching a given word, and (iii) pages with titles partially matching a word. The figure on the right shows the results on 100 nodes using Spark with the relevant data persisted in memory.

• Benchmark: querying the 1 TB file from disk took 170 s.


Source: Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." In 9th USENIX NSDI, 2012.
Storage level in persisting RDD
• MEMORY_ONLY
If the RDD does not fit in memory, some partitions will not be
cached and will be recomputed on the fly each time they're
needed. This is the default level.
• MEMORY_AND_DISK
If the RDD does not fit in memory, store the partitions that
don't fit on disk, and read them from there when they're
needed.
• DISK_ONLY
Store the RDD partitions only on disk.
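A minimal sketch of choosing a storage level when persisting (reusing the errors RDD from the earlier example):

from pyspark import StorageLevel

errors.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions that do not fit in memory to disk
# errors.cache() is shorthand for persist() with the default MEMORY_ONLY level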

➢ You can see other options in Apache Spark documentation

• Regarding the data source, we rely on data replication, e.g., when we read data from HDFS.
• Regarding the data persisted in memory, e.g., the errors RDD in our previous example, Spark rebuilds it based on its lineage, i.e., it re-applies the filter transformation only to the corresponding partition of lines.

• Logistic regression is an iterative machine learning method that may use gradient descent during the learning process.

[Equation: gradient descent update for logistic regression, with n = number of features and m = number of data points]
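As a sketch of the standard form (standard notation, not necessarily the slide's exact equation), the gradient descent update for a weight vector w in R^n, m data points (x_i, y_i), and learning rate alpha is:

w := w - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \sigma(w^{\top} x_i) - y_i \right) x_i,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}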

* For more detailed derivation, you can see [here]


• Logistic regression is an iterative machine learning method that may use gradient descent during the learning process.

➢ Once we obtain W, the prediction can be computed as sketched below.

[Equation: prediction rule, with n = number of features and m = number of data points]
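A sketch of the usual prediction rule (standard logistic regression, not necessarily the slide's exact notation):

\hat{y} = \sigma(w^{\top} x) = \frac{1}{1 + e^{-w^{\top} x}},
\qquad \text{predict class } 1 \text{ if } \hat{y} \ge 0.5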

• Logistic regression is an iterative machine learning method that may use gradient descent during the learning process.
• By persisting the data points (X and Y) in memory, it can yield a 20x speedup; a PySpark sketch follows below.

[Equation: gradient descent update, with n = number of features and m = number of data points]
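A minimal PySpark sketch of this pattern (the input path, parsing, initialization, and learning rate are illustrative assumptions, not the slide's code). The key point is that points is persisted once and reused on every iteration:

import numpy as np

# each line of the (hypothetical) input file: "label f1 f2 ... fn"
def parse_point(line):
    vals = [float(v) for v in line.split()]
    return np.array(vals[1:]), vals[0]        # (x, y)

points = sc.textFile("hdfs://.../points.txt").map(parse_point).persist()
m = points.count()
n = len(points.first()[0])
w = np.zeros(n)
alpha = 0.1

for _ in range(100):                          # iterations of gradient descent
    grad = points.map(
        lambda xy: (1.0 / (1.0 + np.exp(-np.dot(w, xy[0]))) - xy[1]) * xy[0]
    ).reduce(lambda a, b: a + b)
    w -= alpha * grad / m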

Running logistic regression and K-means using 10 GB of data

[Figure: performance results. Source: RDD paper]
Overview of K-means

[Figure: K-means steps 1 through 7]

Repeat until converged, i.e., the centroids/cluster members are the same between two consecutive iterations. A sketch follows below.
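A minimal PySpark sketch of K-means following this loop (illustrative data and initialization; assumes a SparkContext sc; uses a fixed iteration cap instead of an explicit convergence check for brevity):

import numpy as np

def closest_centroid(point, centroids):
    # index of the centroid with the smallest squared distance to the point
    return int(np.argmin([np.sum((point - c) ** 2) for c in centroids]))

points = sc.parallelize(
    [np.array(p) for p in [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.0, 9.0]]]
).persist()
k = 2
centroids = points.take(k)                     # crude initialization for the sketch

for _ in range(10):                            # in practice: until assignments stop changing
    assigned = points.map(lambda p: (closest_centroid(p, centroids), (p, 1)))
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    centroids = [total / count for _, (total, count) in sorted(sums.collect())]

print(centroids)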

Running logistic regression and K-means w.r.t. the number of machines

[Figure: scalability results. Source: RDD paper]
• Immutable collections of objects spread across a cluster.
• Built through parallel transformations (map, filter, etc.) followed by actions, and represented as a DAG.
• Lazily evaluated.
• Automatically rebuilt on failure.
• Controllable persistence (e.g., caching in RAM).

• Deployment: Apache Mesos, Hadoop via YARN, or Spark’s own cluster manager.
• Polyglot: provides high-level APIs in Java, Scala, Python, and R.

• Step 1: The client submits the Spark user application code, and the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph (DAG).
• Step 2: The logical DAG is then converted into a physical execution plan with many stages. After converting into a physical execution plan, Spark creates physical execution units called tasks under each stage.
• Step 3: The driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver, and the driver sends tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing tasks.
• Step 4: During the execution of tasks, the driver program monitors the set of executors that are running. The driver node also schedules future tasks based on data placement. In case of a failure, the driver will know and will resend the task.
• Step 5: Once all the tasks are done (i.e., the action has been executed), the result is returned to the driver node.
Spark Web User Interface

[Figure: screenshot of the Spark web UI]
• Since its first release (ver. 0.5, 2012), Spark has used the RDD as its core data abstraction.
• In ver. 1.3 (2015) Spark introduced the “DataFrame”, and in ver. 1.6 (2016) it introduced the “Dataset”.
• DataFrames are provided for processing structured data with relational queries, with richer optimizations under the hood.
• DataFrames can be constructed from structured data files, Hive tables, HDFS, existing RDDs, etc.
• As Spark developed, new APIs arrived with their own “gateway”, e.g., RDDs with SparkContext, SQL with SQLContext, Hive with HiveContext, etc. SparkSession gives a unified view of all these contexts.
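A minimal sketch of the unified entry point and of building a DataFrame (the file name and columns are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_app").getOrCreate()

# from a structured data file (hypothetical path):
df_json = spark.read.json("people.json")

# from a local collection:
df = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Bob")], ["id", "name"])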
• Several simple operations, e.g., as sketched below.
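A sketch of a few simple DataFrame operations, continuing with the df defined above:

df.printSchema()                                      # column names and types
df.show()                                             # print the rows
df.select("name").show()                              # project a column
df.withColumn("id_plus_one", df["id"] + 1).show()     # add a derived column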

• Filter and sort operations, e.g., as sketched below.
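A sketch of filtering and sorting the same df:

df.filter(df["id"] > 1).show()             # rows where id > 1
df.orderBy(df["name"].desc()).show()       # sort by name, descending
df.filter(df["id"] > 1).orderBy("id").show()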

• Select-where and aggregate-sort queries, e.g., as sketched below.

o And so on for other query expressions.

➢ Using Spark SQL, you can also run SQL queries directly against a DataFrame.
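A sketch of select-where, aggregate-sort, and an equivalent Spark SQL query on the same df (the view name is illustrative):

df.select("name").where(df["id"] > 1).show()
df.groupBy("name").count().orderBy("count", ascending=False).show()

# the same aggregation via Spark SQL on a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name ORDER BY cnt DESC").show()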

