• This makes MapReduce inefficient for an important class of emerging
applications: those that reuse intermediate results across multiple
computations, such as many iterative machine learning and graph
algorithms (including PageRank, K-means clustering, and logistic
regression), or simply running multiple ad-hoc queries on the same
subset of the data.
• The only way to reuse data between computations (e.g.,
between two MapReduce jobs) is to write it to an external
stable storage system, e.g., a distributed file system.
• This incurs substantial overheads due to data replication, disk
I/O, and serialization, which can dominate application execution
times.
[Diagram: a chain of MapReduce (MR) jobs, each reading its input from HDFS and writing its result back to HDFS before the next job can reuse it.]
• In 2012, a group at AMPLab, UC Berkeley, proposed a new abstraction
called resilient distributed datasets (RDDs) that enables efficient data reuse
in a broad range of applications and was implemented in Apache Spark.
• RDDs are fault-tolerant, parallel data structures that let users explicitly
persist intermediate results in memory, control their partitioning to
optimize data placement, and manipulate them using a rich set of
operators.
Original paper: Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction
for in-memory cluster computing." In Proc. 9th USENIX NSDI, 2012.
• On the master node runs the driver program, which drives your
application. The code you write behaves as the driver program.
• Inside the driver program, the first thing you do is create a
SparkContext.
• The SparkContext is like a gateway to all Spark functionality.
• The SparkContext works with the cluster manager to manage the
various jobs.
• A job is split into multiple tasks, which are distributed over the
worker nodes.
• Any RDD created in the SparkContext can be distributed across
various nodes and cached there.
There are two ways to create RDDs:
1. Parallelizing an existing collection in your driver
program, or
2. Referencing a dataset in an external storage
system, such as the local file system, HDFS, HBase,
Cassandra, or Amazon S3.
1. Parallelizing an existing collection in your driver program,
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)
• Once created, the distributed dataset (dist_data) can be
operated on in parallel. For example, we can
call dist_data.reduce(lambda a, b: a + b) to add up
the elements of the list.
• The data is divided into n partitions, and Spark runs one task
for each partition of the cluster.
• Spark tries to set the number of partitions automatically based
on your cluster, typically 2-4 partitions for each CPU in the cluster.
• However, you can also set it manually by passing it as a second
parameter to parallelize (e.g., sc.parallelize(data, 10)).
2. Referencing a dataset in an external storage system, such as
local file system, HDFS, HBase, Cassandra, Amazon S3, etc.
From local file system
dist_data = sc.textFile("data.txt")
• If using a path on the local file system, the file must also
be accessible at the same path on worker nodes. Either
copy the file to all workers or use a network-mounted
shared file system.
From HDFS
dist_data = sc.textFile("hdfs://host:port/file")
• 1st line: creates an RDD, lines, from the file. It is not
loaded in memory; lines is merely a pointer to the file.
• 2nd line: defines a new RDD, lineLengths, as the result of a map
transformation. Because RDDs are evaluated lazily, lineLengths is not
immediately computed.
• 3rd line: runs reduce, which is an action; only now is the
computation actually performed.
1. map(func)
Return a new distributed dataset formed by passing each
element of the source through a function func.
[Diagram: map applied to an existing collection and to external data (data.txt containing "One Two" / "Three").]
Note:
• For external data, each element of the RDD is one line of the file.
• collect() is an RDD action that fetches the entire RDD to a single machine (the driver node).
1. map(func) with a custom function
• This is useful when you want a more complex map
function that cannot be written as a lambda expression; a
named function defined with def is equivalent.
2. flatMap(func)
Similar to the map transformation, but the result is flattened
afterwards.
[Diagram: flatMap applied to data.txt ("One Two" / "Three"), producing a flattened list of words.]
3. filter(func)
Return a new RDD formed by selecting those elements of the
source on which func returns true.
4. reduceByKey(func, [numPartitions])
On an RDD of (K, V) pairs, returns an RDD of (K, V) pairs
where the values for each key are aggregated using the given
reduce function func, which must be of type (V, V) => V. The
number of reduce tasks is configurable through the optional
second argument.
5. sortByKey([ascending], [numPartitions])
On an RDD of (K, V) pairs, returns an RDD of (K, V) pairs
sorted by key in ascending or descending order (specified by
the boolean argument).
➢ For other transformations (join, union, intersection, etc), you can take a
look at Apache Spark documentation
1. reduce(func)
Aggregate the elements of the RDD using a function func
(which takes two arguments and returns one).
2. collect()
Fetch the entire RDD to a single machine (the driver node). It is
only suitable for relatively small RDDs.
3. take(n)
Similar to collect, but returns only the first n elements.
➢ For other actions (count, saveAsTextFile, etc), you can take a look at
Apache Spark documentation
Suppose a web service is experiencing errors and an operator wants to
search terabytes of logs in HDFS to find the cause.
Code example
lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
hdfs_errors = errors.filter(lambda line: "HDFS" in line)
hdfs_errors.count()
Corresponding lineage (computation graph)
Consider 1 TB of Wikipedia page-view logs (2 years of data). Three
queries are performed to find the total views of: (i) all pages, (ii)
pages with titles exactly matching a given word, and (iii) pages with
titles partially matching a word. The figure on the right shows the
results on 100 nodes using Spark, with the relevant data persisted
in memory.
• For the data source, we rely on data replication, e.g., when
the data comes from HDFS.
• For data persisted in memory, e.g., the RDD errors in our
previous example, Spark rebuilds a lost partition from its lineage,
i.e., by re-applying the filter transformation only to the
corresponding partition of lines.
• Logistic regression is an iterative machine learning method
that may use gradient descent during the learning process,
with n = the number of features and m = the number of data points.
• Once we obtain the weights W, predictions can be made by
applying the learned model to new data points.
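The slide's equations did not survive extraction; a standard statement of the model and the batch gradient-descent update, consistent with n features and m data points, is:

```latex
% Hypothesis (sigmoid), with W \in \mathbb{R}^{n} and data points (x^{(i)}, y^{(i)}), i = 1, \dots, m:
h_W(x) = \frac{1}{1 + e^{-W^{\top} x}}
% Gradient-descent update with learning rate \alpha:
W := W - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_W(x^{(i)}) - y^{(i)} \right) x^{(i)}
% Once W is learned, predict \hat{y} = 1 when h_W(x) \ge 0.5, and \hat{y} = 0 otherwise.
```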
• By persisting the data points (X and Y) in memory across the
gradient-descent iterations, Spark can yield a 20x speedup.
[Figure: running logistic regression and K-Means using 10 GB of data.]
[Figure: overview of K-Means, with per-step timings (Step 4.2, Step 5).]
[Figure: running logistic regression and K-Means w.r.t. the number of machines.]
• Deployment: runs on Apache Mesos, Hadoop via YARN, or Spark's
own standalone cluster manager.
• Polyglot: provides high-level APIs in Java, Scala, Python, and
R.
• Step 1: The client submits the Spark application code, and the
driver implicitly converts the user code containing transformations
and actions into a logical directed acyclic graph (DAG).
• Step 2: The logical DAG is then converted into a physical
execution plan with multiple stages. After this conversion,
physical execution units called tasks are created under each stage.
• Step 3: The driver now talks to the cluster manager and negotiates
resources. The cluster manager launches executors on worker nodes on
behalf of the driver, and the driver sends tasks to the executors based
on data placement. When executors start, they register themselves with
the driver, so the driver has a complete view of the executors that are
executing the tasks.
• Step 4: During task execution, the driver program monitors the
set of executors that run. The driver node also schedules future
tasks based on data placement. In case of a failure, the driver
will know and will resend the task.
• Step 5: Once all the tasks are done (i.e., the action has finished
executing), the result is returned to the driver node.
Spark Web User Interface
• Since its first release (ver. 0.5, 2012), Spark has had the
RDD as its core data abstraction.
• In ver. 1.3 (2015) Spark introduced the "DataFrame", and in
ver. 1.6 (2016) it introduced the "Dataset".
• The DataFrame is provided for processing structured data
with relational queries, with richer optimizations under
the hood.
• DataFrames can be constructed from structured data files,
Hive tables, HDFS, existing RDDs, etc.
• As Spark developed, new APIs arrived, each with its own
"gateway", e.g., RDDs with SparkContext, SQL with SQLContext, Hive with
HiveContext, etc. The SparkSession gives a unified view of all these contexts.
• Several simple operations
• Filter and sort operations
• Select-where and aggregate-sort