
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
BIRSU ION
CHIOIBAS EMIL
GHIMISI ALEXANDRU
Context

 Cloud computing widely adopted for large-scale data analytics

 Current frameworks are inefficient at:
 Reusing intermediate results across multiple computations
 e.g. iterative ML and graph algorithms
 Interactive data mining
 Running ad-hoc queries on the same subset of data
Context

 Frameworks that support only specific computation patterns


 E.g. Pregel, HaLoop

 SOLUTION: Resilient Distributed Datasets (RDDs)


 Load several datasets into memory
 Run queries across them
Resilient Distributed Datasets

 A read-only, partitioned collection of records


 Stored in RAM or on Disk
 Fault-tolerant, parallel data structures
 Automatically rebuilt on failure
 Users can:
 Persist intermediate results in memory
 Control their partitioning
 Manipulate them using operators
Fault tolerance interface

 An interface based on coarse-grained transformations


 Apply the same operation to many data items

HOW DOES IT WORK?


 Log the transformations used to build the dataset (its lineage)
 Recompute lost partitions using the lineage information (sketched below)
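
A minimal Scala sketch of this idea, in the spirit of the paper's log-mining example; sc is an existing SparkContext and the HDFS path is illustrative:

  // Each transformation is recorded in the RDD's lineage graph.
  val lines  = sc.textFile("hdfs://host/logs")          // base RDD in stable storage
  val errors = lines.filter(_.startsWith("ERROR"))      // transformation, logged in the lineage
  val fields = errors.map(_.split('\t')(3))             // another logged transformation
  // If a partition of `fields` is lost, Spark re-applies filter and map only to the
  // corresponding partition of `lines`; no replication of the intermediate data is needed.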
How are RDDs created?

 Implemented in Spark
 Computing framework for iterative tasks
 Exposed through a language-integrated API

 Created using:
 Data in stable storage
 Other RDDs (using transformations)
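
A minimal sketch of the two creation paths listed above, assuming an existing SparkContext sc; the input path is illustrative:

  val fromStorage = sc.textFile("hdfs://host/input")        // created from data in stable storage
  val fromAnother = fromStorage.map(line => line.length)    // created from another RDD via a transformation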
Persistence

 Users indicate which RDDs they will reuse in the future

 By default: persisted in memory
 Not enough RAM?: spilled to disk

 Other persistence strategies requested via flags to persist (sketched below)

 Users can also set a persistence priority per RDD
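
A hedged sketch of these options using the StorageLevel constants of released Spark versions (the paper's prototype exposed them as flags to persist); the path and RDD are illustrative:

  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("hdfs://host/input").filter(_.nonEmpty)

  // Pick one strategy per RDD; Spark does not allow changing it afterwards.
  lines.persist(StorageLevel.MEMORY_ONLY)        // default: deserialized Java objects in RAM
  // lines.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk when RAM runs out
  // lines.persist(StorageLevel.DISK_ONLY)       // keep the RDD only on disk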
RDDs vs. Distributed Shared Memory
Spark Programming Interface

 RDD abstraction through an API in Scala

 RDDs are statically typed objects parametrized by an element type

HOW TO USE?
 Write a driver program
 Define RDDs
 Invoke actions on them
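
A minimal driver-program sketch following the three steps above; the class name, app name and log path are illustrative, the SparkConf/SparkContext constructors are those of released Spark (which differ from the paper's prototype), and the master is assumed to be supplied by spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  object ErrorCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("ErrorCount"))

      val errors = sc.textFile("hdfs://host/logs")      // define an RDD (evaluated lazily)
                     .filter(_.contains("ERROR"))

      println(errors.count())                           // action: triggers the computation
      errors.take(10).foreach(println)                  // action: returns records to the driver

      sc.stop()
    }
  }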
RDD operations in Spark
Representing RDDs

 A common interface consisting of:


 A set of partitions (atomic pieces of the dataset)
 A set of dependencies on parent RDDs
 A function for computing the dataset based on its parents
 Metadata about partitioning scheme
 Data placement
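
A hedged Scala sketch of this common interface; the method names follow the slide, while the auxiliary types are simplified placeholders rather than Spark's real classes:

  final case class Partition(index: Int)                   // an atomic piece of the dataset
  sealed trait Dependency                                   // edge to a parent RDD (narrow or wide)

  trait SimpleRDD[T] {
    def partitions: Seq[Partition]                          // set of partitions
    def dependencies: Seq[Dependency]                       // set of dependencies on parent RDDs
    def compute(split: Partition): Iterator[T]              // compute a partition from its parents
    def partitioner: Option[String]                         // metadata about the partitioning scheme
    def preferredLocations(split: Partition): Seq[String]   // data-placement hints (e.g. HDFS hosts)
  }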
Representing RDDs
Dependencies between RDDs

 Types:
 Narrow: each partition of the parent RDD is used by at most one partition of the child RDD
 Wide: each partition of the parent RDD may be used by multiple partitions of the child RDD
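
A short sketch of how the dependency type follows from the transformation used (path and key layout are illustrative):

  val pairs = sc.textFile("hdfs://host/input")
                .map(line => (line.split(",")(0), 1))  // narrow: one parent partition per child partition

  val counts = pairs.reduceByKey(_ + _)                // wide: a child partition may read every parent partition
                                                       // (narrow only if the parent is already hash-partitioned by key)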
Example of dependencies
Implementation

 The system runs over the Mesos cluster manager


 Each Spark program runs as a separate Mesos application
 Mesos handles resource sharing between applications
Job Scheduling

 Similar to Dryad’s scheduler, but takes into account which partitions are available in memory
 Assigns tasks to machines based on data locality
 Not fault-tolerant to scheduler failures
DAG of stages
Example of Spark interpreter translation
Memory management

 Three options for storing persistent RDDs:
 in-memory storage as deserialized Java objects
 in-memory storage as serialized data
 on-disk storage
Evaluation

 Comparison with existing frameworks (Hadoop)


 Machine learning algorithms
 PageRank
 Fault recovery
 Behavior when data does not fit in memory
Iterative Machine Learning Applications
 Hadoop takes longer in the first iteration, partly due to the heartbeat protocol between its master and workers
 In later iterations, Spark is up to 20x faster; Hadoop remains slower due to:
 The minimum overhead of the Hadoop software stack
 The overhead of HDFS while serving data
 The deserialization cost of converting binary records to usable in-memory Java objects
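
A hedged sketch of the kind of iterative job benchmarked here: logistic regression with the training points persisted in memory across iterations. The input format, dims and iterations values are assumptions, not the paper's exact benchmark code:

  case class Point(x: Array[Double], y: Double)

  def parsePoint(line: String): Point = {              // assumed input format: "y x1 x2 ..."
    val nums = line.split("\\s+").map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  val dims = 10                                        // illustrative feature count
  val iterations = 10                                  // illustrative iteration count

  val points = sc.textFile("hdfs://host/points")
                 .map(parsePoint)
                 .persist()                            // reused every iteration, so keep it in RAM

  var w = Array.fill(dims)(0.0)
  for (_ <- 1 to iterations) {
    val gradient = points.map { p =>
      val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
      p.x.map(_ * ((1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y))
    }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    w = w.zip(gradient).map { case (wi, gi) => wi - gi }
  }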
PageRank
Fault recovery
Behavior with Insufficient Memory
Discussion

 The programming interface appears limited, due to RDDs' immutable nature and coarse-grained transformations

 Nevertheless, RDDs are suitable for a wide class of applications
 Advantages that result in optimizations:
 Keeping specific data in memory
 Partitioning the data to minimize communication (sketched after this list)
 Recovering from failures efficiently
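
A hedged sketch of the partitioning optimization, in the spirit of the paper's PageRank example: hash-partitioning the link table once so that later joins with the ranks RDD need no shuffle (input path and partition count are illustrative):

  import org.apache.spark.HashPartitioner

  val links = sc.textFile("hdfs://host/links")
                .map { line => val p = line.split("\\s+"); (p(0), p(1)) }   // (page, outlink)
                .groupByKey(new HashPartitioner(100))                       // one record per page, hash-partitioned
                .persist()

  val ranks = links.mapValues(_ => 1.0)       // mapValues preserves the partitioner
  // links.join(ranks) is now shuffle-free: both RDDs share the same partitioner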
Expressing existing programming models

 MapReduce – flatMap + groupByKey / reduceByKey (word count sketched after this list)


 DryadLINQ – bulk operators that correspond to RDD transformations (e.g. map, groupByKey, join)
 SQL – data-parallel operations on sets of records
 Pregel – a specialized model for iterative graph applications, where a user function is applied to all the vertices on each iteration
 Iterative MapReduce – HaLoop, Twister – easily expressed with RDDs
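
A hedged word-count sketch of the MapReduce bullet above, expressed with flatMap + reduceByKey (paths are illustrative):

  val counts = sc.textFile("hdfs://host/text")
                 .flatMap(_.split("\\s+"))             // "map" phase: emit words
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)                   // "reduce" phase: combine counts per key

  counts.saveAsTextFile("hdfs://host/wordcounts")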
Related work

 Data flow models such as MapReduce, Dryad, and CIEL – rich set of operators for processing data, but they share data between computations only through stable storage systems
 High-level programming interfaces for data flow systems (DryadLINQ,
FlumeJava) – provide language-integrated APIs. However, the parallel
collections represent files on disk.
 Caching systems (Nectar) can reuse intermediate results across DryadLINQ jobs
by identifying common subexpressions with program analysis. However, Nectar
does not provide in-memory caching, placing the data in a distributed file system
 Relational databases – RDDs are conceptually similar to views, and persistent RDDs resemble materialized views. However, RDDs do not allow fine-grained read-write access, which lets them avoid the overhead of logging operations and data that databases need for fault tolerance.
Conclusions

 RDDs are an efficient, general-purpose and fault-tolerant abstraction for sharing


data in cluster applications
 A wide range of parallel applications can be expressed with RDDs
 Although RDDs are limited by only using bulk operations, many parallel
applications are suited to this model
 In case of node failure, data can be efficiently recovered using lineage
 Spark, the paper's implementation of RDDs, outperforms Hadoop by up to 20x in iterative applications
