
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
BIRSU ION
CHIOIBAS EMIL
GHIMISI ALEXANDRU
Context

 Cloud computing widely adopted for large-scale data analytics

 Current frameworks are inefficient at:
 Reusing intermediate results across multiple computations
 e.g. iterative ML and graph algorithms
 Interactive data mining
 Running ad-hoc queries on the same subset of data
Context

 Frameworks that support only specific computation patterns


 E.g. Pregel, HaLoop

 SOLUTION: Resilient Distributed Datasets (RDDs)


 Load several datasets into memory
 Run queries across them
Resilient Distributed Datasets

 A read-only, partitioned collection of records


 Stored in RAM or on Disk
 Fault-tolerant, parallel data structures
 Automatically rebuilt on failure
 Users can:
 Persist intermediate results in memory
 Control their partitioning
 Manipulate them using operators
Fault tolerance interface

 An interface based on coarse-grained transformations


 Apply the same operation to many data items

HOW DOES IT WORK?


 Log the transformations used to build the dataset (its lineage)
 Recompute lost partitions using the lineage information (sketched below)
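
A minimal Scala sketch of this idea, in the spirit of the paper's log-mining example; sc is an existing SparkContext and the HDFS path is illustrative:

  // Each transformation is recorded in the RDD's lineage graph.
  val lines  = sc.textFile("hdfs://host/logs")          // base RDD in stable storage
  val errors = lines.filter(_.startsWith("ERROR"))      // transformation, logged in the lineage
  val fields = errors.map(_.split('\t')(3))             // another logged transformation
  // If a partition of `fields` is lost, Spark re-applies filter and map only to the
  // corresponding partition of `lines`; no replication of the intermediate data is needed.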
How are RDDs created?

 Implemented in Spark
 Computing framework for iterative tasks
 Exposed through a language-integrated API

 Created using:
 Data in stable storage
 Other RDDs (using transformations)
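
A minimal sketch of the two creation paths listed above, assuming an existing SparkContext sc; the input path is illustrative:

  val fromStorage = sc.textFile("hdfs://host/input")        // created from data in stable storage
  val fromAnother = fromStorage.map(line => line.length)    // created from another RDD via a transformation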
Persistence

 Users indicate which RDDs they will reuse in the future

 By default: persisted in memory
 Not enough RAM?: spilled to disk

 Other persistence strategies requested via flags to persist (sketched below)

 Users can also set a persistence priority per RDD
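
A hedged sketch of these options using the StorageLevel constants of released Spark versions (the paper's prototype exposed them as flags to persist); the path and RDD are illustrative:

  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("hdfs://host/input").filter(_.nonEmpty)

  // Pick one strategy per RDD; Spark does not allow changing it afterwards.
  lines.persist(StorageLevel.MEMORY_ONLY)        // default: deserialized Java objects in RAM
  // lines.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk when RAM runs out
  // lines.persist(StorageLevel.DISK_ONLY)       // keep the RDD only on disk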
RDDs vs. Distributed Shared Memory
Spark Programming Interface

 RDD abstraction through an API in Scala

 RDDs are statically typed objects parametrized by an element type

HOW TO USE?
 Write a driver program
 Define RDDs
 Invoke actions on them
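
A minimal driver-program sketch following the three steps above; the class name, app name and log path are illustrative, the SparkConf/SparkContext constructors are those of released Spark (which differ from the paper's prototype), and the master is assumed to be supplied by spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  object ErrorCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("ErrorCount"))

      val errors = sc.textFile("hdfs://host/logs")      // define an RDD (evaluated lazily)
                     .filter(_.contains("ERROR"))

      println(errors.count())                           // action: triggers the computation
      errors.take(10).foreach(println)                  // action: returns records to the driver

      sc.stop()
    }
  }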
RDD operations in Spark
Representing RDDs

 A common interface consisting of:


 A set of partitions (atomic pieces of the dataset)
 A set of dependencies on parent RDDs
 A function for computing the dataset based on its parents
 Metadata about partitioning scheme
 Data placement
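
A hedged Scala sketch of this common interface; the method names follow the slide, while the auxiliary types are simplified placeholders rather than Spark's real classes:

  final case class Partition(index: Int)                   // an atomic piece of the dataset
  sealed trait Dependency                                   // edge to a parent RDD (narrow or wide)

  trait SimpleRDD[T] {
    def partitions: Seq[Partition]                          // set of partitions
    def dependencies: Seq[Dependency]                       // set of dependencies on parent RDDs
    def compute(split: Partition): Iterator[T]              // compute a partition from its parents
    def partitioner: Option[String]                         // metadata about the partitioning scheme
    def preferredLocations(split: Partition): Seq[String]   // data-placement hints (e.g. HDFS hosts)
  }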
Representing RDDs
Dependencies between RDDs

 Types:
 Narrow: each partition of the parent RDD is used by at most one partition of the child RDD
 Wide: each partition of the parent RDD may be used by multiple partitions of the child RDD
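
A short sketch of how the dependency type follows from the transformation used (path and key layout are illustrative):

  val pairs = sc.textFile("hdfs://host/input")
                .map(line => (line.split(",")(0), 1))  // narrow: one parent partition per child partition

  val counts = pairs.reduceByKey(_ + _)                // wide: a child partition may read every parent partition
                                                       // (narrow only if the parent is already hash-partitioned by key)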
Example of dependencies
Implementation

 The system runs over the Mesos cluster manager


 Each Spark program runs as a separate Mesos application
 Mesos handles resource sharing between applications
Job Scheduling

 Similar to Dryad’s scheduler, but takes into account which partitions are available in memory
 Assigns tasks to machines based on data locality
 Not fault-tolerant to scheduler failures
DAG of stages
Example of Spark interpreter translation
Memory management

 Three options for storing persistent RDDs:
 in-memory storage as deserialized Java objects
 in-memory storage as serialized data
 on-disk storage
Evaluation

 Comparison with existing frameworks (Hadoop)


 Machine learning algorithms
 PageRank
 Fault recovery
 Behavior when data does not fit in memory
Iterative Machine Learning Applications
 Hadoop takes longer in the first iteration, partly due to the heartbeat protocol between its master and workers
 In later iterations, Spark is up to 20x faster; Hadoop remains slower due to:
 The minimum overhead of the Hadoop software stack
 The overhead of HDFS while serving data
 The deserialization cost of converting binary records to usable in-memory Java objects
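
A hedged sketch of the kind of iterative job benchmarked here: logistic regression with the training points persisted in memory across iterations. The input format, dims and iterations values are assumptions, not the paper's exact benchmark code:

  case class Point(x: Array[Double], y: Double)

  def parsePoint(line: String): Point = {              // assumed input format: "y x1 x2 ..."
    val nums = line.split("\\s+").map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  val dims = 10                                        // illustrative feature count
  val iterations = 10                                  // illustrative iteration count

  val points = sc.textFile("hdfs://host/points")
                 .map(parsePoint)
                 .persist()                            // reused every iteration, so keep it in RAM

  var w = Array.fill(dims)(0.0)
  for (_ <- 1 to iterations) {
    val gradient = points.map { p =>
      val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
      p.x.map(_ * ((1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y))
    }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    w = w.zip(gradient).map { case (wi, gi) => wi - gi }
  }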
PageRank
Fault recovery
Behavior with Insufficient Memory
Discussion

 The programming interface appears limited, due to RDDs' immutable nature and coarse-grained transformations

 Nevertheless, RDDs are suitable for a wide class of applications
 Advantages that result in optimizations:
 Keeping specific data in memory
 Partitioning the data to minimize communication (sketched after this list)
 Recovering from failures efficiently
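
A hedged sketch of the partitioning optimization, in the spirit of the paper's PageRank example: hash-partitioning the link table once so that later joins with the ranks RDD need no shuffle (input path and partition count are illustrative):

  import org.apache.spark.HashPartitioner

  val links = sc.textFile("hdfs://host/links")
                .map { line => val p = line.split("\\s+"); (p(0), p(1)) }   // (page, outlink)
                .groupByKey(new HashPartitioner(100))                       // one record per page, hash-partitioned
                .persist()

  val ranks = links.mapValues(_ => 1.0)       // mapValues preserves the partitioner
  // links.join(ranks) is now shuffle-free: both RDDs share the same partitioner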
Expressing existing programming models

 MapReduce – flatMap + groupByKey / reduceByKey (word count sketched after this list)


 DryadLINQ – bulk operators that correspond to RDD transformations (e.g. map, groupByKey, join)
 SQL – data-parallel operations on sets of records
 Pregel – a specialized model for iterative graph applications, where a user function is applied to all the vertices on each iteration
 Iterative MapReduce – HaLoop, Twister – easily expressed with RDDs
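
A hedged word-count sketch of the MapReduce bullet above, expressed with flatMap + reduceByKey (paths are illustrative):

  val counts = sc.textFile("hdfs://host/text")
                 .flatMap(_.split("\\s+"))             // "map" phase: emit words
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)                   // "reduce" phase: combine counts per key

  counts.saveAsTextFile("hdfs://host/wordcounts")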
Related work

 Data flow models such as MapReduce, Dryad, and CIEL – rich set of operators for processing data, but they share data between computations only through stable storage systems
 High-level programming interfaces for data flow systems (DryadLINQ,
FlumeJava) – provide language-integrated APIs. However, the parallel
collections represent files on disk.
 Caching systems (Nectar) can reuse intermediate results across DryadLINQ jobs
by identifying common subexpressions with program analysis. However, Nectar
does not provide in-memory caching, placing the data in a distributed file system
 Relational databases – RDDs are conceptually similar to views, and persistent RDDs resemble materialized views. However, RDDs do not allow fine-grained read-write access, which lets them avoid the overhead of logging operations and data that databases need for fault tolerance.
Conclusions

 RDDs are an efficient, general-purpose and fault-tolerant abstraction for sharing


data in cluster applications
 A wide range of parallel applications can be expressed with RDDs
 Although RDDs are limited by only using bulk operations, many parallel
applications are suited to this model
 In case of node failure, data can be efficiently recovered using lineage
 Spark, the paper's implementation of RDDs, outperforms Hadoop by up to 20x in iterative applications
