prezentareBD Tot
Implemented in Spark
Computing framework for iterative tasks
Exposed through a language-integrated API
Created from:
Data in stable storage
Other RDDs (using transformations)
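The two creation paths above can be sketched in plain Python (this is an illustrative model, not the real Spark API — the class names `StableSource` and `Mapped` are invented for the example). Each "RDD" is just a recipe for recomputing its data, i.e. its lineage.

```python
# Minimal sketch (plain Python, NOT the Spark API) of the two ways an RDD
# is created: from data in stable storage, or from another RDD via a
# transformation.

class StableSource:
    """RDD created from data in stable storage (modeled as a list)."""
    def __init__(self, data):
        self.data = list(data)

    def compute(self):
        return list(self.data)


class Mapped:
    """RDD created from another RDD via a transformation (here: map)."""
    def __init__(self, parent, fn):
        self.parent = parent  # lineage pointer to the parent RDD
        self.fn = fn

    def compute(self):
        # Recompute by replaying the transformation over the parent's data.
        return [self.fn(x) for x in self.parent.compute()]


base = StableSource([1, 2, 3])           # created from stable storage
derived = Mapped(base, lambda x: x * 2)  # created from another RDD
```

Note that `derived` holds no data of its own — only the lineage needed to rebuild it, which is what makes RDD fault recovery cheap.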
Persistence
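A hedged sketch of what persistence means here, again in plain Python rather than Spark's API: a persisted RDD caches its computed data in memory, and if the cache is lost, the data is rebuilt from lineage instead of from a replica. The class and counter below are invented for illustration.

```python
# Sketch (not Spark's API): a persisted RDD caches its data; on cache loss
# it falls back to recomputation from lineage.

class PersistentRDD:
    def __init__(self, recompute):
        self.recompute = recompute   # lineage: how to rebuild the data
        self.cache = None            # in-memory copy once computed
        self.recomputations = 0      # counts lineage replays (for the demo)

    def collect(self):
        if self.cache is None:       # cache miss: replay the lineage
            self.recomputations += 1
            self.cache = self.recompute()
        return self.cache

    def evict(self):
        self.cache = None            # simulate losing the in-memory data
```

Repeated `collect()` calls hit the cache; only after `evict()` does the lineage run again — mirroring how Spark recovers a lost in-memory partition.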
HOW TO USE?
Write a driver program
Define RDDs
Invoke actions on them
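The driver-program pattern above can be imitated with Python generators standing in for Spark (an analogy, not the real API): transformations only describe the computation and run nothing; the action triggers the actual work.

```python
# Driver-program sketch: define the dataset, apply a lazy transformation,
# then invoke an action. Generators model Spark's lazy evaluation.

evaluations = 0

def square(x):
    global evaluations
    evaluations += 1
    return x * x

data = range(1, 6)                    # dataset defined in the driver
squares = (square(x) for x in data)   # transformation: lazy, nothing runs yet
assert evaluations == 0               # still just a description
result = sum(squares)                 # action: forces the computation
```

Only when `sum` (the action) runs does `square` execute — the same contract Spark's transformations and actions follow.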
RDD operations in Spark
Representing RDDs
Dependency types:
Narrow: each partition of the parent RDD is used by at most one partition of the child RDD
Wide: each partition of the parent RDD is used by multiple partitions of the child RDD
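The two dependency types can be contrasted on data modeled as a list of partitions (a plain-Python illustration; the function names are invented): a map is narrow, since child partition i reads only parent partition i, while a groupByKey-style operation is wide, since every child partition may read every parent partition (a shuffle).

```python
# Narrow dependency: one-to-one, child partition i <- parent partition i.
def map_narrow(partitions, fn):
    return [[fn(x) for x in part] for part in partitions]

# Wide dependency: each output partition gathers its keys from ALL input
# partitions, i.e. a shuffle. Keys are hashed to pick the child partition.
def group_by_key_wide(partitions, num_partitions):
    out = [{} for _ in range(num_partitions)]
    for part in partitions:                      # reads every parent partition
        for key, value in part:
            out[hash(key) % num_partitions].setdefault(key, []).append(value)
    return [sorted(d.items()) for d in out]
```

This locality difference is why narrow dependencies can be pipelined on one node while wide ones require moving data between nodes.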
Example of dependencies
Implementation
The scheduler is similar to Dryad’s, but takes into account which partitions are available in memory
Assigns tasks to machines based on data locality
Is not fault tolerant to scheduler failures
DAG of stages
Example of Spark interpreter translation
Memory management
Related work
Data flow models such as MapReduce, Dryad, and Ciel provide a rich set of operators for processing data, but share it only through stable storage systems
High-level programming interfaces for data flow systems (DryadLINQ, FlumeJava) provide language-integrated APIs. However, their parallel collections represent files on disk.
Caching systems (Nectar) can reuse intermediate results across DryadLINQ jobs by identifying common subexpressions with program analysis. However, Nectar does not provide in-memory caching; it places the data in a distributed file system.
Relational databases: RDDs are conceptually similar to views in a database, and persistent RDDs resemble materialized views. However, RDDs do not allow fine-grained read-write access, which lets them avoid the overhead of logging operations and data for fault tolerance.
Conclusions