
This paper appears in IEEE Transactions on Parallel and Distributed Systems; an earlier version appeared at SC-MTAGS'09.
In recent years, ad hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks currently in use were designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines, which are automatically instantiated and terminated during job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.

Who is Nephele Software?


Posted on December 26, 2010 by nephele

Nephele Software provides cloud-based applications that ensure that data can only be accessed and read by its owner. Sending data to the cloud gives you access to your data from anywhere in the world. Your data in the cloud is also backed up and immune to computer crashes, fires, and other disasters. However, these benefits come at a cost: you no longer control who can read your data. Although most cloud-service providers secure your data against external access, what about internal access? Why should you trust your data to someone else? The fact is, you should not. The fact is, you don't have to. We at Nephele Software want to provide you with the peace of mind that your data is secure against unauthorized access, while also taking advantage of all of the benefits that cloud computing provides. Our software ensures that you, and you alone, may read your data. No one else can open your documents, view your pictures, or read your email. You control the access; what a novel concept.

We look forward to introducing you to the new cloud. The secure cloud. Where all of your data is yours and yours alone. Data. Secure. Everywhere.


Nephele and scheduling in the cloud


A summary of "Nephele: Efficient Parallel Data Processing in the Cloud".

Goal: a data processing framework with support for dynamic allocation and de-allocation of different computational resources in the cloud.

Compute resources available in a cloud environment are highly dynamic and possibly heterogeneous. In addition, the network topology is hidden, so scheduling optimizations based on knowledge of the distance to a particular rack or server are impossible.

Topology

A job graph is a DAG of tasks connected by edges. Tasks process records implementing a common interface. A task may have an arbitrary number of input and output gates through which records enter and leave the task. A task can be seen as a set of parallel subtasks processing different partitions of the data. By default, each subtask is assigned to a dedicated server.
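The job graph structure above can be sketched with a small data model. This is a minimal sketch with hypothetical names; Nephele itself is a Java framework, and none of these identifiers come from its actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A vertex of the job graph; records enter and leave through gates."""
    name: str
    parallelism: int = 1  # how many parallel subtasks the task is split into

@dataclass
class JobGraph:
    """A DAG of tasks connected by directed edges."""
    tasks: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (producer, consumer) pairs

    def add_task(self, task: Task) -> None:
        self.tasks[task.name] = task

    def connect(self, producer: str, consumer: str) -> None:
        self.edges.append((producer, consumer))

    def subtasks(self, name: str) -> list:
        # A task is a set of parallel subtasks, each processing one
        # partition of the data; by default each subtask would be
        # assigned to a dedicated server.
        return [f"{name}[{i}]" for i in range(self.tasks[name].parallelism)]

# Example: a three-task pipeline, with the middle task split four ways.
g = JobGraph()
for t in (Task("input"), Task("map", parallelism=4), Task("output")):
    g.add_task(t)
g.connect("input", "map")
g.connect("map", "output")
```

Here `g.subtasks("map")` expands to four parallel subtasks of the "map" task.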

A job graph is transformed into an execution graph by the job manager. The execution graph has two levels of detail:

- The abstract level describes the job execution on a task level (without parallelization) and the scheduling of instance allocation/deallocation. A Group Vertex is created for every Job Graph vertex to control the set of its subtasks. The edges between Group Vertices are ephemeral and do not represent any physical communication paths.
- The concrete level defines the mapping of subtasks to servers and the communication channels between them. An Execution Vertex is created for each subtask and is always controlled by its corresponding Group Vertex. Execution Vertices are connected by channels.

Channel types

All edges of an Execution Graph are replaced by a channel before processing can begin. There are three channel types:

- A network channel is based on a TCP socket connection. Two subtasks connected via a network channel can be executed on different instances. Since they must be executed at the same time, they are required to run in the same Execution Stage.
- An in-memory channel uses the server's memory to buffer data. The two connected subtasks must be scheduled to run on the same instance and in the same Execution Stage.
- A file channel allows two subtasks to exchange records via the local file system. The two subtasks are assigned to the same instance, and the consuming Group Vertex must be scheduled to run in a later Execution Stage than the producing Group Vertex.

Subtasks must exchange records across different stages via file channels because they are the only channel type which stores the intermediate records in a persistent manner.

Execution Stage Scheduling

The requested server types may be temporarily unavailable in the cloud, but for cost efficiency servers should ideally be allocated just before they can be used. The Execution Graph is therefore split into one or more Execution Stages:

- When the processing of a stage begins, all servers required within the stage are allocated.
- All subtasks included in the stage are sent to the corresponding Task Managers and made ready to receive records.
- Before the processing of a new stage begins, all intermediate results of its preceding stages are stored in a persistent manner. An Execution Stage is therefore similar to a checkpoint: a job can be interrupted and resumed later, after a stage has been completed.

The user can provide manual hints to change the default scheduling behavior:

- into how many parallel subtasks a task should be split at runtime
- how many subtasks can share the same server
- which execution groups can share servers
- the channel type of each edge
- the server type required by a task (to characterize its hardware requirements)

Server type support

Server types are simple string identifiers such as "m1.small". The scheduler is given a list of available server types and their cost per time unit. Each task can be executed on its own server type. To support this, each subtask must be mapped to an Execution Instance. An Execution Instance has an ID and a server type representing the hardware characteristics.
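As a rough illustration of this mapping, the scheduler's view might look like the sketch below. The type names and prices are invented for the example, not taken from the paper:

```python
# Hypothetical server types with a cost per time unit, as the scheduler
# might receive them; the identifiers and prices are illustrative only.
server_types = {
    "m1.small": 0.10,   # cost per hour
    "c1.xlarge": 0.80,
}

def make_execution_instance(instance_id: str, server_type: str) -> dict:
    """An Execution Instance: an ID plus the server type that stands in
    for the instance's hardware characteristics."""
    if server_type not in server_types:
        raise ValueError(f"unknown server type: {server_type}")
    return {"id": instance_id, "type": server_type}

inst = make_execution_instance("i-1", "m1.small")
cost_for_three_hours = server_types[inst["type"]] * 3
```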

Before beginning to process a new Execution Stage, the scheduler collects all Execution Instances from that stage and tries to replace them with matching cloud instances. If all required instances can be allocated, the subtasks are sent to the corresponding servers and set up for execution.
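This all-or-nothing matching step can be sketched as follows. It is an assumed simplification in which instances are matched by type and count only:

```python
from collections import Counter

def allocate_stage(required_types, available_pool):
    """Return instance assignments if every required server type can be
    matched against the pool, otherwise None (the stage cannot start yet)."""
    pool = Counter(available_pool)
    needed = Counter(required_types)
    # First verify everything is available: allocation is all-or-nothing.
    for server_type, count in needed.items():
        if pool[server_type] < count:
            return None  # a required type is temporarily unavailable
    assignments = []
    for server_type, count in needed.items():
        for _ in range(count):
            assignments.append(server_type)
            pool[server_type] -= 1
    return assignments

# The stage needs two m1.small instances and one c1.xlarge instance.
ok = allocate_stage(["m1.small", "m1.small", "c1.xlarge"],
                    ["m1.small", "m1.small", "m1.small", "c1.xlarge"])
fail = allocate_stage(["c1.xlarge"], ["m1.small"])  # type unavailable
```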

Nephele keeps track of server allocation time to minimize costs when usage is charged by the hour. An idle server of a particular type is not immediately deallocated if a server of the same type is required in an upcoming Execution Stage. It is kept allocated until the end of its current lease period. If the next Execution Stage begins before the end of that period, the server is reassigned to the Execution Vertex of that stage. Otherwise the server is deallocated in time not to cause any additional cost.
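The lease-aware deallocation policy described above can be sketched as a simple predicate, assuming hourly billing (the time unit and function names are illustrative):

```python
LEASE = 60  # minutes per billing period (assumed hourly charging)

def should_keep(allocated_at, now, next_stage_start):
    """Keep an idle server if the next stage needing its type begins
    before the end of the current lease period; otherwise deallocate
    it in time not to incur another billing period."""
    minutes_used = now - allocated_at
    lease_ends = allocated_at + ((minutes_used // LEASE) + 1) * LEASE
    return next_stage_start is not None and next_stage_start <= lease_ends

# Server allocated at t=0, now t=40: its current paid hour ends at t=60.
keep = should_keep(0, 40, next_stage_start=55)  # next stage fits the lease
drop = should_keep(0, 40, next_stage_start=90)  # would cost an extra hour
```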
