
Parallel & Distributed Computing

Computer Cluster, MapReduce, Hadoop

What is Serial Computing?


Traditionally, software has been written for serial computation:
o To be run on a single computer having a single Central Processing Unit (CPU)
o A problem is broken into a discrete series of instructions
o Instructions are executed one after another
o Only one instruction may execute at any moment in time

Serial Computing

What is Parallel Computing?


In the simplest sense, the simultaneous use of multiple computing resources to solve a computational problem:
o Run using multiple CPUs in a single computer
o Problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs

All processors may have access to a shared memory for exchanging information
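To make the shared-memory picture concrete, here is a minimal Java sketch (an illustration added for this text, not part of the original slides): a sum over one shared array is split across all available CPU cores with a parallel stream.

    import java.util.stream.IntStream;

    public class ParallelSum {
        public static void main(String[] args) {
            int[] data = new int[10_000_000];
            for (int i = 0; i < data.length; i++) data[i] = i % 100;

            // The problem (summing the array) is broken into chunks that the
            // runtime executes concurrently on the available CPU cores; all
            // threads read the same shared array in memory.
            long total = IntStream.range(0, data.length)
                                  .parallel()
                                  .mapToLong(i -> data[i])
                                  .sum();
            System.out.println("sum = " + total);
        }
    }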

Parallel Computing

Uses for Parallel Computing (1)


Science and Engineering:
o Atmosphere, Earth, Environment
o Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
o Bioscience, Biotechnology, Genetics
o Chemistry, Molecular Sciences
o Geology, Seismology
o Mechanical Engineering - from prosthetics to spacecraft
o Electrical Engineering, Circuit Design, Microelectronics
o Computer Science, Mathematics

Uses for Parallel Computing (2)


Industrial and Commercial:
o Databases, data mining
o Oil exploration
o Web search engines, web-based business services
o Medical imaging and diagnosis
o Pharmaceutical design
o Financial and economic modeling
o Management of national and multi-national corporations
o Advanced graphics and virtual reality, particularly in the entertainment industry
o Networked video and multi-media technologies
o Collaborative work environments

What is Distributed Computing?


A field of computer science that studies distributed systems
Purpose is to coordinate the use of shared resources
Run using multiple CPUs across many computers
Problem is divided into many tasks, each of which is solved by one or more computers
A collection of independent computers that appears to its users as a single coherent system
System where hardware/software components located at networked computers communicate and coordinate their actions only by message passing (see the sketch below)
Each processor has its own private memory (distributed memory)
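To make the "only by message passing" point concrete, here is a minimal, self-contained Java sketch (an illustration, not from the original slides) in which two components coordinate purely by exchanging messages over a socket rather than through shared memory:

    import java.io.*;
    import java.net.*;

    public class MessagePassingDemo {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {
                int port = server.getLocalPort();
                // "Server" node: owns its own private state and exposes it
                // only through messages.
                Thread node = new Thread(() -> {
                    try (Socket s = server.accept();
                         BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        String request = in.readLine();   // receive a message
                        out.println("echo: " + request);  // reply with a message
                    } catch (IOException e) { e.printStackTrace(); }
                });
                node.start();

                // "Client" node: coordinates with the server node only by
                // sending and receiving messages over the network.
                try (Socket s = new Socket("localhost", port);
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                     BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
                    out.println("hello");
                    System.out.println(in.readLine()); // prints "echo: hello"
                }
                node.join();
            }
        }
    }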

Distributed Computing

Distributed & Parallel Systems Differences

Distributed vs. Parallel Computing (comparison table)

Computer Cluster

What is a Computer Cluster?


A rapidly growing trend that has emerged as a type of parallel or distributed processing system
Consists of a set of loosely connected computers that work together so that they can be viewed as a single system
Components are connected to each other through fast local area networks, with each node running its own instance of an operating system
Activities of the computing nodes are orchestrated by "clustering middleware"

Computer Cluster Architecture

Computer Cluster Configuration

Why Computer Cluster?


More computing horsepower and better reliability by orchestrating a number of low-cost commercial off-the-shelf computers
Deployed to improve performance and availability over that of a single computer
A more cost-effective alternative to single computers of comparable speed or availability
Relies on a centralized management approach that makes the nodes available as orchestrated shared servers

Cluster Problems to Solve


The largest and most important problem is software skew
o When the software configuration on some nodes is different than on others
o Small differences (minor version numbers on libraries) can cripple a parallel program
The second most important problem is adequate job control of the parallel process
o Signal propagation
o Cleanup
Clusters can be hard to manage without experience
The difficulty of determining where something has failed increases linearly as cluster size increases

Beowulf Cluster Design


Beowulf cluster: a basic approach to building a cluster
In a Beowulf system, application programs never see the computational (slave) nodes
They interact only with the "Master", a specific computer handling the scheduling and management of the slaves
The Master has two network interfaces
o One communicates with the private Beowulf network for the slaves
o The other connects to the general-purpose network of the organization

Beowulf Cluster Configuration

Cluster Task Scheduling


When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes a challenge
In a complex application environment, the performance of each job depends on the characteristics of the underlying cluster, so mapping tasks onto CPU cores presents significant challenges
This is an area of ongoing research; algorithms that combine and extend MapReduce and Hadoop have been proposed, studied, and implemented

MapReduce

What is MapReduce?
A framework developed by Google for processing parallelizable problems across large data sets using a large cluster of computers (nodes)
A programming paradigm that splits a task into smaller subtasks which can be executed in parallel and therefore run faster than on a single computer
Simplifies data processing on large clusters: a large server farm can use MapReduce to sort a petabyte of data in only a few hours
Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured)

MapReduce Overview

Word Count Application - MapReduce

Word Count MapReduce Example


Counts the appearance of each word in a set of documents:

    function map(String name, String document) {
        // name: document name
        // document: document contents
        for each word w in document {
            emit(w, 1)
        }
    }

    function reduce(String word, Iterator partialCounts) {
        // word: a word
        // partialCounts: a list of aggregated partial counts
        sum = 0
        for each pc in partialCounts {
            sum += pc
        }
        emit(word, sum)
    }
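For comparison with the pseudocode above, the same word count can be written against Hadoop's Java MapReduce API; this sketch closely follows the standard WordCount example from the Apache Hadoop tutorial:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the partial counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }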

MapReduce Architecture
Job Scheduling System
o Jobs are made up of tasks; the master scheduler assigns tasks to slave machines (nodes), making work easy to distribute across nodes
o Input and final output are stored on a distributed file system
o Master pings slaves periodically to detect failures
o Slaves send heartbeats back to the master periodically
o Master responds with a task if a slot is free, picking the task with data closest to the node (a simplified sketch follows)
o Takes advantage of locality of data, processing data on or near the storage assets to decrease transmission of data
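The locality rule in the last two bullets can be sketched in a few lines of Java; this is a simplified illustration with hypothetical Task and pick names, not Hadoop's actual scheduler code:

    import java.util.*;

    class Task {
        String id;
        Set<String> dataNodes; // nodes holding a replica of this task's input split
        Task(String id, Set<String> dataNodes) { this.id = id; this.dataNodes = dataNodes; }
    }

    class LocalityScheduler {
        // When a heartbeat reports a free slot on 'node', prefer a task whose
        // input data already lives on that node; otherwise fall back to any task.
        static Task pick(String node, List<Task> pending) {
            for (Task t : pending) {
                if (t.dataNodes.contains(node)) return t; // data-local task
            }
            return pending.isEmpty() ? null : pending.get(0); // remote fallback
        }
    }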

Automatic parallelization & distribution


o Allows for distributed processing of the map and reduction operations
o If each mapping operation is independent of the others, all maps can be performed in parallel
o A set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time

Fault Tolerance
o Parallelism offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled

MapReduce Programming Model


Inspired by the map and reduce functions commonly used in Lisp and other functional programming languages, although their purpose in the MapReduce framework is not the same as in their original forms
The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs
MapReduce libraries have been written in many programming languages; a popular free implementation is Apache Hadoop

MapReduce Distributed Execution

MapReduce Dataflow
Consists of a single master JobTracker and one slave TaskTracker per cluster node
Dataflow: an input reader → a Map function → a partition function → a compare function → a Reduce function → an output writer

MapReduce Dataflow

MapReduce Input Reader


Divides the input into appropriately sized 'splits' (in practice, typically 16 MB to 128 MB)
The framework assigns one split to each Map function
The reader reads data from stable storage (typically a distributed file system) and generates key/value pairs
A common example reads a directory full of text files and returns each line as a record (sketched below)
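A standalone sketch of such a reader (illustrative Java, not the Hadoop API itself), emitting one (file name, line) pair per record:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    public class TextInputReader {
        // Reads every regular file in a directory from stable storage and
        // generates key/value pairs: the file name as key, each line as value.
        public static void readRecords(Path dir) throws IOException {
            try (Stream<Path> files = Files.list(dir).filter(Files::isRegularFile)) {
                for (Path file : (Iterable<Path>) files::iterator) {
                    for (String line : Files.readAllLines(file)) {
                        emit(file.getFileName().toString(), line);
                    }
                }
            }
        }

        static void emit(String key, String value) {
            // Stand-in for handing the pair to a Map task
            System.out.println(key + "\t" + value);
        }
    }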

MapReduce Map Function


The master node takes the input, divides it into smaller subproblems, and distributes them to worker nodes
A worker node may do this again in turn, leading to a multi-level tree structure
The worker node processes the smaller problem and passes the answer back to its master node
Map takes one pair of data with a type in one data domain and returns a list of pairs in a different domain: Map(k1, v1) → list(k2, v2)
The Map function is applied in parallel to every pair in the input dataset, producing a list of pairs for each call
The MapReduce framework then collects all pairs with the same key from all lists and groups them together, creating one group for each of the different generated keys

MapReduce Partition Function


Each Map function output is allocated to a particular reducer for horizontal partitioning purposes
The partition function is given the key and the number of reducers and returns the index of the desired reducer
The typical default is to hash the key and take it modulo the number of reducers (see the sketch after this list)
It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers to finish
Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced
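A minimal Java rendering of that default rule; the masking trick below matches what Hadoop's HashPartitioner does to keep the hash non-negative:

    // Returns the index of the reducer that will receive this key.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }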

MapReduce Key Comparison Function


The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function
The key comparison function is also used to sort the final emitted outputs of Reduce before returning the list of result keys
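A small illustration (plain Java, not the framework's internals) of how the comparison function determines the order in which grouped keys reach Reduce:

    import java.util.*;

    public class SortedGrouping {
        public static void main(String[] args) {
            // Grouped map outputs are kept sorted by the application's
            // comparison function, so Reduce sees keys in that order;
            // here: plain lexicographic order.
            TreeMap<String, List<Integer>> grouped = new TreeMap<>(Comparator.naturalOrder());
            grouped.computeIfAbsent("banana", k -> new ArrayList<>()).add(1);
            grouped.computeIfAbsent("apple", k -> new ArrayList<>()).add(1);
            System.out.println(grouped.firstKey()); // "apple" reaches Reduce first
        }
    }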

MapReduce Reduce Step


The master node collects the answers to all the sub-problems and combines them to form the output: the answer to the problem it was originally trying to solve
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value or an empty return, which is collected as the desired result list

MapReduce Output Writer


Writes the output of the Reduce to stable storage, usually a distributed file system

MapReduce Parallel Execution

MapReduce Distribution & Reliability


Achieves reliability by parceling out a number of operations on the set of data to each node in the network
Each node is expected to report back periodically with completed work and status updates; if a node falls silent for longer than that interval, the master node records the node as dead and sends out the node's assigned work to other nodes (a simplified sketch follows)
The master node attempts to schedule reduce operations on the same node, or in the same rack, as the node holding the data being operated on; this is desirable because it conserves bandwidth across the backbone network of the datacenter
Implementations are not necessarily highly reliable; for example, in Hadoop the NameNode is a single point of failure for the distributed file system
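The failure-detection rule can be sketched as follows (illustrative Java with hypothetical names and an assumed timeout value, not an actual implementation):

    import java.util.*;

    class FailureDetector {
        static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed reporting interval
        final Map<String, Long> lastHeartbeat = new HashMap<>();

        void onHeartbeat(String node) {
            lastHeartbeat.put(node, System.currentTimeMillis());
        }

        // Nodes silent for longer than the interval are recorded as dead;
        // the master would then resend their assigned work to other nodes.
        List<String> deadNodes() {
            long now = System.currentTimeMillis();
            List<String> dead = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) dead.add(e.getKey());
            }
            return dead;
        }
    }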

MapReduce Uses
MapReduce helps organizations process and analyze large volumes of multi-structured data that are often difficult to handle using the standard SQL employed by relational DBMSs
Uses include: distributed pattern-based searching, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation
The MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments
At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses

Hadoop

What is Hadoop?
An open-source Java software framework from the Apache Software Foundation that supports data-intensive distributed (file system) applications
Addresses the problem of tremendous amounts of data that need to be analyzed and processed very quickly
Allows for the distributed processing of large data sets across clusters of computers using simple programming models
Delivers a highly available service on top of a cluster of computers
Designed to scale up from single servers to thousands of machines and to detect and handle failures at the application layer

Hadoop Main Components

What is Hive?
Hive has gained the most acceptance in the industry
Main benefit: it dramatically improves the simplicity and speed of MapReduce development
Its SQL-like syntax makes it easy to use for non-programmers who are comfortable with SQL
HiveQL statements are entered using a command line or Web interface, or may be embedded in applications that use ODBC and JDBC interfaces to the Hive system (see the sketch after this list)
The Hive Driver system converts the query statements into a series of MapReduce jobs
Data files in Hive are presented as tables (and views), but Hive does not support primary or foreign keys or constraints of any type
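Since HiveQL can be submitted through JDBC, a minimal Java sketch of doing so is shown below; the host, port, credentials, and table name are placeholders, and the HiveServer2 JDBC driver (org.apache.hive.jdbc.HiveDriver) is assumed to be on the classpath:

    import java.sql.*;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2's JDBC URL form; host and port are placeholders.
            String url = "jdbc:hive2://hive-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement();
                 // The Hive Driver system turns this SQL-like statement
                 // into a series of MapReduce jobs on the cluster.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM word_log GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }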

Hive Main Components

Hadoop Architecture

Hadoop Cluster Architecture


To eliminate overhead impeding performance
o No server virtualization o No hypervisor layer

Runs best on Linux machines, working directly with the underlying hardware
Utilizes rack servers (not blades) populated in racks connected to a top-of-rack switch
The majority of the servers will be Slave nodes with lots of local disk storage and moderate amounts of CPU and DRAM
Some machines will be Master nodes, which might have a slightly different configuration favoring more DRAM and CPU and less local storage

Hadoop

Hadoop Deployment Machine Roles


Client machines: load data into the cluster, submit MapReduce jobs, and then retrieve the results of the job when it is finished
Master nodes: oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS) and running parallel computations on all that data (MapReduce)
Name Node: oversees, coordinates, and controls the data storage function (HDFS), and manages access control
Job Tracker: hands out tasks to the slave nodes, and oversees and coordinates the parallel processing of data using MapReduce
Slave nodes: make up the majority of machines and do the work of storing the data and running the computations
o Each slave runs both a Data Node and a Task Tracker daemon that communicate with and receive instructions from their master nodes
o The Task Tracker daemon is a slave to the Job Tracker; the Data Node daemon is a slave to the Name Node

Hadoop MapReduce Job Flow

Hadoop MapReduce Job Flow


The driver program submits the job configuration to the JobTracker node
JobTracker: splits the job into individual tasks and submits them to the respective TaskTrackers, which reside on the DataNodes themselves, or at least on the same rack where the data is present
TaskTracker: receives its share of input data and starts processing the map function specified by the configuration
When all Map tasks are completed, the JobTracker asks the TaskTrackers to start processing the reduce function
The JobTracker deals with failed and unresponsive tasks by running backup tasks; whichever node completes first gets its output accepted
When both the Map and Reduce tasks are completed, the JobTracker notifies the client program and dumps the output into the specified output directory
