
Parallel & Distributed Computing

Computer Cluster, MapReduce, Hadoop

What is Serial Computing?


Traditionally, software has been written for serial computation:
o To be run on a single computer having a single Central Processing Unit (CPU)
o A problem is broken into a discrete series of instructions
o Instructions are executed one after another
o Only one instruction may execute at any moment in time

Serial Computing

What is Parallel Computing?


In the simplest sense, the simultaneous use of multiple computing resources to solve a computational problem:
o Run using multiple CPUs in a single computer
o Problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs

All processors may have access to a shared memory for exchanging information
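To make the shared-memory picture concrete, here is a minimal Java sketch (an illustration added for this text, not part of the original slides): a sum over one shared array is split across all available CPU cores with a parallel stream.

    import java.util.stream.IntStream;

    public class ParallelSum {
        public static void main(String[] args) {
            int[] data = new int[10_000_000];
            for (int i = 0; i < data.length; i++) data[i] = i % 100;

            // The problem (summing the array) is broken into chunks that the
            // runtime executes concurrently on the available CPU cores; all
            // threads read the same shared array in memory.
            long total = IntStream.range(0, data.length)
                                  .parallel()
                                  .mapToLong(i -> data[i])
                                  .sum();
            System.out.println("sum = " + total);
        }
    }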

Parallel Computing

Uses for Parallel Computing (1)


Science and Engineering:
o Atmosphere, Earth, Environment
o Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
o Bioscience, Biotechnology, Genetics
o Chemistry, Molecular Sciences
o Geology, Seismology
o Mechanical Engineering - from prosthetics to spacecraft
o Electrical Engineering, Circuit Design, Microelectronics
o Computer Science, Mathematics

Uses for Parallel Computing (2)


Industrial and Commercial:
o Databases, data mining
o Oil exploration
o Web search engines, web-based business services
o Medical imaging and diagnosis
o Pharmaceutical design
o Financial and economic modeling
o Management of national and multi-national corporations
o Advanced graphics and virtual reality, particularly in the entertainment industry
o Networked video and multi-media technologies
o Collaborative work environments

What is Distributed Computing?


A field of computer science that studies distributed systems
Purpose is to coordinate the use of shared resources
Run using multiple CPUs across many computers
Problem is divided into many tasks, each of which is solved by one or more computers
A collection of independent computers that appears to its users as a single coherent system
System where hardware/software components located at networked computers communicate and coordinate their actions only by message passing (see the sketch below)
Each processor has its own private memory (distributed memory)
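To make the "only by message passing" point concrete, here is a minimal, self-contained Java sketch (an illustration, not from the original slides) in which two components coordinate purely by exchanging messages over a socket rather than through shared memory:

    import java.io.*;
    import java.net.*;

    public class MessagePassingDemo {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {
                int port = server.getLocalPort();
                // "Server" node: owns its own private state and exposes it
                // only through messages.
                Thread node = new Thread(() -> {
                    try (Socket s = server.accept();
                         BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        String request = in.readLine();   // receive a message
                        out.println("echo: " + request);  // reply with a message
                    } catch (IOException e) { e.printStackTrace(); }
                });
                node.start();

                // "Client" node: coordinates with the server node only by
                // sending and receiving messages over the network.
                try (Socket s = new Socket("localhost", port);
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                     BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
                    out.println("hello");
                    System.out.println(in.readLine()); // prints "echo: hello"
                }
                node.join();
            }
        }
    }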

Distributed Computing

Distributed & Parallel Systems Differences

Distributed vs. Parallel Computing (comparison table)

Computer Cluster

What is a Computer Cluster?


A rapidly growing trend that has emerged as a type of parallel or distributed processing system
Consists of a set of loosely connected computers that work together so that they can be viewed as a single system
Components are connected to each other through fast local area networks, with each node running its own instance of an operating system
Activities of the computing nodes are orchestrated by "clustering middleware"

Computer Cluster Architecture

Computer Cluster Configuration

Why Computer Cluster?


More computing horsepower and better reliability by orchestrating a number of low-cost commercial off-the-shelf computers
Deployed to improve performance and availability over that of a single computer
A more cost-effective alternative to single computers of comparable speed or availability
Relies on a centralized management approach that makes the nodes available as orchestrated shared servers

Cluster Problems to Solve


The largest and most important problem is software skew
o When the software configuration on some nodes is different than on others
o Small differences (minor version numbers on libraries) can cripple a parallel program
The second most important problem is adequate job control of the parallel process
o Signal propagation
o Cleanup
Clusters can be hard to manage without experience
The difficulty of determining where something has failed increases linearly as cluster size increases

Beowulf Cluster Design


Beowulf cluster: a basic approach to building a cluster
In a Beowulf system, application programs never see the computational (slave) nodes
They interact only with the "Master", a specific computer handling the scheduling and management of the slaves
The Master has two network interfaces
o One communicates with the private Beowulf network for the slaves
o The other connects to the general-purpose network of the organization

Beowulf Cluster Configuration

Cluster Task Scheduling


When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes a challenge
In a complex application environment, the performance of each job depends on the characteristics of the underlying cluster, so mapping tasks onto CPU cores presents significant challenges
This is an area of ongoing research; algorithms that combine and extend MapReduce and Hadoop have been proposed, studied, and implemented

MapReduce

What is MapReduce?
A framework developed by Google for processing parallelizable problems across large data sets using a large cluster of computers (nodes)
A programming paradigm that splits a task into smaller subtasks which can be executed in parallel and therefore run faster than on a single computer
Simplifies data processing on large clusters: a large server farm can use MapReduce to sort a petabyte of data in only a few hours
Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured)

MapReduce Overview

Word Count Application - MapReduce

Word Count MapReduce Example


Counts the appearance of each word in a set of documents:

    function map(String name, String document) {
        // name: document name
        // document: document contents
        for each word w in document {
            emit(w, 1)
        }
    }

    function reduce(String word, Iterator partialCounts) {
        // word: a word
        // partialCounts: a list of aggregated partial counts
        sum = 0
        for each pc in partialCounts {
            sum += pc
        }
        emit(word, sum)
    }
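For comparison with the pseudocode above, the same word count can be written against Hadoop's Java MapReduce API; this sketch closely follows the standard WordCount example from the Apache Hadoop tutorial:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the partial counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }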

MapReduce Architecture
Job Scheduling System
o Jobs are made up of tasks; the master scheduler assigns tasks to slave machines (nodes), making work easy to distribute across nodes
o Input and final output are stored on a distributed file system
o Master pings slaves periodically to detect failures
o Slaves send heartbeats back to the master periodically
o Master responds with a task if a slot is free, picking the task with data closest to the node (a simplified sketch follows)
o Takes advantage of locality of data, processing data on or near the storage assets to decrease transmission of data
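The locality rule in the last two bullets can be sketched in a few lines of Java; this is a simplified illustration with hypothetical Task and pick names, not Hadoop's actual scheduler code:

    import java.util.*;

    class Task {
        String id;
        Set<String> dataNodes; // nodes holding a replica of this task's input split
        Task(String id, Set<String> dataNodes) { this.id = id; this.dataNodes = dataNodes; }
    }

    class LocalityScheduler {
        // When a heartbeat reports a free slot on 'node', prefer a task whose
        // input data already lives on that node; otherwise fall back to any task.
        static Task pick(String node, List<Task> pending) {
            for (Task t : pending) {
                if (t.dataNodes.contains(node)) return t; // data-local task
            }
            return pending.isEmpty() ? null : pending.get(0); // remote fallback
        }
    }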

Automatic parallelization & distribution


o Allows for distributed processing of the map and reduction operations
o If each mapping operation is independent of the others, all maps can be performed in parallel
o A set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time

Fault Tolerance
o Parallelism offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled

MapReduce Programming Model


Inspired by the map and reduce functions commonly used in Lisp and other functional programming languages, although their purpose in the MapReduce framework is not the same as in their original forms
The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs
MapReduce libraries have been written in many programming languages; a popular free implementation is Apache Hadoop

MapReduce Distributed Execution

MapReduce Dataflow
Consists of a single master JobTracker and one slave TaskTracker per cluster node
Dataflow: an input reader → a Map function → a partition function → a compare function → a Reduce function → an output writer

MapReduce Dataflow

MapReduce Input Reader


Divides the input into appropriately sized 'splits' (in practice, typically 16 MB to 128 MB)
The framework assigns one split to each Map function
The reader reads data from stable storage (typically a distributed file system) and generates key/value pairs
A common example reads a directory full of text files and returns each line as a record (sketched below)
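A standalone sketch of such a reader (illustrative Java, not the Hadoop API itself), emitting one (file name, line) pair per record:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    public class TextInputReader {
        // Reads every regular file in a directory from stable storage and
        // generates key/value pairs: the file name as key, each line as value.
        public static void readRecords(Path dir) throws IOException {
            try (Stream<Path> files = Files.list(dir).filter(Files::isRegularFile)) {
                for (Path file : (Iterable<Path>) files::iterator) {
                    for (String line : Files.readAllLines(file)) {
                        emit(file.getFileName().toString(), line);
                    }
                }
            }
        }

        static void emit(String key, String value) {
            // Stand-in for handing the pair to a Map task
            System.out.println(key + "\t" + value);
        }
    }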

MapReduce Map Function


The master node takes the input, divides it into smaller subproblems, and distributes them to worker nodes
A worker node may do this again in turn, leading to a multi-level tree structure
The worker node processes the smaller problem and passes the answer back to its master node
Map takes one pair of data with a type in one data domain and returns a list of pairs in a different domain: Map(k1, v1) → list(k2, v2)
The Map function is applied in parallel to every pair in the input dataset, producing a list of pairs for each call
The MapReduce framework then collects all pairs with the same key from all lists and groups them together, creating one group for each of the different generated keys

MapReduce Partition Function


Each Map function output is allocated to a particular reducer for horizontal partitioning purposes
The partition function is given the key and the number of reducers and returns the index of the desired reducer
The typical default is to hash the key and take it modulo the number of reducers (see the sketch after this list)
It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers to finish
Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced
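A minimal Java rendering of that default rule; the masking trick below matches what Hadoop's HashPartitioner does to keep the hash non-negative:

    // Returns the index of the reducer that will receive this key.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }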

MapReduce Key Comparison Function


The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function
The key comparison function is also used to sort the final emitted outputs of Reduce before returning the list of result keys
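A small illustration (plain Java, not the framework's internals) of how the comparison function determines the order in which grouped keys reach Reduce:

    import java.util.*;

    public class SortedGrouping {
        public static void main(String[] args) {
            // Grouped map outputs are kept sorted by the application's
            // comparison function, so Reduce sees keys in that order;
            // here: plain lexicographic order.
            TreeMap<String, List<Integer>> grouped = new TreeMap<>(Comparator.naturalOrder());
            grouped.computeIfAbsent("banana", k -> new ArrayList<>()).add(1);
            grouped.computeIfAbsent("apple", k -> new ArrayList<>()).add(1);
            System.out.println(grouped.firstKey()); // "apple" reaches Reduce first
        }
    }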

MapReduce Reduce Step


The master node collects the answers to all the sub-problems and combines them to form the output: the answer to the problem it was originally trying to solve
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value or an empty return, which is collected as the desired result list

MapReduce Output Writer


Writes the output of the Reduce to stable storage, usually a distributed file system

MapReduce Parallel Execution

MapReduce Distribution & Reliability


Achieves reliability by parceling out a number of operations on the set of data to each node in the network
Each node is expected to report back periodically with completed work and status updates; if a node falls silent for longer than that interval, the master node records the node as dead and sends out the node's assigned work to other nodes (a simplified sketch follows)
The master node attempts to schedule reduce operations on the same node, or in the same rack, as the node holding the data being operated on; this is desirable because it conserves bandwidth across the backbone network of the datacenter
Implementations are not necessarily highly reliable; for example, in Hadoop the NameNode is a single point of failure for the distributed file system
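The failure-detection rule can be sketched as follows (illustrative Java with hypothetical names and an assumed timeout value, not an actual implementation):

    import java.util.*;

    class FailureDetector {
        static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed reporting interval
        final Map<String, Long> lastHeartbeat = new HashMap<>();

        void onHeartbeat(String node) {
            lastHeartbeat.put(node, System.currentTimeMillis());
        }

        // Nodes silent for longer than the interval are recorded as dead;
        // the master would then resend their assigned work to other nodes.
        List<String> deadNodes() {
            long now = System.currentTimeMillis();
            List<String> dead = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) dead.add(e.getKey());
            }
            return dead;
        }
    }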

MapReduce Uses
MapReduce helps organizations process and analyze large volumes of multi-structured data that are often difficult to handle using the standard SQL employed by relational DBMSs
Uses include: distributed pattern-based searching, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation
The MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments
At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses

Hadoop

What is Hadoop?
An open-source Java software framework from the Apache Software Foundation that supports data-intensive distributed (file system) applications
Addresses the problem of tremendous amounts of data that need to be analyzed and processed very quickly
Allows for the distributed processing of large data sets across clusters of computers using simple programming models
Delivers a highly available service on top of a cluster of computers
Designed to scale up from single servers to thousands of machines and to detect and handle failures at the application layer

Hadoop Main Components

What is Hive?
Hive has gained the most acceptance in the industry
Main benefit: it dramatically improves the simplicity and speed of MapReduce development
Its SQL-like syntax makes it easy to use for non-programmers who are comfortable with SQL
HiveQL statements are entered using a command line or Web interface, or may be embedded in applications that use ODBC and JDBC interfaces to the Hive system (see the sketch after this list)
The Hive Driver system converts the query statements into a series of MapReduce jobs
Data files in Hive are presented as tables (and views), but Hive does not support primary or foreign keys or constraints of any type
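Since HiveQL can be submitted through JDBC, a minimal Java sketch of doing so is shown below; the host, port, credentials, and table name are placeholders, and the HiveServer2 JDBC driver (org.apache.hive.jdbc.HiveDriver) is assumed to be on the classpath:

    import java.sql.*;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2's JDBC URL form; host and port are placeholders.
            String url = "jdbc:hive2://hive-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement();
                 // The Hive Driver system turns this SQL-like statement
                 // into a series of MapReduce jobs on the cluster.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM word_log GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }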

Hive Main Components

Hadoop Architecture

Hadoop Cluster Architecture


To eliminate overhead impeding performance
o No server virtualization o No hypervisor layer

Runs best on Linux machines, working directly with the underlying hardware
Utilizes rack servers (not blades) populated in racks connected to a top-of-rack switch
The majority of the servers will be Slave nodes with lots of local disk storage and moderate amounts of CPU and DRAM
Some machines will be Master nodes, which might have a slightly different configuration favoring more DRAM and CPU and less local storage

Hadoop

Hadoop Deployment Machine Roles


Client machines: load data into the cluster, submit MapReduce jobs, and then retrieve the results of the job when it is finished
Master nodes: oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS) and running parallel computations on all that data (MapReduce)
Name Node: oversees, coordinates, and controls the data storage function (HDFS), and manages access control
Job Tracker: hands out tasks to the slave nodes, and oversees and coordinates the parallel processing of data using MapReduce
Slave nodes: make up the majority of machines and do the work of storing the data and running the computations
o Each slave runs both a Data Node and a Task Tracker daemon that communicate with and receive instructions from their master nodes
o The Task Tracker daemon is a slave to the Job Tracker; the Data Node daemon is a slave to the Name Node

Hadoop MapReduce Job Flow

Hadoop MapReduce Job Flow


The driver program submits the job configuration to the JobTracker node
JobTracker: splits the job into individual tasks and submits them to the respective TaskTrackers, which reside on the DataNodes themselves, or at least on the same rack where the data is present
TaskTracker: receives its share of input data and starts processing the map function specified by the configuration
When all Map tasks are completed, the JobTracker asks the TaskTrackers to start processing the reduce function
The JobTracker deals with failed and unresponsive tasks by running backup tasks; whichever node completes first gets its output accepted
When both the Map and Reduce tasks are completed, the JobTracker notifies the client program and dumps the output into the specified output directory
