Unit 5 MapReduce and YARN

Topic: The architecture of YARN

© Copyright IBM Corp. 2018
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Hadoop v1 to Hadoop v2

[Slide: Hadoop 1.0 is a single-use system running batch apps on HDFS (redundant, reliable storage); Hadoop 2.0 is a multi-purpose platform supporting batch, interactive, online, and streaming applications.]

The most notable change from Hadoop v1 to Hadoop v2 is the separation of cluster and resource management from the execution and data processing environment. This separation allows a new variety of application types to run, including MapReduce v2.

Architecture of MRv1

[Slide: the classic version of MapReduce (MRv1). A single JobTracker schedules jobs submitted by clients and assigns tasks to TaskTrackers on the data nodes.]

The effect is seen most prominently in overall job control. In MapReduce v1 there is just one JobTracker, which is responsible for allocating resources, assigning tasks to data nodes (as TaskTrackers), and ongoing monitoring ("heartbeat") as each job runs: the TaskTrackers constantly report back to the JobTracker on the status of each running task.
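The single-JobTracker control loop described above can be sketched as a toy simulation. All class and method names below are illustrative (the real MRv1 JobTracker is Java code inside Hadoop, not this script); the point is only to show the heartbeat-and-reassign pattern that made one process responsible for every running task:

```python
# Toy model of MRv1 job control: one JobTracker tracks every TaskTracker
# through periodic heartbeats and reclaims tasks when a tracker goes silent.
HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a tracker is "lost"

class JobTracker:
    def __init__(self):
        self.last_heartbeat = {}   # tracker name -> timestamp of last report
        self.assignments = {}      # tracker name -> list of task ids

    def assign(self, tracker, task_id, now):
        self.assignments.setdefault(tracker, []).append(task_id)
        self.last_heartbeat.setdefault(tracker, now)

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def lost_trackers(self, now):
        return [t for t, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

    def reassign_from_lost(self, now):
        """Move tasks from silent trackers back to a pending queue."""
        pending = []
        for t in self.lost_trackers(now):
            pending.extend(self.assignments.pop(t, []))
            del self.last_heartbeat[t]
        return pending

jt = JobTracker()
jt.assign("tracker-a", "task-1", now=100.0)
jt.assign("tracker-b", "task-2", now=100.0)
# tracker-b stops reporting; by t=110 only tracker-a has phoned home recently
jt.heartbeat("tracker-a", now=109.0)
print(jt.reassign_from_lost(now=110.0))  # -> ['task-2']
```

Because every heartbeat and every reassignment flows through this one process, the JobTracker was both a scalability bottleneck and a single point of failure, which is exactly what the YARN redesign addresses.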
YARN architecture

[Slide: high-level architecture of YARN. The ResourceManager (RM) keeps track of the NodeManagers and their free resources and allocates resources to applications and tasks. Each NodeManager (NM) monitors the processes running in its containers and reports their status. An ApplicationMaster (AM) runs per application (for example, MapReduce or Giraph) and requests containers for its tasks. A client can submit any type of application supported by YARN, and containers can have different sizes (for example, RAM and CPU).]

In the YARN architecture, a global ResourceManager runs as a master daemon, usually on a dedicated machine, that arbitrates the available cluster resources among the various competing applications. The ResourceManager tracks how many live nodes and resources are available on the cluster and coordinates which applications submitted by users should get these resources, and when. The ResourceManager is the single process that has this information, so it can make its allocation (or rather, scheduling) decisions in a shared, secure, and multi-tenant manner (for instance, according to application priority, queue capacity, ACLs, data locality, and so on).

When a user submits an application, an instance of a lightweight process called the ApplicationMaster is started to coordinate the execution of all tasks within the application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and calculating the total values of application counters. These responsibilities were previously assigned to the single JobTracker for all jobs. The ApplicationMaster and the tasks that belong to its application run in resource containers controlled by the NodeManagers.

The NodeManager is a more generic and efficient version of the TaskTracker.
Instead of having a fixed number of map and reduce slots, the NodeManager has a number of dynamically created resource containers. The size of a container depends upon the amount of resources it contains, such as memory, CPU, disk, and network IO. Currently, only memory and CPU (YARN-3) are supported; cgroups might be used to control disk and network IO in the future. The number of containers on a node is a product of configuration parameters and the total amount of node resources (such as total CPUs and total memory), excluding the resources dedicated to the slave daemons and the OS.

Interestingly, the ApplicationMaster can run any type of task inside a container. For example, the MapReduce ApplicationMaster requests a container to launch a map or a reduce task, while the Giraph ApplicationMaster requests a container to run a Giraph task. You can also implement a custom ApplicationMaster that runs specific tasks and, in this way, invent a shiny new distributed application framework that changes the big data world. I encourage you to read about Apache Twill, which aims to make it easy to write distributed applications sitting on top of YARN.

In YARN, MapReduce is simply reduced to the role of a distributed application (but still a very popular and useful one) and is now called MRv2. MRv2 is simply the re-implementation of the classical MapReduce engine, now called MRv1, that runs on top of YARN.

Terminology changes from MRv1 to YARN

YARN terminology         | Instead of MRv1 terminology
-------------------------|-------------------------------------------
ResourceManager          | Cluster Manager
ApplicationMaster        | JobTracker (but dedicated and short-lived)
NodeManager              | TaskTracker
Distributed Application  | One particular MapReduce job
Container                | Slot

YARN

* Acronym for Yet Another Resource Negotiator
* New resource manager included in Hadoop 2.x and later
* De-couples Hadoop workload and resource management
* Introduces a general-purpose application container
* Hadoop 2.2.0 includes the first GA version of YARN
* Most Hadoop vendors support YARN

YARN is a key component in Hortonworks Data Platform.

YARN high-level architecture

* In Hortonworks Data Platform (HDP), users can take advantage of YARN and applications written to YARN APIs.

[Slide: YARN sits between HDFS and applications such as Script, Hive, HBase, and Spark.]

Running an application in YARN (1 of 7) through (7 of 7)

[Slides: a sequence of diagrams showing a client submitting a resource request to the ResourceManager, which allocates containers on NodeManagers across the cluster nodes, where the ApplicationMaster and its tasks then run.]

How YARN runs an application

To run an application on YARN, a client contacts the resource manager and asks it to run an application master process (step 1). The resource manager then finds a node manager that can launch the application master in a container (steps 2a and 2b). Precisely what the application master does once it is running depends on the application. It could simply run a computation in the container it is running in and return the result to the client. Or it could request more containers from the resource manager (step 3) and use them to run a distributed computation (steps 4a and 4b). This is well described in: White, T. (2015). Hadoop: The definitive guide (4th ed.). Sebastopol, CA: O'Reilly Media, p. 80.
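The steps just described can be sketched as a toy simulation. The class names below mirror the YARN roles but are purely illustrative (the real APIs are Java interfaces such as AMRMClient, not these classes):

```python
# Toy walk-through of the YARN submission flow:
# 1) the client asks the ResourceManager to run an ApplicationMaster,
# 2) the RM picks a NodeManager and launches the AM in a container,
# 3) the AM requests more containers from the RM,
# 4) tasks run in those containers on the NodeManagers.

class NodeManager:
    def __init__(self, name, free_containers):
        self.name = name
        self.free_containers = free_containers

    def launch(self, what):
        assert self.free_containers > 0, "no free container on " + self.name
        self.free_containers -= 1
        return (self.name, what)

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, what):
        # Pick the node with the most free containers. Real YARN schedulers
        # also weigh queue capacity, priorities, ACLs, and data locality.
        nm = max(self.node_managers, key=lambda n: n.free_containers)
        return nm.launch(what)

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_tasks(self, task_ids):
        # Steps 3 and 4: ask the RM for one container per task.
        return [self.rm.allocate(t) for t in task_ids]

rm = ResourceManager([NodeManager("node134", 2), NodeManager("node136", 3)])
am_container = rm.allocate("ApplicationMaster")      # steps 1, 2a, 2b
am = ApplicationMaster(rm)
placements = am.run_tasks(["map-0", "map-1", "reduce-0"])
print(am_container)
print(placements)
```

Note that the ResourceManager only hands out containers; the per-application logic (which tasks to run, retries, speculation) lives entirely in the ApplicationMaster, which is what lets frameworks other than MapReduce share the same cluster.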
Container Java command line

* Container JVM command (generally behind the scenes)
* Launched by the "yarn" user with /bin/bash -c
* If you count "java" process ids (pids) running under the "yarn" user, you will see 2X the number of containers: each container has a /bin/bash wrapper process and the java process it launches, for example:

  00:00:00 /bin/bash -c /hdp/jdk/bin/java ...
  00:00:00 /bin/bash -c /hdp/jdk/bin/java ...
  00:07:48 /hdp/jdk/bin/java -Djava.net.pr...
  00:08:21 /hdp/jdk/bin/java -Djava.net.pr...

This is for the more experienced user. A full container launch command looks like the following (wrapped here for readability):

yarn 1251527 1199943 0 14:38 ? 00:00:00 /bin/bash -c /hdp/jdk/bin/java
  -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
  -Xmx2000m -Xms1000m -Xmn100m -Xtune:virtualized
  -Xshareclasses:name=mrscc_%g,groupAccess,cacheDir=/var/hdp/hadoop/tmp,nonFatal
  -Xscmx20m
  -Xdump:java:file=/var/hdp/hadoop/tmp/javacore.%Y%m%d.%H%M%S.%pid.%seq.txt
  -Xdump:heap:file=/var/hdp/hadoop/tmp/heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd
  -Djava.io.tmpdir=/data6/yarn/local/nodemanager-local/usercache/bigsql/appcache/application_1417731580977_0002/container_1417731580977_0002_01_000095/tmp
  -Dlog4j.configuration=container-log4j.properties
  -Dyarn.app.container.log.dir=/var/hdp/hadoop/yarn/logs/application_1417731580977_0002/container_1417731580977_0002_01_000095
  -Dyarn.app.container.log.filesize=0
  -Dhadoop.root.logger=INFO,CLA
  org.apache.hadoop.mapred.YarnChild 9.30.75.55 51923 attempt_1417731580977_0002_m_000073_0 95
  1>/var/hdp/hadoop/yarn/logs/application_1417731580977_0002/container_1417731580977_0002_01_000095/stdout
  2>/var/hdp/hadoop/yarn/logs/application_1417731580977_0002/container_1417731580977_0002_01_000095/stderr
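The "2X pids" observation above — each container contributing both a /bin/bash wrapper and the java process it execs — can be checked by parsing ps output. A minimal sketch (the sample ps lines are abbreviated fakes for illustration; on a live node you would feed in real `ps -ef` output):

```python
# Count container wrapper and JVM processes owned by the "yarn" user
# in `ps -ef`-style output.
sample_ps = """\
yarn 1251527 1199943 0 14:38 ? 00:00:00 /bin/bash -c /hdp/jdk/bin/java -Xmx2000m
yarn 1251530 1251527 5 14:38 ? 00:07:48 /hdp/jdk/bin/java -Djava.net.preferIPv4Stack=true
yarn 1251601 1199943 0 14:39 ? 00:00:00 /bin/bash -c /hdp/jdk/bin/java -Xmx2000m
yarn 1251604 1251601 6 14:39 ? 00:08:21 /hdp/jdk/bin/java -Djava.net.preferIPv4Stack=true
"""

def count_container_processes(ps_output, user="yarn"):
    wrappers, jvms = 0, 0
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != user:
            continue
        cmd = fields[7]  # the 8th column of `ps -ef` is the command
        if cmd == "/bin/bash":
            wrappers += 1          # the bash -c wrapper
        elif cmd.endswith("/java"):
            jvms += 1              # the JVM the wrapper launched
    return wrappers, jvms

wrappers, jvms = count_container_processes(sample_ps)
print(wrappers, jvms)   # two containers -> 2 wrappers + 2 JVMs
```

The parent pid column (third field) shows the pairing directly: each JVM's parent is one of the bash wrappers.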
Provisioning, managing, and monitoring

* The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
* Ambari enables system administrators to:
  - Provision a Hadoop cluster
    - Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
    - Ambari handles configuration of Hadoop services for the cluster.
  - Manage a Hadoop cluster
    - Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
  - Monitor a Hadoop cluster
    - Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.
    - Ambari leverages Ganglia for metrics collection.
    - Ambari leverages Nagios for system alerting and will send emails when your attention is needed (for example, a node goes down, remaining disk space is low, etc.).
* Ambari enables application developers and system integrators to:
  - Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.

This is just a reminder of the role of Ambari in Hadoop 2. With Hadoop 2 and YARN there is a greater need for provisioning, management, and monitoring because of the greater complexity. In the final unit you will look at Ambari Slider as a mechanism for dynamically changing requirements at run time for long-running jobs.
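As a taste of the REST APIs mentioned above, the sketch below builds (but does not send) the request Ambari accepts for changing a service's state. The server hostname, cluster name, and credentials are hypothetical placeholders; the endpoint shape follows the Ambari v1 REST API:

```python
import base64
import json
import urllib.request

AMBARI = "http://ambari.example.com:8080"   # hypothetical Ambari server

def start_service_request(cluster, service, user="admin", password="admin"):
    """Build a PUT request asking Ambari to move a service to STARTED."""
    url = f"{AMBARI}/api/v1/clusters/{cluster}/services/{service}"
    body = json.dumps({
        "RequestInfo": {"context": f"Start {service} via REST"},
        "Body": {"ServiceInfo": {"state": "STARTED"}},
    }).encode()
    req = urllib.request.Request(url, data=body, method="PUT")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Ambari rejects state-changing requests without this header.
    req.add_header("X-Requested-By", "ambari")
    return req

req = start_service_request("MyCluster", "YARN")
print(req.get_method(), req.full_url)
# To actually send it against a live server: urllib.request.urlopen(req)
```

Sending the same body with `"state": "INSTALLED"` stops the service, which is how scripted cluster management is typically layered on top of the web UI.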
Spark with Hadoop 2+

* Spark is an alternative in-memory framework to MapReduce
* Supports general workloads as well as streaming, interactive queries, and machine learning, providing performance gains
* Spark SQL provides APIs that allow SQL queries to be embedded in Java, Scala, or Python programs in Spark
* MLlib: a Spark-optimized library supporting machine learning functions
* GraphX: an API for graphs and graph-parallel computation
* Spark Streaming: write applications to process streaming data in Java, Scala, or Python

Apache Spark is a new, alternative, in-memory framework to MapReduce. Spark is the subject of the next unit.
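The performance gains claimed above come largely from keeping intermediate results in memory between stages, where a chain of MapReduce jobs writes each stage's output to HDFS and reads it back. A toy illustration of that difference in pure Python (not the Spark API; temporary local files stand in for HDFS):

```python
# Contrast the two execution styles on a three-stage pipeline.
import json
import os
import tempfile

data = list(range(10))
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def run_with_disk(data, stages):
    """MapReduce-style: every stage round-trips its output through storage."""
    current = data
    for stage in stages:
        fd, path = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump([stage(x) for x in current], f)
        with open(path) as f:
            current = json.load(f)
        os.remove(path)
    return current

def run_in_memory(data, stages):
    """Spark-style: intermediate results stay in memory end to end."""
    current = data
    for stage in stages:
        current = [stage(x) for x in current]
    return current

assert run_with_disk(data, stages) == run_in_memory(data, stages)
print(run_in_memory(data, stages))
```

Both paths compute the same answer; the in-memory path simply skips the serialize-write-read-deserialize cycle between stages, which is where multi-stage MapReduce pipelines spend much of their time.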
