Optimizing MapReduce Job Performance
Todd Lipcon (@tlipcon), Cloudera
June 14, 2012

Introductions
* Software Engineer at Cloudera since 2009
* Committer and PMC member on HDFS, MapReduce, and HBase
* Spend lots of time looking at full-stack performance
* This talk is to help you develop faster jobs
  - If you want to hear about how we made Hadoop faster, see my Hadoop World 2011 talk on cloudera.com

Aspects of Performance
* Algorithmic performance
  - Big-O, join strategies, data structures, asymptotes
* Physical performance
  - Hardware (disks, CPUs, etc.)
* Implementation performance
  - Efficiency of code, avoiding extra work
  - Make good use of the available physical performance

Performance fundamentals
* You can't tune what you don't understand
  - MR's strength as a framework is its black-box nature
  - To get optimal performance, you have to understand the internals
* This presentation: understanding the black box

Performance fundamentals (2)
* You can't improve what you can't measure
  - Ganglia/Cacti/Cloudera Manager/etc. are a must
  - Top 4 metrics: CPU, memory, disk, network
  - MR job metrics: slot-seconds, CPU-seconds, task wall-clock time, and I/O (see the sketch after these slides for reading these programmatically)
* Before you start: run jobs, gather data

Graphing bottlenecks
[Slide shows Ganglia cluster graphs for the last hour: cluster load, CPU, memory, and network, annotated over the map phase of a job. Annotations: "This job might be CPU-bound"; most jobs are not CPU-bound; plenty of free RAM, so perhaps the job can make better use of it; the network graph is flat-topped, suggesting a network bottleneck.]
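The "run jobs, gather data" advice maps onto the MapReduce counters API. Below is a minimal sketch (not from the slides) of pulling the job-level metrics the slide lists, assuming the Hadoop 2.x `org.apache.hadoop.mapreduce` API and a job that has already completed; the `JobMetrics`/`report` names are hypothetical, and counter availability varies across Hadoop versions (MRv1 exposes similar counters through `JobClient` and the JobTracker web UI).

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.FileSystemCounter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMetrics {
  // Print per-job metrics (CPU-seconds, slot-seconds, HDFS I/O)
  // for a job that has already finished.
  public static void report(Job job) throws Exception {
    Counters c = job.getCounters();

    // CPU time consumed by all tasks, in milliseconds.
    long cpuMillis = c.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();

    // Slot-milliseconds are an MRv1 concept; under YARN these counters
    // may be zero or deprecated depending on the Hadoop version.
    long mapSlotMs = c.findCounter(JobCounter.SLOTS_MILLIS_MAPS).getValue();
    long redSlotMs = c.findCounter(JobCounter.SLOTS_MILLIS_REDUCES).getValue();

    // Bytes read/written against HDFS by all tasks of the job.
    long hdfsRead    = c.findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();
    long hdfsWritten = c.findCounter("HDFS", FileSystemCounter.BYTES_WRITTEN).getValue();

    System.out.printf("CPU-seconds:        %.1f%n", cpuMillis / 1000.0);
    System.out.printf("Slot-seconds:       %.1f%n", (mapSlotMs + redSlotMs) / 1000.0);
    System.out.printf("HDFS bytes read:    %d%n", hdfsRead);
    System.out.printf("HDFS bytes written: %d%n", hdfsWritten);
  }
}
```

Typically you would call something like `JobMetrics.report(job)` right after `job.waitForCompletion(true)` returns, or read the same counters from the job history UI, alongside the cluster-level graphs from Ganglia/Cacti/Cloudera Manager.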