
201403

Cloudera Developer Training for
Apache Hadoop: Instructor Guide
CONFIDENTIAL
This guide is confidential, and contains Cloudera proprietary information. It must not
be made available to anyone other than Cloudera instructors and approved Partner
instructors.
Version Release Date Description
201403 03/18/2014 Minor bugfixes, new product slides in Chapter 1
201310 12/23/2013 Many improvements and major re-org
201301 02/19/2013 Minor typos and bugfixes
201212 12/27/2012 Re-written to use the New API. Also general bug fixes.
201210 11/2012 Re-brand
201209 09/24/2012 Re-organization of material, new exercises
201203 04/21/2012 Initial release
NOTE: The Exercise Instructions follow the course slides.


Suggested Course Timings


Per-Chapter Timings
  Arrivals and Registration [15 minutes total]
1. Introduction [40 minutes total]
• 40 minutes lecture
2. The Motivation for Hadoop [30 minutes total]
• 30 minutes lecture
3. Hadoop Basic Concepts and HDFS [1 hour, 20 minutes total]
• 60 minutes lecture
• 20 minutes exercise(s)
4. Introduction to MapReduce [60 minutes total]
• 60 minutes lecture
5. Hadoop Clusters and the Hadoop Ecosystem [1 hour, 15 minutes total]
• 60 minutes lecture
• 15 minutes exercise(s)
6. Writing a MapReduce Program in Java [2 hours, 30 minutes total]
• 90 minutes lecture
• 60 minutes exercise(s)
7. Writing a MapReduce Program Using Streaming [35 minutes total]
• 15 minutes lecture
• 20 minutes exercise(s)
8. Unit Testing MapReduce Programs [30 minutes total]
• 15 minutes lecture
• 15 minutes exercise(s)
9. Delving Deeper into the Hadoop API [2 hours, 30 minutes total]
• 90 minutes lecture
• 60 minutes exercise(s)
10. Practical Development Tips and Techniques [3 hours, 20 minutes total]
• 150 minutes lecture
• 50 minutes exercise(s)
11. Partitioners and Reducers [1 hour, 30 minutes total]


• 60 minutes lecture
• 30 minutes exercise(s)
12. Data Input and Output [3 hours, 5 minutes total]
• 120 minutes lecture
• 65 minutes exercise(s)
13. Common MapReduce Algorithms [2 hours, 35 minutes total]
• 105 minutes lecture
• 50 minutes exercise(s)
14. Joining Data Sets in MapReduce Jobs [30 minutes total]
• 30 minutes lecture
15. Integrating Hadoop into the Enterprise Workflow [1 hour, 5 minutes total]
• 50 minutes lecture
• 15 minutes exercise(s)
16. An Introduction to Hive, Impala, and Pig [1 hour, 20 minutes total]
• 60 minutes lecture
• 20 minutes exercise(s)
17. An Introduction to Oozie [40 minutes total]
• 25 minutes lecture
• 15 minutes exercise(s)
18. Conclusion [5 minutes total]
• 5 minutes lecture
A. Cloudera Enterprise [15 minutes total]
• 15 minutes lecture
  Final Questions and Post-Course Survey [15 minutes total]

Per-Day Timings
• Day 1
[Total classroom time: 6 hours, 15 minutes]
Complete all lectures and exercises for Chapters 1–5 and part of Chapter 6.
• Day 2
[Total classroom time: 6 hours, 30 minutes]


Complete Chapter 6, all lectures and exercises for Chapters 7–9, and part of Chapter
10.
• Day 3
[Total classroom time: 6 hours, 30 minutes]
Complete Chapter 10, all lectures and exercises for Chapters 11 and 12, and part of
Chapter 13.
• Day 4
[Total classroom time: 6 hours, 45 minutes]
Complete Chapter 13 and all lectures and exercises for Chapters 14–18.


Most Recent Changes


201403 Minor bugfixes, replace product slides in Chapter 1
201310 Major improvements and re-organization

Chapter Re-organization
• Chapter 3 “Hadoop: Basic Concepts” was split into three chapters:
◦ Chapter 3 “Hadoop Basic Concepts and HDFS”
◦ Chapter 4 “Introduction to MapReduce”
◦ Chapter 5 “Hadoop Clusters and the Hadoop Ecosystem”
• Chapter 4 “Writing MapReduce” was split into two chapters:
◦ Chapter 6 “Writing a MapReduce Program in Java”
◦ Chapter 7 “Writing a MapReduce Program using Streaming”
• New chapter “Partitioners and Reducers” added
◦ Consists of material from “Delving Deeper” (partitioners) and “Tools and
Tips” (number of reducers).
• Chapter 12 “Machine Learning and Mahout” was removed
• Appendix B “Graph Manipulation” was removed

Exercise Manual Changes


• New exercises added:
◦ “Using ToolRunner and Passing Parameters”
◦ “Testing with Local Job Runner”
◦ “Logging” (optional)
◦ “Implementing a Custom WritableComparable” (this used to be an optional add-on
to the Word Co-occurrence exercise; it is now a stand-alone exercise earlier in the
course)
• Bonus exercises:
◦ A new section at the end of the exercise manual for exercises students are not
expected to do during class; they can do them after class or during breaks, but no
class time is allotted
◦ One exercise for now: “Exploring a Secondary Sort” (optional)
• Eclipse is now the “default” assumption in Java-based exercises
• A “files and directories used in this exercise” sidebar added to every exercise
• Supplementary document provided: “Eclipse Exercise Instructions” for students
unfamiliar with Eclipse


Lab Environment Changes


• All Java code is now in packages: stubs, hints, solution (for exercises) and example
(for code presented in class).
◦ This allows students to view and work with all three at the same time.  (This is
similar to the recent update to the HBase class)
◦ A single script (~/scripts/developer/training_setup_dev.sh) copies all the
exercises into the Eclipse workspace (no need to import into Eclipse).  It also starts
and stops services as required for the course (e.g. turns off HBase)

General Slide Changes


• Many slides streamlined, with added or improved graphics, detailed code
examples, and illustrations of those examples
• New “color coding” of code in slides:
◦ Blue = Java or SQL/HiveQL
◦ Yellow = command line/interactive
◦ Grey = pseudo-code
• Standard icons/colors for Map (blue) and Reduce (green) are used throughout the
course
• A new “MapReduce Flow” overview diagram is used throughout the course

Minor Exercise Manual Bugs


• [CUR-1662] Correct exercise notes, page 3: Version number
• [CUR-1663] Correct exercise notes, page 9: Remove unneeded paragraph
• [CUR-1665] Exercise notes: Correct word, page 14.
• [CUR-1666] Exercise notes: Modify page 15 output for consistency with our program
• [CUR-1667] Exercise document: Correct page 20 output to match actual results
• [CUR-1668] Exercise document: Correct directory path on page 21
• [CUR-1669] Exercise document: Massage programming instruction on page 40.
• [CUR-1672] Exercise document: Correct character on page 46.
• [CUR-1950] Code typo in Hadoop Developer_Exercise_Instructions.pdf
• [CUR-1429] JobTracker Doesn’t Inform of Killing

Minor Slide Bugs


• [CUR-513] Slides 8-26 through 8-29 contradict TDGse
• [CUR-726] Hive and Pig slides in Developer Training notes.


• [CUR-1377] Slide 3-19 should say "copy" instead of "move"


• [CUR-1391] s/SORT BY/ORDER BY/ on hive sample answers in *developer* course
• [CUR-1490] instructor notes "Hadoop Streaming"
• [CUR-1501] curl Example is incorrect
• [CUR-2003] Dev. class says Impala in Beta
• [CUR-1956] Hadoop Developer Instructor Guide: 3 Should be 4

Minor VM Bugs
• [CUR-1302] Solution for Word CoOccurrence has funky data
• [CUR-1383] When using stubs_with_hints, mrunit tests do not fail when exercise is
starting out

Tasks
• [CUR-350] 11-9, 11-11: Modify diagram to convey "data warehouse"
• [CUR-511] Rewrite introduction and motivation to Hadoop
• [CUR-526] Sample code for file input format and record reader would be helpful
• [CUR-579] Rework material on Pig in the Dev course
• [CUR-684] Add Changing the Logging Level on a Job feature to the course
• [CUR-739] Labs need consistent documentation about input files, output files, JAR file
name, and source file names
• [CUR-948] Add logging lab
• [CUR-968] Secondary sort could use sample code
• [CUR-1119] Reorg of Dev Chapters 3 and 4
• [CUR-1189] Need solution for extra credit step (step 3) in Word Co-Occurrence
exercise
• [CUR-1208] Join chapter needs pictures
• [CUR-1221] Make Hint mode the default in Dev
• [CUR-1223] Grouping Comparator data example
• [CUR-1224] Secondary Sort -- Use data in examples
• [CUR-1239] Simplify Partitioner Exercise
• [CUR-1253] Add info on the boolean param for waitForCompletion on 04-37
• [CUR-1255] Combiner exercise has changed and now completely misses the point
• [CUR-1259] SumReducer for Partitioner exercise uses hasNext and iterator.next.
Why?


• [CUR-1288] Mention that Identity Function is Default for Mapper and Reducer
• [CUR-1384] Remove mention of addInput method from MRUnit slides and IG notes
• [CUR-1386] Slide 7-22 has a smart quote
• [CUR-1387] Slide 11-13: "Cannot serve interactive queries"
• [CUR-1416] Reduce-Side Join - Remove Mapper abstraction slide(s) and isPrimary
reference
• [CUR-1417] Show example data/diagram/flow within Reduce-side join
• [CUR-1439] Writing and Implementing a Combiner Copy Instructions Incorrect For
Eclipse
• [CUR-1444] Add Oozie Examples to Oozie Links
• [CUR-1470] Secondary Sort – Illustration
• [CUR-1471] Illustrations for Oozie and Flume in Dev. Ecosystem slide deck
• [CUR-1481] Reducer example for Streaming?
• [CUR-1482] Combiner exercise assumes you are not using eclipse
• [CUR-1545] mapreduce.job.reduces doesn’t really work
• [CUR-1664] Exercise notes: Add comment on the less command, page 13
• [CUR-1673] Exercise document: Add note about the answers to page 46.
• [CUR-1713] Add ‘Essential Points’ slides, remove conclusion slide from each chapter
• [CUR-1828] Partitioner lab: More explicit on how to copy files
• [CUR-1978] Hadoop Developer Instructor Guide Needs To Update LineRecordReader
Algorithm

Cloudera Developer Training
for Apache Hadoop

201403

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior
written consent from Cloudera.
Introduction
Chapter 1

Chapter Goal
This chapter is intended to inform students what to expect from the course and for the instructor to learn
about the students’ level of expertise as well as how they plan to apply what they’ll learn.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-2
Trademark Information
▪ The names and logos of Apache products mentioned in Cloudera training
courses, including those listed below, are trademarks of the Apache Software
Foundation
Apache Accumulo Apache Lucene
Apache Avro Apache Mahout
Apache Bigtop Apache Oozie
Apache Crunch Apache Parquet
Apache Flume Apache Pig
Apache Hadoop Apache Sentry
Apache HBase Apache Solr
Apache HCatalog Apache Spark
Apache Hive Apache Sqoop
Apache Impala (incubating) Apache Tika
Apache Kafka Apache Whirr
Apache Kudu Apache ZooKeeper

▪ All other product names, logos, and brands cited herein are the property of
their respective owners

This slide is intended to clearly convey to students that, while we may sometimes refer to products like
Hadoop, Hive, and Impala later in the course, these are simply shorthand for the longer and more formal
names. Apache Hadoop, as well as many related software projects that Cloudera helps to develop and
distribute, is owned by the Apache Software Foundation (ASF).
Throughout its history, Cloudera has been strongly committed to a community-driven, Hadoop-based
platform based on open standards that meets the highest enterprise expectations for stability and
reliability. Cloudera’s Chief Architect, Doug Cutting, served as Director of the ASF for more than five years.
He has been an Apache committer for more than 15 years, serving alongside dozens of other Clouderans
who also help to work on many of the open source projects. In fact, Cloudera employees have founded
more than 20 successful Hadoop ecosystem projects, including Apache Hadoop itself. Cloudera is a
Platinum-level sponsor of the ASF. http://apache.org/foundation/thanks.html
Additionally, course material may make occasional and incidental references to other product names
covered by trademark, such as commercial software from partner companies or brand names for hardware
on which one might deploy a cluster. Reference to any products, services, processes or other information,
by trade name, trademark, manufacturer, supplier, or otherwise does not necessarily constitute or imply
endorsement, sponsorship, or recommendation by Cloudera.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-3
Chapter Topics
Introduction

▪ About This Course


▪ About Cloudera
▪ Course Logistics
▪ Introductions

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-4
Course Objectives (1)
During this course, you will learn
▪ The core technologies of Hadoop
▪ How HDFS and MapReduce work
▪ How to develop and unit test MapReduce applications
▪ How to use MapReduce combiners, partitioners, and the distributed cache
▪ Best practices for developing and debugging MapReduce applications

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-5
Course Objectives (2)
▪ How to implement custom data input and output formats in MapReduce
applications
▪ Algorithms for common MapReduce tasks
▪ How to join datasets in MapReduce
▪ How Hadoop integrates into the data center
▪ How Hive, Impala, and Pig can be used for rapid application development
▪ How to create workflows using Oozie

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-6
Chapter Topics
Introduction

▪ About This Course


▪ About Cloudera
▪ Course Logistics
▪ Introductions

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-7
About Cloudera (1)

▪ The leader in Apache Hadoop-based software and services


─ Our customers include many key users of Hadoop
▪ Founded by Hadoop experts from Facebook, Yahoo, Google, and Oracle
▪ Provides support, consulting, training, and certification for Hadoop users
▪ Staff includes committers to virtually all Hadoop projects
▪ Many authors of authoritative books on Apache Hadoop projects

Christophe Bisciglia from Google, Amr Awadallah from Yahoo, Mike Olson from Oracle, and Jeff
Hammerbacher from Facebook founded Cloudera in 2008. Our staff also includes the co-creator of Hadoop
and former ASF chairperson, Doug Cutting, as well as many involved in the project management committee
(PMC) of various Hadoop-related projects. The person who literally wrote the book on Hadoop, Tom White,
also works for Cloudera (Hadoop: The Definitive Guide).
There are many Cloudera employees who have written or co-authored books on Hadoop-related topics,
and you can find an up-to-date list in our list of Hadoop ecosystem books. Instructors are encouraged
to point out that many of these books are available to students in our classes at a substantial discount
through O’Reilly’s Cloudera discount.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-8
About Cloudera (2)
▪ We have a variety of training courses:
─ Developer Training for Apache Spark and Hadoop
─ Cloudera Administrator Training for Apache Hadoop
─ Cloudera Data Analyst Training
─ Cloudera Search Training
─ Data Science at Scale using Spark and Hadoop
─ Cloudera Training for Apache HBase
▪ We offer courses online OnDemand and in instructor-led physical and virtual
classrooms
▪ We also offer private courses:
─ Can be delivered on-site, virtually, or online OnDemand
─ Can be tailored to suit customer needs

When discussing OnDemand offerings, you can give them https://ondemand.cloudera.com/, which
gives a nice listing of all the courses and info/about pages.
You can see a list of customers that we can reference on our website
http://www.cloudera.com/customers.html. Note that Cloudera also has many customers
who do not wish us to refer to them, and it is essential that we honor this. The only exception to
this important rule is that you may refer to material that was intentionally made public, in which
Cloudera or that customer has disclosed that they are a Cloudera customer. For example, it is
permissible to mention an article in a reputable trade publication in which Cloudera’s CEO mentions a
specific customer or the keynote address that the customer’s CTO gave at the Strata conference talking
about the benefits they’ve experienced as a Cloudera customer.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-9
About Cloudera (3)
▪ In addition to our public training courses, Cloudera offers two levels
of certifications. All Cloudera professional certifications are hands-on,
performance-based exams requiring you to complete a set of real-world tasks
on a working multi-node CDH cluster
▪ Cloudera Certified Professional (CCP)
─ The industry’s most demanding performance-based certification, CCP Data
Engineer evaluates and recognizes your mastery of the technical skills most
sought after by employers
─ CCP Data Engineer
▪ Cloudera Certified Associate (CCA)
─ To successfully achieve CCA certification, you complete a set of core tasks on
a working CDH cluster instead of guessing at multiple-choice questions
─ CCA Spark and Hadoop Developer
─ CCA Data Analyst
─ CCA Administrator

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-10
CDH
CDH (Cloudera’s Distribution including Apache Hadoop)

▪ 100% open source, enterprise-ready distribution of Hadoop and related projects
▪ The most complete, tested, and widely deployed distribution of Hadoop
▪ Integrates all the key Hadoop ecosystem projects
▪ Available as RPMs and Ubuntu, Debian, or SuSE packages, or as a tarball

You can think of CDH as analogous to what Red Hat does with Linux: although you could download the
“vanilla” kernel from kernel.org, in practice, nobody really does this. Students in this class are probably
thinking about using Hadoop in production and for that they’ll want something that’s undergone greater
testing and is known to work at scale in real production systems. That’s what CDH is: a distribution which
includes Apache Hadoop and all the complementary tools they’ll be learning about in the next few days, all
tested to ensure the different products work well together and with patches that help make it even more
useful and reliable. And all of this is completely open source, available under the Apache license from our
Web site.
RPM = Red Hat Package Manager.
Ubuntu, Debian, and SuSE are all Linux distributions like CentOS or Red Hat.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-11
Cloudera Express
▪ Cloudera Express
─ Completely free to download and use
▪ The best way to get started with Hadoop
▪ Includes CDH
▪ Includes Cloudera Manager
─ End-to-end administration for Hadoop
─ Deploy, manage, and monitor your cluster

Main point: Cloudera Express is free, and adds Cloudera-specific features on top of CDH, in particular
Cloudera Manager (CM).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-12
Cloudera Enterprise (1)

[Diagram: the Cloudera Enterprise stack]
▪ PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr)
▪ UNIFIED SERVICES: Resource Management (YARN), Security (Sentry, RecordService)
▪ OPERATIONS: Cloudera Manager, Cloudera Director
▪ DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
▪ STORE: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase), Other (Object Store)
▪ INTEGRATE: Batch (Sqoop), Real-Time (Kafka, Flume)

This slide is meant to highlight the differences among CDH (100% open source, 100% free), Cloudera
Express (mostly open source, except Cloudera Manager, but 100% free), and Cloudera Enterprise (mostly
open source, but with our key differentiators: Cloudera Manager, Cloudera Director, Cloudera Navigator,
Navigator Encrypt and Key Trustee, and Navigator Optimizer).
From: Cloudera Navigator Optimizer, “Cloudera Navigator Optimizer gives you the insights and risk-
assessments you need to build out a comprehensive strategy for Hadoop success. Simply upload your
existing SQL workloads to get started, and Navigator Optimizer will identify relative risks and development
costs for offloading these to Hadoop based on compatibility and complexity.”
From: Cloudera Director “Flexible, self-service deployment. Through an intuitive user interface (UI),
multiple user groups can quickly take advantage of the added speed and greater flexibility of the cloud with
self-service provisioning for the fastest time-to-value.”
From: Cloudera Security “Cloudera Navigator Encrypt provides massively scalable, high-performance
encryption for critical Hadoop data. Navigator Encrypt leverages industry-standard AES-256 encryption and
provides a transparent layer between the application and filesystem that dramatically reduces performance
impact of encryption. With automatic deployment through Cloudera Navigator and simple configuration,
you can secure your data with ease in minutes instead of days.”
“Cloudera Navigator Key Trustee is a ‘virtual safe-deposit box’ for managing encryption keys and other
security assets. It provides software-based key management that supports a variety of robust, configurable,
and easy-to-implement policies governing access to secure artifacts. In compliance with NIST requirements,
these keys and other Hadoop security assets are always stored separately from encrypted data and
wrapped in multiple layers of cryptography.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-13
Cloudera Enterprise (2)
▪ Subscription product including CDH and Cloudera
Manager
▪ Provides advanced features, such as
─ Operational and utilization reporting
─ Configuration history and rollbacks
─ Rolling updates and service restarts
─ External authentication (LDAP/SAML)
─ Automated backup and disaster recovery
▪ Specific editions offer additional capabilities, such as
─ Governance and data management (Cloudera Navigator)
─ Active data optimization (Cloudera Navigator Optimizer)
─ Comprehensive encryption (Cloudera Navigator Encrypt)
─ Key management (Cloudera Navigator Key Trustee)

• LDAP = Lightweight Directory Access Protocol


• SAML = Security Assertion Markup Language

Cloudera Enterprise formerly came in two versions, but now it comes in five:
• Basic Edition
• Data Engineering Edition
• Operational Database Edition
• Analytic Database Edition
• Enterprise Data Hub Edition
Details about the differences among the five can be found on the Cloudera Enterprise
datasheet.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-14
Chapter Topics
Introduction

▪ About This Course


▪ About Cloudera
▪ Course Logistics
▪ Introductions

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-15
Logistics
▪ Class start and finish time
▪ Lunch
▪ Breaks
▪ Restrooms
▪ Wi-Fi access
▪ Virtual machines

Your instructor will give you details on how to access the course materials and exercise instructions
for the class.

“Virtual machines” is a cue for the instructor to explain briefly how to perform hands-on exercises in this
class; for example, whether that is through virtual machines running locally or in the cloud. This is also a
good time to verify that the virtual machines are already running, and to start them if they are not.
The registration process for students is:

1. Visit http://training.cloudera.com/
2. Register as a new user (they should give an email address which they can check immediately, at the class
site, since the system will send a confirmation email)
3. Confirm registration by clicking on the link in the email
4. Log in if necessary
5. Enter the course ID and enrollment key (which the instructor will have received the week before the
class starts)
6. Download the exercise instructions and, if desired, slides
Emphasize that they must, at the very least, download the exercise instructions. Also, unless this is an
onsite course they should not download the VM—it’s already on the classroom machines, and trying to
download it will just swamp the training center’s bandwidth.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-16
Chapter Topics
Introduction

▪ About This Course


▪ About Cloudera
▪ Course Logistics
▪ Introductions

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-17
Introductions
▪ About your instructor
▪ About you
─ Experience with Hadoop?
─ Experience as a developer?
─ What programming languages do you use?
─ Expectations from the course?

Establish your credibility and enthusiasm here. You’ll likely want to mention your experience as an
instructor, plus any relevant experience as a developer, system administrator, DBA or business analyst. If
you can relate this to the audience (because you’re from the area or have worked in the same industry), all
the better.
It’s a good idea to draw out a grid corresponding to the seat layout and write students’ names down as they
introduce themselves, allowing you to remember someone’s name a few days later based on where they’re
sitting.
The outlines for all our courses are available online (http://university.cloudera.com/training/courses.html),
so you should be familiar with them and will know whether a student’s expectations from the course are
reasonable.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-18
The Motivation for Hadoop
Chapter 2

Chapter Goal
This chapter needs a goal.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-2
The Motivation For Hadoop
In this chapter, you will learn
▪ What problems exist with traditional large-scale computing systems
▪ What requirements an alternative approach should have
▪ How Hadoop addresses those requirements

And here’s what we’re going to learn in this chapter…

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-3
Chapter Topics
The Motivation for Hadoop

▪ Problems with Traditional Large-Scale Systems


▪ Introducing Hadoop
▪ Hadoop-able Problems
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-4
Traditional Large-Scale Computation
▪ Traditionally, computation has been
processor-bound
─ Relatively small amounts of data
─ Lots of complex processing
 
 
 
 
 
▪ The early solution: bigger computers
─ Faster processor, more memory
─ But even this couldn’t keep up

From the 1970s until the 1990s, most large-scale computing was based on having a single powerful
computer (like a Cray) and trying to improve performance and capacity by replacing it with another
machine that is faster and more powerful and has more memory.
Most supercomputing is based on doing intensive computations on relatively small amounts of data.
During the 1990s, even supercomputing moved away from this monolithic approach towards distributed
systems which use MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). (Condor is a batch
queuing system developed at University of Wisconsin at Madison that can distribute work among a cluster
of machines, either using its own library or through MPI and PVM).
The photo is of the Colossus Mark 2 computer. It was used during WWII to process the most important
“big data” of the day: decrypting the “vast quantity of encrypted high-level telegraphic messages between the
German High Command (OKW) and their army commands throughout occupied Europe”.
The illustration is a Cray-1 supercomputer (circa 1976).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-5
Distributed Systems
▪ The better solution: more computers
─ Distributed systems – use multiple machines
for a single job
 

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try
to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
—Grace Hopper

Public domain image from http://commons.wikimedia.org/wiki/File:Grace_Hopper.jpg
In addition to being the mother of modern computer science, she was also a programmer in WWII working
on the Mark 1 computer, and invented the first high-level computer language, FLOW-MATIC, which went
on to become COBOL.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-6
Distributed Systems: Challenges
▪ Challenges with distributed systems
─ Programming complexity
─ Keeping data and processes in sync
─ Finite bandwidth
─ Partial failures

Programming complexity: programs to manage data and processes on that data across hundreds or
thousands of nodes are very complex (expensive to develop, error-prone, hard to maintain, etc.)
Finite bandwidth: You have to balance the benefit of distribution against the time it takes to distribute the
data. Data is growing faster than the hardware (networks and disks) that carry and hold it.
Partial failures: with thousands of nodes, there WILL be failures. Systems must be developed to
accommodate this reality which adds to the complexity.
Also relevant are the “Fallacies of Distributed Computing” attributed to Peter Deutsch and others at Sun
Microsystems. Numbers 1, 3 and 7 are especially important to Hadoop’s design.

1. The network is reliable.


2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-7
Distributed Systems: The Data Bottleneck (1)
▪ Traditionally, data is stored in a central location
▪ Data is copied to processors at runtime
▪ Fine for limited amounts of data

In traditional systems, all your data is stored in a single place (e.g. a SAN [Storage Area Network]), and when
you need to process the data, it needs to be copied to the distributed nodes doing the computation.
A SAN can hold a lot of data, but getting the data off the SAN is a bottleneck.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-8
Distributed Systems: The Data Bottleneck (2)
▪ Modern systems have much more data
─ terabytes+ a day
─ petabytes+ total
▪ We need a new approach…

Modern systems have to deal with far more data than was the case in the past
Organizations are generating huge amounts of data
That data has inherent value, and cannot be discarded
Examples:
• RIM/Blackberry infrastructure: generates 500 TB/day of instrumentation data, 100+ PB total
• JP Morgan Chase – 150 PB total
• eBay – 9 PB stored on Hadoop + Exadata. Source: http://www.computerworld.com/s/article/359899/Hadoop_Is_Ready_for_the_Enterprise_IT_Execs_Say
• Facebook – over 70 PB of data
• Chevron: a single oil well generates 15 TB of data a day, and there are hundreds of thousands of wells
Quick calculation:
Typical disk data transfer rate: 75 MB/sec
Time taken to transfer 100 GB of data to the processor: 100 GB ÷ 75 MB/sec ≈ 22 minutes!
Assuming sustained reads
Actual time will be worse, since most servers have less than 100 GB of RAM available
A new approach is needed!

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-9
Chapter Topics
The Motivation for Hadoop

▪ Problems with Traditional Large-Scale Systems


▪ Introducing Hadoop
▪ Hadoop-able Problems
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-10
Hadoop
▪ A radical new approach to distributed computing
─ Distribute data when the data is stored
─ Run computation where the data is
▪ Originally based on work done at Google
▪ Open-source project overseen by the Apache Software Foundation
 

Hadoop is based on papers published by Google


Google File System (2003) - http://research.google.com/archive/gfs.html
MapReduce (2004) - http://research.google.com/archive/mapreduce.html
This work takes a radical new approach to the problem of distributed computing
Core concept: distribute the data as it is initially stored in the system
Individual nodes can work on data local to those nodes
No data transfer over the network is required for initial processing

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-11
Core Hadoop Concepts
▪ Applications are written in high-level code
▪ Nodes talk to each other as little as possible
▪ Data is distributed in advance
─ Bring the computation to the data
▪ Data is replicated for increased availability and reliability
▪ Hadoop is scalable and fault-tolerant

• Applications are written in high-level code.


The Hadoop framework handles the low level coordination of processing, data transfer over
the network, task management, node failure, etc. Developers do not worry about network
programming, temporal dependencies etc. (This addresses the “complexity” challenge mentioned
earlier)
• Nodes talk to each other as little as possible
Nodes communicate with the master, not with each other. (Remember, you might have hundreds
or thousands of nodes.) ‘Shared nothing’ architecture. (This is the ideal; occasional communication
does occur but it is minimal)
• Data is spread among machines in advance (as discussed on the last slide)
Computation happens where the data is stored, wherever possible
Data is replicated multiple times on the system for increased availability and reliability
• Scalable and fault-tolerant are covered on the next two slides

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-12
Scalability
▪ Adding nodes adds capacity proportionally
▪ Increasing load results in a graceful decline in performance
─ Not failure of the system
 

Horizontally scalable = add more computers, not make computers bigger.


The design of Hadoop attempts to avoid bottlenecks by limiting the role of “master” machines in the
cluster. There are currently production clusters with more than 4,000 nodes (Yahoo!) and work is currently
underway to further scale to at least 50% beyond that (HDFS Federation and MRv2).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-13
Fault Tolerance
▪ Node failure is inevitable
▪ What happens?
─ System continues to function
─ Master re-assigns tasks to a different node
─ Data replication = no loss of data
─ Nodes which recover rejoin the cluster automatically

Developers spend more time designing for failure than they do actually working on the problem itself.
Inevitability Example:
100 computers in a cluster
10 disks per computer = 1000 disks
If “mean time to failure” for one disk is 3 years = about 1000 days
Disk failure on average once per day
Fault tolerance requirements:
Failure of a component should result in a graceful degradation of application performance, not complete
failure of the entire system
If a component fails, its workload should be assumed by still-functioning units in the system
Failure should not result in the loss of any data
Component failures during execution of a job should not affect the outcome of the job
If a node fails, the master will detect that failure and re-assign the work to a different node on the system
Restarting a task does not require communication with nodes working on other portions of the data
If a failed node restarts, it is automatically added back to the system and assigned new tasks
Mention if asked: “Speculative execution”: If a node appears to be running slowly, the master can
redundantly execute another instance of the same task
Results from the first to finish will be used

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-14
Chapter Topics
The Motivation for Hadoop

▪ Problems with Traditional Large-Scale Systems


▪ Introducing Hadoop
▪ Hadoop-able Problems
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-15
Common Types of Analysis with Hadoop
▪ Text mining
▪ Index building
▪ Graph creation and analysis
▪ Pattern recognition
▪ Collaborative filtering
▪ Prediction models
▪ Sentiment analysis
▪ Risk assessment

Text mining is a broad category that includes some of these other things. Anytime you have text and want
to search or find patterns…
Index building: make text searchable – you don’t want to have to search every document every time you
want to search. So you build search indexes ahead of time, which is often done with Hadoop.
Graph: storing and analyzing. Example: social network graph. E.g. Facebook, LinkedIn…they suggest
connections for you. (“Do you know so-and-so?”) That’s graph analysis: who are my friends connected to that
I’m not yet connected to. Another example: finding the quickest path through a graph.
Pattern recognition: e.g. faces in satellite pictures; natural language processing: is this text in English,
Spanish, French?
Collaborative filtering: “a fancy way of saying ‘recommendations’”. e.g. on Amazon, when you view a
product, you view other products that people have also viewed/bought/searched for/clicked on.
Prediction models: How can I tell from what’s happened in the past what will happen in the future? How
popular will an upcoming book be so I can prepare for launch?
Sentiment analysis: Do people like me/my company? Are they happy with my service?
Risk assessment: e.g. look at financial data on someone and based on age and other factors, are they “at
risk” for defaulting on a loan?

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-16
What is Common Across Hadoop-able Problems?
▪ Nature of the data
─ Volume
─ Velocity
─ Variety
 
 
 
▪ Nature of the analysis
─ Batch processing
─ Parallel execution
─ Distributed data

What are the characteristics of the types of problems Hadoop can solve?
First, the data. “Volume, velocity, variety”.
Volume: Typically these problems have lots of data…hundred of TB to PB, and growing.
Variety: Structured data from RDBMSs, CSV, XML, etc.; semi-structured data such as log files; unstructured
data in text, html and pdf files. Also the data is often not as homogenous or “clean” as you would hope.
(e.g. are all your HTML files formatted properly? Is there always a close tag for every open tag?)
Velocity: many data sources, and you have to be able to take it all in as fast as it comes. E.g. log files from
many servers; seismic data from oil wells; credit card transactions; etc.
Analysis:
Batch processing – we don’t work on one little piece of data at a time, we batch it up, and work on it in
larger units.
Parallel execution across a cluster
Some analysis was previously impossible and can now be done.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-17
Benefits of Analyzing With Hadoop
▪ Previously impossible or impractical analysis
▪ Lower cost
▪ Less time
▪ Greater flexibility
▪ Near-linear scalability
▪ Ask Bigger Questions

Many companies are able to ask questions that previously were unanswerable, or the answers were so slow
that by the time the answers came, they were useless.
Lower cost: use commodity hardware instead of (some of) your expensive data systems. Reduce
development and maintenance time. Less IT involvement necessary to meet business needs.
Less time: fast is good! Some questions you may want to ask may take so long with your current system
that the answers aren’t relevant.
Flexibility: You need to be able to answer questions you didn’t anticipate when you started. If your data is
so large that you are forced to discard most of it and only save/process what you need right now, you are
unable to respond quickly to changing business needs.
Scalability: is your data growing? Maybe now you are only working with a subset of your data? No problem.
The system you invest in needs to be able to grow to accommodate all your data.
“empowering enterprises to Ask Bigger Questions(TM) and gain rich, actionable insights from all their data,
to quickly and easily derive real business value that translates into competitive advantage.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-18
Chapter Topics
The Motivation for Hadoop

▪ Problems with Traditional Large-Scale Systems


▪ Introducing Hadoop
▪ Hadoop-able Problems
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-19
Key Points
▪ Traditional large-scale computing involved complex processing on small
amounts of data
▪ Exponential growth in data drove development of distributed computing
▪ Distributed computing is difficult!
▪ Hadoop addresses distributed computing challenges
─ Bring the computation to the data
─ Fault tolerance
─ Scalability
─ Hadoop hides the ‘plumbing’ so developers can focus on the data

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-20
Hadoop Basic Concepts and HDFS
Chapter 3

Chapter Goal
This chapter needs a goal.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-2
Hadoop Basic Concepts and HDFS
In this chapter, you will learn
▪ What Hadoop is
▪ What features the Hadoop Distributed File System (HDFS) provides

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-3
Chapter Topics
Hadoop Basic Concepts and HDFS

▪ The Hadoop Project and Hadoop Components


▪ The Hadoop Distributed File System (HDFS)
▪ Hands-On Exercise: Using HDFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-4
Hadoop Components

Hadoop consists of two core components: The Hadoop Distributed File System (HDFS) and MapReduce
There are many other projects based around core Hadoop. Often referred to as the ‘Hadoop Ecosystem’:
Pig, Hive, HBase, Flume, Oozie, Sqoop, etc (Many are discussed later in the course)
Hadoop got its name from a stuffed elephant toy owned by Doug Cutting’s son. Names for other projects
related to Hadoop tend to use animal themes, particularly those related to elephants (Mahout is derived
from the Hindi word for “elephant driver” while Oozie is Burmese for “elephant handler”). Mahout is
generally pronounced in the Hadoop community as “muh-HAUT” (where the last syllable rhymes with
“doubt” or “clout”), though some (most notably, those actually from India) pronounce it as “muh-HOOT” (in
which the last syllable rhymes with “boot” or “loot”).
[Deleted: Pig, Hive and HBase are built on Hadoop, while Flume, Oozie and Sqoop help you use or integrate
Hadoop.]

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-5
Core Components: HDFS and MapReduce
▪ HDFS (Hadoop Distributed File System)
─ Stores data on the cluster
▪ MapReduce
─ Processes data on the cluster

If you need to do large-scale data processing, you need two things: a place to store large amounts of data
and a system for processing it. HDFS provides the storage and MapReduce provides a way of processing it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-6
A Simple Hadoop Cluster
▪ A Hadoop cluster: a group of machines
working together to store and process
data
 
▪ Any number of ‘slave’ or ‘worker’ nodes
─ HDFS to store data
─ MapReduce to process data
 
 
 
 
▪ Two ‘master’ nodes
─ Name Node: manages HDFS
─ Job Tracker: manages MapReduce

We will discuss how the cluster works in more detail later. Cover very lightly here.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-7
Chapter Topics
Hadoop Basic Concepts and HDFS

▪ The Hadoop Project and Hadoop Components


▪ The Hadoop Distributed File System (HDFS)
▪ Hands-On Exercise: Using HDFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-8
HDFS Basic Concepts (1)
▪ HDFS is a filesystem written in Java
─ Based on Google’s GFS
▪ Sits on top of a native filesystem
─ Such as ext3, ext4 or xfs
▪ Provides redundant storage for massive amounts of data
─ Using readily-available, industry-standard computers
 

GFS: Means “Google File System” in this context and should not be confused with Global File System, which
is another open source distributed filesystem also commonly abbreviated as GFS.
HDFS runs in “user space” which means it is not coupled to the operating system’s kernel. It is really just an
application that stores its data as files on the native filesystem (such as ext3) of the system on which it is
running. Thus, you cannot generally use it like a normal filesystem (that is, do things like type “ls” at a shell
prompt and see its contents or click File -> Open in your text editor to view files stored in HDFS). There are
ways to provide this kind of access though, such as FUSE and NFS proxy, that are mentioned a bit later.
“Readily-available, industry-standard” replaced “commodity” in the slide, which replaced “cheap,
unreliable.” It doesn’t mean you should buy second-hand computers at a garage sale, it means that you
could buy servers towards the lower-end of the vendor’s range without expensive redundancy features
like RAID or hot-swappable CPUs. In other words, you could buy something like a Dell C2100 for $6,000
rather than a Sun Fire Enterprise 25K that costs 100 times as much (or more). This advice applies mainly to
the “worker” nodes; however, the “master” nodes (Name Node, Secondary Name Node and Job Tracker)
should use high-quality, reliable hardware. Be sensitive to the fact that that some of our partners - Oracle
and NetApp, for example - deploy Hadoop on higher-end configurations at reasonably favorable price
points.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-9
HDFS Basic Concepts (2)
▪ HDFS performs best with a ‘modest’ number of large files
─ Millions, rather than billions, of files
─ Each file typically 100MB or more
▪ Files in HDFS are ‘write once’
─ No random writes to files are allowed
▪ HDFS is optimized for large, streaming reads of files
─ Rather than random reads

What “modest” means depends on the size of the cluster and its hardware specs (in particular, amount of
RAM in the Name Node), but is probably in the range of “hundreds of thousands” for smaller clusters up to
“tens of millions” for large clusters.
“No random writes to files” means that you cannot modify a file that already exists. The typical workaround
for this is to read the file, modify the contents in memory and write it back out again as a new file. Although
append support is in CDH3, you should ABSOLUTELY NOT use it – it’s buggy, and will lead to data loss.
Again, HDFS is designed to process the data in fairly large chunks, which offsets both the overhead of disk
latency and also the overhead of starting up a new Mapper to process it once it has been read in.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-10
How Files Are Stored
▪ Data files are split into blocks and distributed at load time
▪ Each block is replicated on multiple data nodes (default 3x)
▪ NameNode stores metadata

Data is split into blocks and distributed across multiple nodes in the cluster
Each block is typically 64MB or 128MB in size
The HDFS block size is HUGE (perhaps 4000 times larger) when compared to the block size of a native UNIX
filesystem. This is because HDFS is optimized for reading large amounts of data in order to minimize the
overall performance impact (latency) associated with getting the disk head positioned to read the first byte.
Cloudera typically recommends using a block size of 128MB, rather than the default of 64MB. One
benefit of doing so is to reduce the memory requirements for the NameNode, since each block in
HDFS requires about 150 bytes on the NameNode (http://www.mail-archive.com/hdfs-user@hadoop.apache.org/msg00815.html).
By having larger blocks, you will have fewer of them
and therefore use less memory overall.
Data is distributed across many machines at load time
Default is to replicate each block three times
Replication increases reliability (because there are multiple copies), availability (because those copies are
distributed to different machines) and performance (because more copies means more opportunities to
“bring the computation to the data”). Replication also means a corresponding decrease in the amount of
usable disk space, as we’ll discuss later in this chapter.
Different blocks from the same file will be stored on different machines
This provides for efficient MapReduce processing (see later)
Blocks are replicated across multiple machines, known as DataNodes
Default replication is three-fold: Meaning that each block exists on three different machines
A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are
located. Known as the metadata
The filesystem “namespace” is basically the overall file/directory structure of the filesystem, while
metadata is information about the files such as ownership and permissions.
Basic information about the NameNode can be found in TDG 3e, page 46 (TDG 2e, 44), and Lin & Dyer (page
30).
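As a rough, back-of-the-envelope illustration of these numbers (the 1 TB figure is made up for the example): storing 1 TB of data with a 128MB block size yields 1,048,576 MB ÷ 128 MB = 8,192 blocks. With the default three-fold replication there are 24,576 block replicas spread across the DataNodes, while the NameNode tracks the 8,192 distinct blocks, on the order of 8,192 × 150 bytes ≈ 1.2 MB of metadata. Halving the block size to 64MB would roughly double both the block count and the NameNode memory needed for the same data.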
Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-11
Example: Storing and Retrieving Files (1)

This example takes four slides. The example scenario is a system for storing dated log files, and an HDFS cluster
of 5 nodes.
Example:
Two log files for different dates (March 15 2012 and April 23 2013). They are currently on a local disk on a
single computer somewhere. What happens when we add them to HDFS? (pushing them from wherever
they were first collected)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-12
Example: Storing and Retrieving Files (2)

When the files are stored:


1. HDFS splits them into blocks. In this example, the first file has blocks 1, 2, and 3; the second (0423) has blocks 4 and 5.
2. Blocks are distributed to various nodes (3x replication.)
3. (this is the key point) The NameNode Metadata stores what files comprise what blocks, and what blocks
are on what node:
the metadata on the left maps the filename to its blocks.
The metadata on the right lists what nodes each block lives on. (Each block lives three places
because of 3x replication)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-13
Example: Storing and Retrieving Files (3)

When the files are accessed:


Now suppose a client needs one of the files, 0423.

1. (blue line from client to NN) It queries the NameNode: “I need file /log/042313.log, please”
2. (blue line from NN to client) The NameNode responds with a list of blocks that make up that file and (not
shown in the diagram), the list of nodes those blocks are on. (In this example it would respond “Block 4/
Nodes A, B and E” and “Block 5/Nodes C, E and D”. Key point: the NameNode does not hand back the
actual data in the file. NameNode does not store or deliver data…only metadata (information about
the data). So the interchange between client and NameNode is very fast and involves minimal network
traffic)
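Pulling the example together, a sketch of roughly what the NameNode metadata holds at this point (the name of the first log file is not spelled out on the slides, so it is illustrative here):
/log/031512.log → blocks 1, 2, 3
/log/042313.log → blocks 4, 5
Block 4 → Nodes A, B, E
Block 5 → Nodes C, E, D
The left-hand mapping (file to blocks) and the right-hand mapping (block to nodes) correspond to the two metadata tables described on the previous slide.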

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-14
Example: Storing and Retrieving Files (4)

3. (red lines from nodes to client) The client gets the actual file by requesting each of the blocks from the
nodes where they live. (For each block, the client has three choices for where to get that data from. It will
try the first one on the list…if that’s unavailable it will try the second, then the third.) Key point: data is
transferred directly between the node and the client, without involving the NameNode.
(Additional point if anyone asks – Hadoop will attempt to retrieve the block from the “closest” node, if
available, to improve performance.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-15
HDFS NameNode Availability
▪ The NameNode daemon must be running at all times
─ If the NameNode stops, the cluster becomes inaccessible
 
 
 
 
▪ High Availability mode (in CDH4 and later)
─ Two NameNodes: Active and Standby
 
 
 
▪ Classic mode
─ One NameNode
─ One “helper” node called
SecondaryNameNode
─ Bookkeeping, not backup

The NameNode must be running in order for the cluster to be usable. However, you will not lose any data if
the NameNode fails; you will simply be unable to continue processing until you replace it.
Your system administrator will choose whether to configure your HDFS cluster in HA mode or ‘classic’
mode.
In HA mode, there’s an Active NameNode, and a Standby NameNode that can “hot swap” for the Active
NameNode at any time. The Standby NameNode is kept in a ready state at all times. It isn’t entirely idle
while in standby mode, though. It also does “bookkeeping” for the cluster: It does a periodic merge
operation on the NameNode’s filesystem image and edit log files (to keep the edit log from growing too
large; see TDG 3e pages 46 and 340 (TDG 2e, 45 and 294) for more information).
In “classic” (non-HA) mode, the NameNode is a single point of failure. However, while this seems
frightening at first, it’s important to note that it’s a single point of failure for availability of data – not
reliability/consistency of data in a properly managed cluster (in which the NameNode’s files are written
to an NFS volume as we recommend). The fact that Hadoop has been used in production for several years
before HA was available should tell you that the NameNode single point of failure is not really much of a
problem in practice.
In “classic” mode, the “bookkeeping” functions are handled on a second non-essential node.
Secondary NameNode is an awful name, as it implies it’s a “hot swap” or provides some sort of failover for
the NameNode, which is not true. If the SecondaryNameNode goes down, it does not affect the functioning
of the cluster.
Note that the VM we use in class is in Classic mode, but does not have a SecondaryNameNode. This is fine
for running for limited amounts of time, but if they want to keep running the VM, they should periodically
restart so that checkpoints are correctly set.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-16
Options for Accessing HDFS
▪ FsShell Command line: hadoop fs
▪ Java API
▪ Ecosystem Projects
─ Flume
─ Collects data from network
sources (e.g., system logs)
─ Sqoop
─ Transfers data between HDFS
and RDBMS
─ Hue
─ Web-based interactive UI. Can
browse, upload, download,
and view files

Since HDFS is not a normal UNIX filesystem (i.e. one that is tied into the operating system’s kernel), it is not
available like a regular filesystem would be. In other words, you cannot click File -> Open in your text editor
and open a file that is stored in HDFS. Instead, you must copy it from HDFS to your local filesystem (e.g.
using the “hadoop fs -copyToLocal” command).
Typically, files are created on a local filesystem and must be moved into HDFS.
Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem.
Access to HDFS from the command line is achieved with the hadoop fs command.
Early in this class, we will be accessing HDFS using the command line tool, covered shortly.
Applications can read and write HDFS files directly via the Java API. Covered later in the course. In practice,
writing Java code to read/write data in HDFS using the HDFS API is fairly uncommon, but it is worth knowing
about and we’ll cover it later in addition to some other alternative approaches.
NOTE: However, as we’ll see in chapter 11, you can use FuseDFS to mount your HDFS filesystem so that
you would be able to access its files as you would a normal local filesystem, with the caveat that HDFS’s
restrictions and limitations still apply.
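For instructors who want a concrete preview of the Java API mentioned above, here is a minimal sketch (not taken from the course exercises; the path is illustrative) that prints an HDFS file to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the cluster configuration on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Illustrative path; any HDFS file the user can read would work
    Path path = new Path("/user/training/foo.txt");
    FSDataInputStream in = fs.open(path);
    try {
      // Copy the file's bytes to standard output, 4 KB at a time
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      in.close();
    }
  }
}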

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-17
hadoop fs Examples (1)
▪ Copy file foo.txt from local disk to the user’s directory in HDFS

$ hadoop fs -put foo.txt foo.txt

─ This will copy the file to /user/username/foo.txt


▪ Get a directory listing of the user’s home directory in HDFS

$ hadoop fs -ls

▪ Get a directory listing of the HDFS root directory

$ hadoop fs -ls /

The user’s home directory in HDFS (e.g. /user/training for the ‘training’ user) is the default target directory
when no directory is explicitly specified.
The filesystem addressing scheme follows UNIX conventions. Therefore, there is a single root (/) directory
and directory paths are separated using slash (/) characters, rather than backslash (\) characters. Those
who have little UNIX experience will find this a change from how Windows/DOS denotes file paths
(e.g. c:\foo\bar\baz.txt).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-18
hadoop fs Examples (2)
▪ Display the contents of the HDFS file /user/fred/bar.txt

$ hadoop fs -cat /user/fred/bar.txt

▪ Copy that file to the local disk, named as baz.txt

$ hadoop fs -get /user/fred/bar.txt baz.txt

▪ Create a directory called input under the user’s home directory

$ hadoop fs -mkdir input

 
 
Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

Those who do have experience with UNIX probably feel at home with most of the “hadoop fs” commands,
as they closely match their UNIX counterparts.
However, there are sometimes slight differences in how a UNIX command works and how the
corresponding “hadoop fs” command works. The -mkdir command is a good example of this. In UNIX, the
mkdir command doesn’t create nonexistent parent directories by default. For example, if you run
“mkdir /foo/bar/baz” and either the “/foo” or “/foo/bar” directories don’t exist, the mkdir command will fail.
Conversely, the “hadoop fs -mkdir /foo/bar/baz” command would succeed in this case (thereby simulating
the “-p” option to mkdir in UNIX).
For more information on each command, see the “File System Shell Guide” in the Hadoop
documentation (http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u3/
file_system_shell.html).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-19
hadoop fs Examples (3)
▪ Delete the directory input_old and all its contents

$ hadoop fs -rm -r input_old

You can think of the “-rm -r” command as “remove recursively.” Those with UNIX experience will recognize
this as equivalent to the “rm -r” command. Obviously, you need to be careful with this command as you
could accidentally delete all your data, even though HDFS file permissions will generally prevent you from
deleting someone else’s data.
NOTE: you may not necessarily lose all your data in case of such an accident, as Hadoop has a “trash”
directory for recently-deleted files (http://archive.cloudera.com/cdh/3/hadoop-0.20.2-
cdh3u3/hdfs_design.html#Space+Reclamation).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-20
Chapter Topics
Hadoop Basic Concepts and HDFS

▪ The Hadoop Project and Hadoop Components


▪ The Hadoop Distributed File System (HDFS)
▪ Hands-On Exercise: Using HDFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-21
Hands-on Exercise: Using HDFS
▪ In this Hands-On Exercise you will begin to get acquainted with the Hadoop
tools. You will manipulate files in HDFS, the Hadoop Distributed File System
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-22
Chapter Topics
Hadoop Basic Concepts and HDFS

▪ The Hadoop Project and Hadoop Components


▪ The Hadoop Distributed File System (HDFS)
▪ Hands-On Exercise: Using HDFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-23
Key Points
▪ The core components of Hadoop
─ Data storage: Hadoop Distributed File System (HDFS)
─ Data processing: MapReduce
▪ How HDFS works
─ Files are divided into blocks
─ Blocks are replicated across nodes
▪ Command line access to HDFS
─ FsShell: hadoop fs
─ Sub-commands: -get, -put, -ls, -cat, etc.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-24
Introduction to MapReduce
Chapter 4

Chapter Goal
This chapter needs a goal.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-2
Introduction to MapReduce
In this chapter, you will learn
▪ The concepts behind MapReduce
▪ How data flows through MapReduce stages
▪ Typical uses of Mappers
▪ Typical uses of Reducers

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-3
Chapter Topics
Introduction to MapReduce

▪ MapReduce Overview
▪ Example: WordCount
▪ Mappers
▪ Reducers
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-4
What Is MapReduce?
▪ MapReduce is a method for distributing a task across multiple nodes
▪ Each node processes data stored on that node
─ Where possible
▪ Consists of two phases:
─ Map
─ Reduce

Hadoop is a large-scale data processing framework. This implies two things: you have a way to store large
amounts of data (HDFS: already discussed) and you have a system for processing it (MapReduce: discussion
begins now).
Although MapReduce can run on data stored in filesystems other than HDFS (e.g. on data stored in a local
filesystem, commonly done during development), it works best with HDFS as they’re optimized to work
together. MapReduce “brings the computation to the data” (data locality), in contrast to how other large-
scale data processing systems were described in chapter 1.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-5
Features of MapReduce
▪ Automatic parallelization and distribution
▪ Fault-tolerance
▪ A clean abstraction for programmers
─ MapReduce programs are usually written in Java
─ Can be written in any language using Hadoop Streaming
─ All of Hadoop is written in Java
▪ MapReduce abstracts all the ‘housekeeping’ away from the developer
─ Developer can simply concentrate on writing the Map and Reduce functions

These are the features you get “for free” by using Hadoop. You do not need to write code to handle
parallelization and distribution of jobs, detecting and handling failure, or even monitoring jobs. In fact, you
won’t even have to write code that reads your input data from files or writes your results to output files.
Hadoop does all of this for you, freeing you up to concentrate on the business logic in your Map and Reduce
functions. And as we’ll see later, even those functions are small and relatively simple to write.
NOTE: Hadoop Streaming allows you to write MapReduce in any language whose programs can be run from
a UNIX shell and which also supports reading from standard input and writing to standard output. These need
not be scripting languages – you can write such code in C or FORTRAN if you wish.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-6
Key MapReduce Stages
 
▪ The Mapper
─ Each Map task (typically) operates on a single HDFS block
─ Map tasks (usually) run on the node where the block is
stored
 
 
▪ Shuffle and Sort
─ Sorts and consolidates intermediate data from all mappers
─ Happens after all Map tasks are complete and before
Reduce tasks start
 
 
▪ The Reducer
─ Operates on shuffled/sorted intermediate data (Map task
output)
─ Produces final output

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-7
The MapReduce Flow

The top part (Input File) is something you provide (i.e. it’s your input data that you loaded into HDFS).
The InputFormat splits the file up into multiple input splits…that is, sections of the file to go to different
mappers. A file split into three splits as shown here will result in three Mappers running on (up to) three
different nodes. (You can add detail if the class seems ready for it: The default input format splits the file
into splits corresponding to the HDFS blocks that comprise the file. This approach makes it easy for Hadoop
to figure out which node to run the Map tasks on: one of the ones the data is stored on. Handy! There are
other approaches in which splits and blocks don’t line up; we will discuss those more later in the class.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-8
The MapReduce Flow

The other thing the InputFormat does is decide how to parse the data in a split into “records” for the
mappers to process. It does this by creating a “record reader” for each split. Hadoop provides a variety of
InputFormats/RecordReaders to handle reading data in many formats (such as lines of text, tab-delimited
data, etc.), so you don’t need to write your own InputFormat most of the time. But if Hadoop doesn’t
provide support for a file format you need to support, you can write your own InputFormat. We’ll discuss
how to do so later in class.
The Mapper takes the divided up input and maps each “record” to key/value pairs. (covered in detail
shortly)
The key-value pairs emitted from the Mapper are collectively called “intermediate data” and are written
to the local filesystem of the node running that Mapper.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-9
The MapReduce Flow

The Partitioner is the class that determines which Reducer a given key should go to. The intermediate data
is also sorted, grouped and merged so that a given key and all values for that key are passed to the same
Reducer.
You do not generally need to write a Partitioner nor do you need to write code to sort or group the
intermediate data, although you can do so and we’ll look at why and how you can do this later in the
course.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-10
The MapReduce Flow

You will supply the Reducer code, however (the use of “supply” rather than “write” here is intentional, as
Hadoop provides reusable classes in the org.apache.hadoop.mapreduce.lib package).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-11
The MapReduce Flow

Just like the InputFormat creates Record Readers to read input data and present it as a series of key-value
pairs to the Mapper, the OutputFormat handles the other side. It creates a Record Writer to take a series of
key-value pairs and to write them out in a given format to an output file (usually an HDFS file).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-12
The MapReduce Flow

In this chapter we are focusing on the Mapper and Reducer parts. The others are optional – Hadoop includes
pre-configured components for most common situations.
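If it helps to make the flow concrete, the following is a minimal driver sketch showing where each piece plugs in. It assumes the WordMapper and SumReducer classes discussed in this chapter, and the input/output paths are illustrative; the actual exercise code may differ in details.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(WordCountDriver.class);
    job.setJobName("Word Count");

    // Input and output paths are illustrative
    FileInputFormat.setInputPaths(job, new Path("shakespeare"));
    FileOutputFormat.setOutputPath(job, new Path("wordcounts"));

    // The Mapper and Reducer discussed in this chapter
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    // Types of the final (Reducer) output key/value pairs
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // InputFormat, OutputFormat, and Partitioner are left at their defaults
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}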

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-13
Chapter Topics
Introduction to MapReduce

▪ MapReduce Overview
▪ Example: WordCount
▪ Mappers
▪ Reducers
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-14
Example: Word Count
 
 
 

    Result

   
Input Data   aardvark 1
cat 1
mat 1
the cat sat on the mat on 2
the aardvark sat on the sofa sat 2
sofa 1
the 4

To better understand how MapReduce works, let’s consider a very simple example.
Given a set of files containing text, we want to count the occurrences of each word.
The input is an HDFS file. This example is a single, tiny file that would take up just a single block, but a real
world example is likely to be very large and be spread across numerous blocks. It can also be a set of files,
not just a single file.
Processing the whole file (or set of files) would be a “job”. Processing each individual section is a “task”.
Let’s take a look at each phase in order:
• mapping
• shuffling
• reducing

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-15
Example: The WordCount Mapper (1)

The input to a MapReduce job is typically one or more HDFS files. Remember that when a file is stored
in HDFS it is broken into blocks stored on different nodes. In this example, we are showing a single
MapReduce ‘task’ operating on a single block.
The first step is that the data in the block is processed by an Input Format component. In this example, we
are using a File Input Format which breaks up the file by lines. Each line is mapped into a key/value pair,
where the key is the byte-offset within the file and the value is the text on that line. Different types of input
formats are available; this will be discussed later.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-16
Example: The WordCount Mapper (2)

The input to a MapReduce job is typically one or more HDFS files. Remember that when a file is stored
in HDFS it is broken into blocks stored on different nodes. In this example, we are showing a single
MapReduce ‘task’ operating on a single block.
The first step is that the data in the block is processed by an Input Format component. In this example, we
are using a File Input Format which breaks up the file by lines. Each line is mapped into a key/value pair,
where the key is the byte-offset within the file and the value is the text on that line. Different types of input
formats are available; this will be discussed later.
The mapper calls its map function for each line, one at a time.
The mapper in this example goes through each line it is given (e.g. “The cat sat on the mat”) and outputs
a set of key/value pairs: the word and the number of occurrences. (In this example, because we are
just counting, the “value” is always 1. We will see an example later where we output something more
interesting.)
In this simple example, a single map task runs because it’s just a single block of data, but in a real world
application, multiple map tasks would run simultaneously, each processing a separate block. See next slide.
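As a preview of the Java code students will write in a later chapter, a WordCount mapper along these lines (a sketch; the exercise solution may differ in details) might look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The input key (byte offset) is ignored; split the line into words
    for (String w : value.toString().split("\\W+")) {
      if (w.length() > 0) {
        word.set(w);
        context.write(word, ONE);  // emit (word, 1)
      }
    }
  }
}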

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-17
Mappers Run in Parallel
▪ Hadoop runs Map tasks on the node storing the data (when possible)
─ Minimizes network traffic
─ Many Mappers can run in parallel

Remember that one of the key principles of Hadoop is that data is distributed when it is loaded into HDFS,
not at runtime.
Our “aardvark” example was tiny, but imagine a much larger data file split into multiple blocks. When we run a
Hadoop job on the whole file, Hadoop will run one Map task for each block.
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to
avoid network traffic. (This doesn’t always happen because all nodes storing a particular block may be
overloaded at a particular time.)
Multiple Mappers run in parallel, each processing a portion of the input data. The sort & shuffle step (on
the next slides) collects, sorts and consolidates all the output from all the Map tasks.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-18
Example: WordCount Shuffle and Sort

Each instance of the mapper outputs a set of key/value pairs for each line of data it processes. The initial
output is in the order it was in the file. The next step is that all the output from the mapper is sorted,
combined, and stored into a local file on the node where the mapper ran. This is generally referred to as the
“intermediate data”.
There may be hundreds or thousands of such sets produced by mappers running on dozens or hundreds
of data nodes. Before running the reducers, Hadoop automatically merges and divides up this data into
“partitions”, each sorted by key (which in this example is the actual word being counted).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-19
Example: SumReducer (1)

Each reducer works on a set of keys in the shuffled, sorted intermediate data, which you’ll recall is sorted
by key. (Note that each individual reducer’s input is sorted by key, but there is no sorting between reducers;
that is, Reducer 1’s data isn’t all “less than” Reducer 2’s data. This is the default behavior.) All the data for
a single key will always go to the same reducer.
A partitioner divides up the set of sorted keys, according to the number of available reducers (in this
example, 3). The Reducer tasks are independent, and may run on separate nodes in parallel, each
processing their set of data.
The final output of each reducer task is stored in a file on HDFS. The set of all the files together comprise
the final output for the job.
(This example is for Java. Streaming Reducers work slightly differently, we’ll cover that later.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-20
Example: SumReducer (2)
 
 

Let’s look at one specific reducer as an example. In our simple example, that reducer has been passed 4
keys, each with a set of values. For each key, it will call the reduce() method once.
For each key/value pair (e.g. “on”,(1,1)) the reduce method processes the data by summing the values
associated with the key, and outputting another key/value pair (e.g. “on”,2).
Note that SumReducer is quite generic…for any key, it simply adds up an associated list of integers, and
outputs the key and the sum. In WordCount, the “key” is a word we are counting, but this same reducer
could be used to process data for many different applications, without different code: the number of each
type of message in a log file (error, info, etc. would be the keys); the number of times a particular product
was returned for repair; the number of times a particular sequence of genes appears in a sample; etc.
This sort of basic statistical function is very common in Hadoop, and many complex questions can be
answered through a series of easily distributed basic functions. (Discussed more on the next slide.)
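For reference, a Java sketch of such a SumReducer (matching the pseudo-code shown later in this chapter; details may differ from the exercise solution) might be:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // Add up all the counts emitted by the Mappers for this key
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));  // e.g. ("on", 2)
  }
}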

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-21
Why Do We Care About Counting Words?
▪ Word count is challenging over massive amounts of data
─ Using a single compute node would be too time-consuming
─ Number of unique words can easily exceed available memory
─ Would need to store to disk
▪ Statistics are simple aggregate functions
─ Distributive in nature
─ e.g., max, min, sum, count
▪ MapReduce breaks complex tasks down into smaller elements which can be
executed in parallel
▪ Many common tasks are very similar to word count
─ e.g., log file analysis

Why is the simple Word Count example relevant?


Because it typifies the characteristics of a Hadoopable big data problem.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-22
Another Example: Analyzing Log Data

Let’s consider a more realistic example that does something very similar to word count.
Consider a hypothetical file-request log. Each line contains a lot of information, but in this example we care
about the file type (.jpg, .gif, etc.) and how long the request took to process.
We can use MapReduce to process a set of such files to determine what the average processing time for
each file type is.
We name our mapper FileTypeMapper – it emits a file type key and the number of milliseconds to process
the request for each line in the file.
When all the map tasks are complete, Hadoop shuffles and sorts the mapper output, so that each file type
key is associated with a list of time values.
This is the input to the AverageReducer, which calculates the average value for each file type key.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-23
Chapter Topics
Introduction to MapReduce

▪ MapReduce Overview
▪ Example: WordCount
▪ Mappers
▪ Reducers
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-24
MapReduce: The Mapper (1)
▪ The Mapper
─ Input: key/value pair
─ Output: A list of zero or more key value pairs

map(in_key, in_value) →
(inter_key, inter_value) list

  intermediate key 1 value 1

intermediate key 2 value 2


input input
key value
intermediate key 3 value 3

… …

In case it is not clear at this point, the DataNode and TaskTracker processes run on the same machine (this
is how Hadoop achieves data locality).
I usually mention that it’s common for a Mapper to do one of three things: parsing, filtering or
transformation. We’ll see examples of each coming up.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-25
MapReduce: The Mapper (2)
▪ The Mapper may use or completely ignore the input key
─ For example, a standard pattern is to read one line of a file at a time
─ The key is the byte offset into the file at which the line starts
─ The value is the contents of the line itself
─ Typically the key is considered irrelevant
▪ If the Mapper writes anything out, the output must be in the form of key/
value pairs

  the 1
 
  aardvark 1

sat 1
23 the aardvark sat on the sofa
on 1

the 1

sofa 1

The case in which an input key is likely to be relevant is when you are chaining Hadoop jobs together such
that the output of one job is the input to the next job. The jobs further down in the chain are likely to be
interested in the key produced by jobs further up the chain.
The WordMapper is an example of a mapper that ignores the input key.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-26
Example Mapper: Upper Case Mapper
▪ Turn input into upper case (pseudo-code):

let map(k, v) =
emit(k.toUpper(), v.toUpper())

 
 

bugaboo an object of BUGABOO AN OBJECT OF


fear or alarm FEAR OR ALARM

 
mahout an elephant MAHOUT AN ELEPHANT
driver DRIVER

 
bumbershoot umbrella BUMBERSHOOT UMBRELLA

This is an example of a Mapper that transforms the data; it takes lowercase letters and transforms them to
uppercase. And although this is a simple example, perhaps you could see that it would be possible to use
this concept to do something more useful, like transform a product ID into a product name or turn an IP
address into a hostname or geographic region.
Example: dictionary definitions (the key is the word being defined; the value is the definition.)
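The same transformation expressed in Java might look like this (a sketch assuming Text keys and values, for example as produced by KeyValueTextInputFormat):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  public void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the same key/value pair, converted to upper case
    context.write(new Text(key.toString().toUpperCase()),
                  new Text(value.toString().toUpperCase()));
  }
}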

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-27
Example Mapper: ‘Explode’ Mapper
▪ Output each input character separately (pseudo-code):

let map(k, v) =
foreach char c in v:
emit (k, c)

  pi 3

pi .
pi 3.14
pi 1

pi 4

 
  145 k

145 a
145 kale
145 l

145 e

This is another transformation Mapper, though this one demonstrates that, given a single key/value pair as
input, you can generate any number of key/value pairs as output.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-28
Example Mapper: ‘Filter’ Mapper
▪ Only output key/value pairs where the input value is a prime number (pseudo-
code):

let map(k, v) =
if (isPrime(v)) then emit(k, v)

48 7 48 7

pi 3.14

5 12

foo 13 foo 13

This is an example of a filter Mapper. This simple example shows how we could weed out any non-prime
numbers from the input. A variation on this might be to filter out text which does (or does not) match some
pattern – this would be like the UNIX grep program, but unlike grep, it could let you operate over terabytes
of data spanning multiple machines.
Yet another variation might be to take every Nth record from input (where N might be 1,000,000, for
example). You could use this to produce a sample of input data, so that instead of operating on a terabyte
of data, you’re operating on a megabyte of data. This would let you test things more quickly while still using
“real world” data.
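A Java sketch of this filter (assuming Text keys and IntWritable values, e.g. read from a SequenceFile; the isPrime helper is purely illustrative) might be:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PrimeFilterMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
  @Override
  public void map(Text key, IntWritable value, Context context)
      throws IOException, InterruptedException {
    // Only pass through pairs whose value is a prime number
    if (isPrime(value.get())) {
      context.write(key, value);
    }
  }

  // Simple trial-division primality test, sufficient for illustration
  private boolean isPrime(int n) {
    if (n < 2) return false;
    for (int i = 2; i * i <= n; i++) {
      if (n % i == 0) return false;
    }
    return true;
  }
}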

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-29
Example Mapper: Changing Keyspaces
▪ The key output by the Mapper does not need to be identical to the input key
▪ Example: output the word length as the key (pseudo-code):

let map(k, v) =
emit(v.length(), v)

001 hadoop 6 hadoop

002 aim 3 aim

003 ridiculous 10 ridiculous

This example demonstrates that the type used in the input key doesn’t have to be the same type as used in
the output key (the type of the input value likewise need not match the type of the output value).
In this example, we’re given text as our input key, but we use the length of the value as the output key.
This sort of thing is valuable when you want to examine the distribution of data; for example, to create a
histogram.
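In Java, a sketch of this mapper could be as follows; note that the output key type (IntWritable) differs from the input key type (Text), which is exactly the point of the slide:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordLengthMapper extends Mapper<Text, Text, IntWritable, Text> {
  @Override
  public void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // The output key is the length of the value, not the original key
    context.write(new IntWritable(value.toString().length()), value);
  }
}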

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-30
Example Mapper: Identity Mapper
▪ Emit the key,value pair (pseudo-code):

let map(k, v) =
emit(k,v)

bugaboo an object of bugaboo an object of


fear or alarm fear or alarm

 
mahout an elephant mahout an elephant
driver driver

 
bumbershoot umbrella bumbershoot umbrella

The identity mapper may seem at first glance to be trivial and therefore useless. Why would we need a
program that outputs what we input? This is actually very common, because it’s a way to get data into a
Hadoop job so that the other parts of Hadoop can operate on it: the key/value pairs will be sorted, shuffled,
partitioned, merged and reduced. This would be a straightforward way to filter out duplicates, for example;
all records with the same key will be consolidated and passed to a single reduce, which can then detect
duplicates, and possibly merge, filter or tag them.
The identity mapper is the default mapper. As covered in a later chapter, if you create a job and don’t
specify a mapper, the identity mapper will be used.
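In the Java API you normally do not write this class at all, because the base Mapper class already passes each input pair through unchanged. Written out explicitly (with assumed Text types), it would look roughly like this:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IdentityStyleMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  public void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pass every input pair through unchanged
    context.write(key, value);
  }
}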

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-31
Chapter Topics
Introduction to MapReduce

▪ MapReduce Overview
▪ Example: WordCount
▪ Mappers
▪ Reducers
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-32
Shuffle and Sort
▪ After the Map phase is over, all intermediate values for a given intermediate
key are grouped together
▪ Each key and value list is passed to a Reducer
─ All values for a particular intermediate key go to the same Reducer
─ The intermediate keys/value lists are passed in sorted key order

Now that we’ve seen some examples of what Mappers typically do, let’s look at its counterpart.
Although the keys are passed to a Reducer in sorted order, the values associated with those keys are in no
particular order.
It’s very common for a Reducer to run some sort of “aggregation” operation on the results produced by the
Mappers; for example, counting or averaging those results.
This example is drawn from the earlier example of milliseconds of load time for each file type.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-33
The Reducer
▪ The Reducer outputs zero or more final key/value pairs
─ In practice, usually emits a single key/value pair for each input key
─ These are written to HDFS

reduce(inter_key, [v1, v2, …]) →


(result_key, result_value)

gif 1231 gif 2614

3997

 
html 344 html 1498

891

788

Now that we’ve seen some examples of what Mappers typically do, let’s look at its counterpart.
Although the keys are passed to a Reducer in sorted order, the values associated with those keys are in no
particular order.
It’s very common for a Reducer to run some sort of “aggregation” operation on the results produced by the
Mappers; for example, counting or averaging those results.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-34
Example Reducer: Sum Reducer
▪ Add up all the values associated with each intermediate key (pseudo-code):

let reduce(k, vals) =


sum = 0
foreach int i in vals:
sum += i
emit(k, sum)

the 1 the 4

SKU0021 34 SKU0021 61

19

Here’s an example of an aggregate operation: this Reducer receives a list of integer values associated with a
given key and simply adds them all up to produce a final result.
This is the reducer we used in the WordCount example.
The first example shown (the, [1,1,1,1]) is how we used it in WordCount.
But the second example shows how the exact same code can be used in other use cases, such as totaling
the number of products sold.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-35
Example Reducer: Average Reducer
▪ Find the mean of all the values associated with each intermediate key
(pseudo-code):

let reduce(k, vals) =


sum = 0; counter = 0;
foreach int i in vals:
sum += i; counter += 1;
emit(k, sum/counter)

the 1 the 1

SKU0021 34 SKU0021 20.33

19

This is from the web log example.


More importantly, it shows how the same mapper output might be analyzed in different ways with
different reducer algorithms (sum in the last example, average in this one).
Both are very typical in that they are common statistical functions.
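A Java sketch of such an AverageReducer (assuming IntWritable inputs and a DoubleWritable result; details may differ from the exercise code) might be:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
      sum += value.get();
      count++;
    }
    // Emit the mean of all values for this key
    context.write(key, new DoubleWritable((double) sum / count));
  }
}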

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-36
Example Reducer: Identity Reducer
▪ The Identity Reducer is very common (pseudo-code):

let reduce(k, vals) =


foreach v in vals:
emit(k, v)

bow a knot with two loops bow a knot with two loops
and two loose ends and two loose ends

a weapon for shooting bow a weapon for shooting


arrows arrows

a bending of the head bow a bending of the head


or body in respect or body in respect

 
28 2 28 2

2 28 2

7 28 7

This illustrates the default Reducer, which makes use of what’s known in functional programming as “the
identity function.” This means that whatever was passed in as input is emitted back out as output again,
unchanged. Why might this be useful? You could use it to group words. For example, the word “foo”
might be found thousands of times across hundreds of input documents fed to the Mapper. But because a
Reducer will be passed a key and all the values for that key, each occurrence of “foo” will be grouped together
with all its associated values and passed to a single Reducer, and therefore written to a single output file.
This can be especially useful as input to further processing.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-37
Chapter Topics
Introduction to MapReduce

▪ MapReduce Overview
▪ Example: WordCount
▪ Mappers
▪ Reducers
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-38
Key Points
▪ A MapReduce program has two major developer-created components: a
Mapper and a Reducer
▪ Mappers map input data to intermediate key/value pairs
─ Often parse, filter, or transform the data
▪ Reducers process Mapper output into final key/value pairs
─ Often aggregate data using statistical functions

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-39
Hadoop Clusters and the Hadoop
Ecosystem
Chapter 5

Chapter Goal
This chapter needs a goal.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-2
Hadoop Clusters and the Hadoop Ecosystem
In this chapter, you will learn
▪ The components of a Hadoop cluster
▪ How Hadoop jobs and tasks run on a cluster
▪ How a job’s data flows in a Hadoop cluster
▪ What other Hadoop Ecosystem projects exist

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-3
Chapter Topics
Hadoop Clusters and the Hadoop
Ecosystem

▪ Hadoop Cluster Overview


▪ Hadoop Jobs and Tasks
▪ Hands-On Exercise: Running a MapReduce Job
▪ Other Hadoop Ecosystem Components
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-4
Installing A Hadoop Cluster (1)
▪ Cluster installation is usually performed by a system administrator
─ Out of the scope of this course
─ Covered in Cloudera Administrator Training for Apache Hadoop
▪ Developers should understand how the components of a Hadoop cluster work
together
▪ Developers typically use Hadoop in pseudo-distributed mode
─ A single-machine “cluster”
─ All Hadoop daemons run on the same machine
─ Useful for testing

Installation and system administration are outside the scope of this course.


Cloudera offers a training course for System Administrators specifically aimed at those responsible for
commissioning and maintaining Hadoop clusters.
At this point, it’s a good idea to mention the date and location of the next Admin class, as well as when the
Admin class will next run in the location where you’re currently teaching.
You should mention that the virtual machine you use in class is set up in pseudo-distributed mode
(however, Eclipse is set up in local job runner mode).
Developer machines are typically configured in pseudo-distributed mode. This effectively creates a single-
machine cluster. All five Hadoop daemons run on the same machine, which is useful for testing code before
it is deployed to the real cluster.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-5
Installing A Hadoop Cluster (2)
▪ Difficult
─ Download, install, and integrate individual Hadoop
components directly from Apache
 
▪ Easier: CDH
─ Cloudera’s Distribution including Apache Hadoop
─ Vanilla Hadoop plus many patches, backports, bug fixes
─ Includes many other components from the Hadoop ecosystem
 
▪ Easiest: Cloudera Manager
─ Wizard-based UI to install, configure and manage a Hadoop
cluster
─ Included with Cloudera Standard (free) or Cloudera Enterprise

CDH not only gives you a production-ready version of Apache Hadoop, it also gives you the important tools
from the Hadoop ecosystem we’ll be discussing later in class (Hive, Pig, Sqoop, HBase, Flume, Oozie and
others). It’s both free and open source, so it’s definitely the easiest way to use Hadoop and its related tools.
Easiest way to download and install Hadoop, either for a full cluster or in pseudo-distributed mode, is by
using Cloudera’s Distribution, including Apache Hadoop (CDH).
Supplied as a Debian package (for Linux distributions such as Ubuntu), an RPM (for CentOS/RedHat
Enterprise Linux), and as a tarball
Full documentation available at http://cloudera.com/

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-6
Hadoop Cluster Terminology
▪ A Hadoop cluster is a group of computers working together
─ Usually runs HDFS and MapReduce
▪ A node is an individual computer in the cluster
─ Master nodes manage distribution of work and data to worker nodes
▪ A daemon is a program running on a node
─ Each performs different functions in the cluster
 

A cluster is a group of computers working together – the work is distributed across the cluster. As covered
in Basic Concepts, core Hadoop distributes two kinds of things – data and processing – and therefore a
typical Hadoop cluster has infrastructure for HDFS (to distribute the data) and MapReduce (to distribute the
processing).
It is technically possible to have a Hadoop cluster with just one or the other – but this is unusual and
outside the scope of this class.
NOTES: Daemon is pronounced just like the English word “demon” (the a-e ligature is also found in the
word ‘encyclopaedia’), though the pronunciation ‘DAY-mun’ is also common. Daemon basically means
“server process” (or more technically, a long-running process which is detached from any specific terminal).
Daemons of various types are commonly found on UNIX systems (for example: Apache is a daemon
for serving Web pages, Sendmail is daemon for sending e-mail, and so on). Daemons are typically run
automatically when the machine starts up via “init scripts.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-7
MapReduce v1 and v2 (1)
▪ MapReduce v1 (“MRv1” or “Classic MapReduce”)
─ Uses a JobTracker/TaskTracker architecture
─ One JobTracker per cluster – limits cluster size to about 4000 nodes
─ Slots on worker nodes designated for Map or Reduce tasks
▪ MapReduce v2 (“MRv2”)
─ Built on top of YARN (Yet Another Resource Negotiator)
─ Uses ResourceManager/NodeManager architecture
─ Increases scalability of cluster
─ Node resources can be used for any type of task
─ Improves cluster utilization
─ Support for non-MR jobs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-8
MapReduce v1 and v2 (2)
▪ CDH4 includes MRv2 as a technology
preview
▪ CDH5 MRv2 is production-ready
─ Cloudera recommends using MRv2 in
CDH5
▪ CDH5 supports both MRv1 and MRv2
─ Running both on the same cluster is not
supported
▪ Migrating from MRv1 to MRv2
─ CDH offers full binary compatibility,
source compatibility in almost all cases

Key takeaway on this slide: Everything in THIS COURSE applies equally well to MR1 and MR2. The difference
between the two is invisible to developers. Only system administrators need to know about MR1 v. MR2.
MR2 and YARN were added to Hadoop in 2.0, but were not considered production-ready. CDH 4 included
YARN/MR2 as a “technology preview” but Cloudera discouraged customers from using them in production.
MR2/YARN in Hadoop 2 GA (which is Hadoop v. 2.2, released Oct 2013) are now considered production-
ready. This production-ready version is included in CDH 5 (currently in Beta as of this writing, Dec 2013).
So CDH4 customers can start exploring MR2 now, but should plan to update to CDH5 before going to
production.
MR1 continues to be supported in both CDH 4 and CDH 5. However, as of CDH 5, Cloudera officially
recommends using MR2.
Complete binary compatibility – programs compiled for MRv1 will run without recompilation
Source compatibility for almost all programs. Small number of exceptions noted here:
Migrating to MapReduce2 on YARN for Users
For more information if students ask:
This is true specifically for CDH, not for Hadoop in general.
The situation is different from users of vanilla Hadoop:
“Old” API (org.apache.hadoop.mapred)
   Binary Compatibility – Most programs written for MRv1 using the old API can run on MRv2 without re-
compilation
“New” API (org.apache.hadoop.mapreduce)
   Not binary compatible – recompilation required
   Full source compatibility – existing programs will work without re-writing but will need to be recompiled
   “We made an investment in CDH4 to swallow the binary incompatible changes. So our CDH4 MR1 is
different than upstream MR1 - it already includes the API changes that make upstream MR2 incompatible
with MR1. People upgrading to CDH4 from CDH3 had to recompile, but they don’t now.”
Migrating to MapReduce2 on YARN for Users
Migrating to MapReduce2 on YARN for Operators
Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-9
Hadoop Daemons: MapReduce v1
▪ MRv1 daemons
─ JobTracker – one per cluster
─ Manages MapReduce jobs, distributes individual tasks to TaskTrackers
─ TaskTracker – one per worker node
─ Starts and monitors individual Map and Reduce tasks
 
 
 
 
 

Each daemon runs in its own Java Virtual Machine (JVM)


Also point out that there is a separation between “daemon” and “physical machine”; a physical machine
can run multiple daemons, as is obviously the case when running in pseudo-distributed mode since all five
daemons run on one machine. But in real-world clusters, that doesn’t happen. In small clusters, the JT and NN
may run on the same node.
“High Availability” (HA) is an optional mode in which you can have a “warm standby” in case the active
master node goes down. HA is available for the JobTracker, Resource Manager (MR2) and NameNode
(HDFS). Diagrams in this course assume that you are running in HA mode.
NOTES: Daemon is pronounced just like the English word “demon” (the a-e ligature is also found in the
word ‘encyclopaedia’), though the pronunciation ‘DAY-mun’ is also common. Daemon basically means
“server process” (or more technically, a long-running process which is detached from any specific terminal).
Daemons of various types are commonly found on UNIX systems (for example: Apache is a daemon
for serving Web pages, Sendmail is daemon for sending e-mail, and so on). Daemons are typically run
automatically when the machine starts up via “init scripts.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-10
Hadoop Daemons: MapReduce v2
▪ MRv2 daemons
─ ResourceManager – one per cluster
─ Starts ApplicationMasters, allocates resources on worker nodes
─ ApplicationMaster – one per job
─ Requests resources, manages individual Map and Reduce tasks
─ NodeManager – one per worker node
─ Manages resources on individual worker nodes
─ JobHistory – one per cluster
─ Archives jobs’ metrics and metadata

Note that you may run the JHS on the same node as the RM.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-11
Hadoop Daemons: HDFS
▪ HDFS daemons
─ NameNode – holds the metadata for HDFS
─ Typically two on a production cluster: one active, one standby
─ DataNode – holds the actual HDFS data
─ One per worker node
 
 
 

Each daemon runs in its own Java Virtual Machine (JVM)


NOTES: Daemon is pronounced just like the English word “demon” (the a-e ligature is also found in the
word ‘encyclopaedia’), though the pronunciation ‘DAY-mun’ is also common. Daemon basically means
“server process” (or more technically, a long-running process which is detached from any specific terminal).
Daemons of various types are commonly found on UNIX systems (for example: Apache is a daemon
for serving Web pages, Sendmail is daemon for sending e-mail, and so on). Daemons are typically run
automatically when the machine starts up via “init scripts.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-12
Chapter Topics
Hadoop Clusters and the Hadoop
Ecosystem

▪ Hadoop Cluster Overview


▪ Hadoop Jobs and Tasks
▪ Hands-On Exercise: Running a MapReduce Job
▪ Other Hadoop Ecosystem Components
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-13
MapReduce Terminology
▪ A job is a ‘full program’
─ A complete execution of Mappers and Reducers over a dataset
▪ A task is the execution of a single Mapper or Reducer over a slice of data
▪ A task attempt is a particular instance of an attempt to execute a task
─ There will be at least as many task attempts as there are tasks
─ If a task attempt fails, another will be started by the JobTracker or
ApplicationMaster
─ Speculative execution (covered later) can also result in more task attempts
than completed tasks

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-14
Submitting A Job

When a client submits a Hadoop MapReduce job, the information about the job is packaged into an XML
file.
This file, along with the .jar file containing the actual program code, is handed to the JobTracker (MRv1)
or ResourceManager (MRv2).
Typically, this job submission is done via a short program, often called a “driver”, which you will write that
configures the job and then invokes a method that submits it to the cluster. We’ll see – and write – several
examples of this later in class, but for now we’ll just go over the concept.
What happens next depends on whether you are using MRv1 or MRv2. Let’s look at both in the next few
slides.
NOTE: This is covered in TDG in great detail in Chapter 6 (“How MapReduce Works”), starting on TDG 3e
page 189 (TDG 2e, 167).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-15
A MapReduce v1 Cluster

This is a basic MR1 cluster.


As discussed in Basic Concepts
One JobTracker (two if running in High Availability mode – one primary, one standby)
One NameNode, plus either a standby NN if running in HA, or a Secondary (Checkpoint) NN if running in
non-HA mode
Any number of worker nodes, each running a TaskTracker daemon (for MR) and a DataNode daemon (for
HDFS)
The master daemons will run on their own nodes (machines) for large clusters.
On very small clusters, the NameNode, JobTracker and Secondary NameNode daemons can all reside on a
single machine.
It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes
Each “worker” node will almost always run the DataNode and TaskTracker daemons. This
should make sense when you recall that Hadoop “brings the computation to the data” and does processing
on the same machines which store the data.
NOTE: In this context, JVM is synonymous with “a single Java process.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-16
Running a Job on a MapReduce v1 Cluster (1)

Now let’s show how a sample MR job would work. First, we “put” (upload) some data we want to work on
into HDFS as discussed in previous chapters. In this example, it’s a single file called “mydata” comprising
two blocks on node 1 and node 2.
(This is review – it was covered in Basic Concepts. Also students did this with shakespeare and the web
server access log in the exercises.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-17
Running a Job on a MapReduce v1 Cluster (2)

Next we submit a job on the client, such as WordCount to process the mydata file. (As students did in the
exercises with shakespeare.)
The job is submitted to the Job Tracker which schedules the job’s tasks: map tasks (first) then the reduce
tasks (when the map tasks are complete).
The JobTracker communicates with the TaskTracker on the individual worker nodes to keep track of the
progress of the task.
When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
TaskTracker nodes can be configured to run multiple tasks at the same time if the node has enough
processing power and memory
When all the tasks are completed, the job is complete and the client is notified.
(If students seem interested in the difference between MR1 and MR2, point out that in MR1 each worker
node is configured with a fixed number of map “slots” and a fixed number of reduce “slots”. So reducer
slots may be available, but unused if no reduce tasks are running – which means the resources reserved for
those slots are not utilized at that moment.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-18
A MapReduce v2 Cluster

Here’s a basic MR2 cluster.


It looks similar at first…
Instead of a JobTracker we have a ResourceManager, and instead of TaskTrackers on each node we have
NodeManagers.
We also have a Job History server. This is necessary because unlike the JobTracker, the ResourceManager
only keeps information about jobs around for a little while (until the configurable
“retirement” time elapses). MapReduce job history is then stored by the Job History server. This is not an
important note in this slide sequence, but is included so that students will be aware of all daemons that will
be running.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-19
Running a Job on a MapReduce v2 Cluster (1)

The command to start a job on MR2 is the same as MR1. In fact, developers and users won’t see much if
any difference at all.
However, instead of the RM figuring out what individual tasks are needed, and starting and tracking each
one, it starts an Application Master for the job.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-20
Running a Job on a MapReduce v2 Cluster (2)

Once started, the App Master figures out what tasks the job will require, and requests the
“resources” (meaning memory and CPU cores) those tasks require. The Resource Manager will allocate (schedule)
“containers” for the application. The Application Master then starts and tracks the progress of the tasks.
When the tasks are complete, the App Master notifies the RM, which will then deallocate all the containers
(including the App Master itself) and notify the client that the job is complete.
(Again, if students are interested in MR1 vs MR2, point out here that instead of having designated “slots”
configured on each worker node, each node simply has “resources” which are available for the RM to
allocate, not caring whether it’s a map task or a reduce task. In fact, the underlying YARN architecture doesn’t
even know if it’s a MapReduce job or some other kind of job.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-21
Job Data: Mapper Data Locality
 
When possible, Map
tasks run on a node
where a block of data
is stored locally
 
 
Otherwise, the Map
task will transfer
the data across
the network as it
processes that data

As described in the first chapter, Hadoop tries to avoid the overhead associated with copying data from
storage nodes to processing nodes by co-locating storage and processing responsibilities to worker nodes
which do both. Since Hadoop knows which nodes hold which data, it can further reduce network transfer
by trying to schedule computation on a node which already holds the data to be processed. This is known
as “rack awareness” or “data locality” (synonymous terms).
Common Question: How does Hadoop know about the topology of your network so it can schedule tasks
efficiently?
Answer: This is more of a topic for system administrators and is covered in depth in our Admin class. But
the short answer is that system administrator configures this using a script (see TDG 3e starting on the
bottom of page 299 (TDG 2e, 251)).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-22
Job Data: Intermediate Data
 
 
 
 
Map task intermediate data is stored on the local disk (not HDFS).


Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-23
Job Data: Shuffle and Sort
 
There is no concept of data locality for Reducers.
Intermediate data is transferred across the network to the Reducers.
Reducers write their output to HDFS.

Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers.
All Mappers will, in general, have to communicate with all Reducers.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-24
Is Shuffle and Sort a Bottleneck?
▪ It appears that the shuffle and sort phase is a bottleneck
─ The reduce method in the Reducers cannot start until all Mappers have
finished
▪ In practice, Hadoop will start to transfer data from Mappers to Reducers as
soon as the Mappers finish work
─ This avoids a huge amount of data transfer starting as soon as the last
Mapper finishes
─ The reduce method still does not start until all intermediate data has been
transferred and sorted

Important point: Because a Reducer will process all values for a given key, no Reducers can start reducing
until all Mappers have finished.
Although no Reducer can begin reducing until all Mappers are done Mapping, they can and do begin
copying intermediate data down from the Mappers as soon as Mappers finish. This helps to spread out the
network transfer operation and eliminate the obvious bottleneck that would occur if all Reducers started
copying data from all the Mappers as soon as the last Mapper completes. To be clear, the Reducers just start copying data as each Mapper completes; they cannot start processing it until all the Mappers are complete.
NOTE: When students submit jobs and see the progress percentage for the Reduce tasks on the console or in the Web UI, they will probably notice that the “Reduce %” starts incrementing before the “Map %” reaches 100%. This is because copying intermediate data is attributed to the “Reduce %”, since the Reducers pull this information (instead of the Mappers pushing it). In other words, a non-zero “Reduce %” just means that at least one Mapper is complete and that the Reducers have started copying data down, not that they have necessarily started processing it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-25
Is a Slow Mapper a Bottleneck?
▪ It is possible for one Map task to run more slowly than the others
─ Perhaps due to faulty hardware, or just a very slow machine
▪ It would appear that this would create a bottleneck
─ The reduce method in the Reducer cannot start until every Mapper has
finished
▪ Hadoop uses speculative execution to mitigate against this
─ If a Mapper appears to be running significantly more slowly than the others,
a new instance of the Mapper will be started on another machine, operating
on the same data
─ A new task attempt for the same task
─ The results of the first Mapper to finish will be used
─ Hadoop will kill off the Mapper which is still running

Because no Reducers can start reducing until all Mappers have finished, one slow Mapper slows down
the entire job. This is why we have speculative execution. It is intended to help in the case where a
task is running on a machine that has performance problems (e.g. because a disk or network card is
starting to fail). At this point, you should mention the difference between “failed tasks” (e.g. an exception
occurred during processing) and “killed tasks” (e.g. the JT killed the slower of two identical tasks following
speculative execution, because the faster one already completed successfully).
NOTE: just as a slow Map task (called a “straggler”) can slow down the job, a slow Reducer can also prevent
the job from completing as quickly as it should. For this reason, speculative execution can also take place
with Reducers, although it is not pertinent to this slide. As will be discussed later in this class, speculative
execution will not always help you and may in fact slow things down. Such is the case when a slow Mapper
is not the result of performance problems on a given machine, but because it’s simply doing more work
(e.g. processing a larger chunk of data or doing a more intense calculation on a chunk of data) than the
other Mappers. In this case, running the same task on another machine isn’t going to help much. It is
therefore possible to turn off speculative execution on a per-job or cluster-wide basis. (See TDG 3e page
215 (TDG 2e, 183) for more information.)
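If you want to show how a per-job change is made, the following driver-code fragment is a minimal sketch (not part of the course code) that disables speculative execution for a single job. The property names shown are the MR2-era mapreduce.* names; MR1 used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead.

// Minimal sketch: disable speculative execution for this job only.
// Property names assume the MR2-era mapreduce.* names (see note above).
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", false);    // no speculative map attempts
conf.setBoolean("mapreduce.reduce.speculative", false); // no speculative reduce attempts
Job job = new Job(conf);
job.setJobName("Job Without Speculative Execution");    // name is illustrative only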

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-26
Creating and Running a MapReduce Job
▪ Write the Mapper and Reducer classes
▪ Write a Driver class that configures the job and submits it to the cluster
─ Driver classes are covered in the “Writing a MapReduce Program in Java” chapter
▪ Compile the Mapper, Reducer, and Driver classes

$ javac -classpath `hadoop classpath` MyMapper.java MyReducer.java MyDriver.java

▪ Create a jar file with the Mapper, Reducer, and Driver classes

$ jar cf MyMR.jar MyMapper.class MyReducer.class MyDriver.class

▪ Run the hadoop jar command to submit the job to the Hadoop cluster

$ hadoop jar MyMR.jar MyDriver in_file out_dir

Review the steps that students will perform in the Running a MapReduce Job hands-on exercise.
You might get questions about the driver class, since we haven’t covered it yet. We will cover it extensively in the “Writing a MapReduce Program in Java” chapter.
For the hadoop jar command example, you should mention that MyDriver is the class name of the
driver class, in_file is the name of the input file that the mapper reads, and out_dir is the name of
the output folder from the MapReduce job.
If you have a class full of students who are experienced Java programmers and are eager to use Eclipse, you
might want to go over creating the jar file from Eclipse at this point. Eclipse usage is covered in the next
chapter, but the WordCount solution is available in the wordcount Eclipse project, and if they want to get
started with Eclipse early, you could cover the Eclipse slides here instead of in the next chapter.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-27
Chapter Topics
Hadoop Clusters and the Hadoop
Ecosystem

▪ Hadoop Cluster Overview


▪ Hadoop Jobs and Tasks
▪ Hands-On Exercise: Running a MapReduce Job
▪ Other Hadoop Ecosystem Components
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-28
Hands-On Exercise: Running A MapReduce Job
▪ In this Hands-On Exercise, you will run a MapReduce job on your pseudo-
distributed Hadoop cluster
▪ Please refer to the Hands-On Exercise Manual

Several instructors have reported students receiving a "Connection Refused" message when they try to run
the word count example for the first time.
If your students receive this message, it might be because the JobTracker daemon crashed. Check the status
of the JobTracker daemon (/etc/init.d/hadoop-0.20-mapreduce-jobtracker status) and
see if the status indicates that the process is dead but the PID file still exists. If this is the case, restarting
the JobTracker daemon (/etc/init.d/hadoop-0.20-mapreduce-jobtracker start) could
resolve the problem.
This problem might be related to low memory conditions on the VM. Another possible workaround is to
increase the RAM size for the VM and restart the machine. You can do this if the student lab system has
more than 2 GB of RAM.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-29
Chapter Topics
Hadoop Clusters and the Hadoop
Ecosystem

▪ Hadoop Cluster Overview


▪ Hadoop Jobs and Tasks
▪ Hands-On Exercise: Running a MapReduce Job
▪ Other Hadoop Ecosystem Components
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-30
The Hadoop Ecosystem (1)

▪ Hadoop consists of two core components, which we already covered
─ The Hadoop Distributed File System (HDFS)
─ The MapReduce software framework
▪ There are many other projects based around core Hadoop
─ Often referred to as the ‘Hadoop Ecosystem’
─ Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-31
The Hadoop Ecosystem (2)
▪ Ecosystem projects may be
─ Built on HDFS and MapReduce
─ Built on just HDFS
─ Designed to integrate with or support Hadoop
▪ Most are Apache projects or Apache Incubator projects
─ Some others are not managed by the Apache Software Foundation
─ These are often hosted on GitHub or a similar repository
▪ Following is an introduction to some of the most significant projects

“Many other projects exist which use Hadoop core” – or help you to use Hadoop core (in the case of Sqoop,
Flume or Oozie).
Most of the projects are now Apache projects or Apache Incubator projects. Being in the Apache Incubator
doesn’t mean the project is unstable or unsafe to use in production, it just means that it’s a relatively new
project at Apache and the community around it is still being established. In most cases, Incubator projects
at Apache have been around for several years, often inside a specific company (like Cloudera or Yahoo!) and typically hosted elsewhere (like GitHub), before being donated to Apache.
The name Hadoop comes from the name of a stuffed elephant toy that Doug Cutting’s son had. A lot of
projects in the Hadoop Ecosystem have unusual names and are often related to animals, and in particular,
elephants.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-32
Hive and Pig
▪ Languages for querying and manipulating data
─ Higher level than MapReduce
▪ Interpreter runs on a client machine
─ Turns queries into MapReduce jobs
─ Submits jobs to the cluster
▪ Overview later in the course
─ Covered in detail in Cloudera Data Analyst Training: Using Pig, Hive, and
Impala with Hadoop

There’s a whole chapter on Hive and Pig coming up later in this course and Cloudera offers a 4-day course
that covers Hive in much greater detail (it’s a good idea to mention dates/locations of upcoming offerings).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-33
Hive
▪ HiveQL
─ Very similar to SQL
▪ Sample Hive query:

SELECT stock.product, SUM(orders.purchases)
FROM stock JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;

Hive was initially created at Facebook when they found their existing data warehousing solution could not
scale to process as much data as they were generating (see TDG 3e, page 413 (TDG 2e, 365)).
The essential point is that Hive gives you an alternative way to analyze your data.
Instead of writing MapReduce to query the data (which might be 50 lines of code), you can write just a few
lines of HiveQL similar to what’s shown here. This makes analysis of data stored in your cluster available to
a much wider audience, since someone like a business analyst or DBA could write this SQL but isn’t trained
to write the equivalent Java code. But even if you are a programmer, Hive can save a lot of time and trouble
for certain types of analysis.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-34
Pig
▪ PigLatin
─ A dataflow scripting language
▪ Sample Pig script:

stock = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;

Pig was developed at Yahoo and attempts to do more or less the same thing as Hive, it just goes about the
solution in a different way. Like Hive, it lets you define analysis using a high-level language (PigLatin in this
case) which ultimately gets turned into MapReduce jobs that analyze data in your cluster.
As you see, PigLatin is more of a procedural language (because we’re specifying a series of steps needed to
achieve some result) than Hive, but they both do similar things.
One thing that’s noteworthy here is that we’re defining the schema “on the fly” at the time the data is
being analyzed, rather than when the data is initially loaded into HDFS. Hive works the same way, but this
is quite different than an RDBMS since with an RDBMS you must design your schema up front by specifying
the order, name and type of each column before you can import any data.
Common question: If Hive and Pig both sort of do the same thing, why do both exist?
Answer: Why do both Perl and Ruby exist? Same reason -- they represent alternative solutions to a
problem, developed by different groups of people. We’ll talk more about this later in the course.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-35
Impala
▪ High-performance SQL engine for vast amounts of data
─ Similar query language to HiveQL
─ 10 to 50+ times faster than Hive, Pig, or MapReduce
▪ Impala runs on Hadoop clusters
─ Data stored in HDFS
─ Does not use MapReduce
▪ Developed by Cloudera
─ 100% open source, released under the Apache software license
▪ We will investigate Impala later in the course

For more information on Impala, look at http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
It *does not* use MapReduce. The impalad daemons run on each worker node. You access it from a
command-line tool on a client machine. It leverages Hive’s metadata: you create Hive tables, and then
query those tables using Impala. Impala is an example of a non-MR application that can run together with
MR applications in a Hadoop cluster because of YARN, which supports any Hadoop app, not just MR apps.
(CDH5 includes a “preview” version of Impala running on YARN)
Impala was announced at the Strata + Hadoop World conference in New York City on October 24, 2012,
after which the beta version that had been tested by many of Cloudera’s customers during the previous months
became available to the general public. Several additional beta versions followed until the GA (General
Availability; i.e. 1.0 production version) was released on May 1, 2013.
“Inspired by Google’s Dremel database” – Dremel is a distributed system for interactive ad-hoc queries
that was created by Google. Although it’s not open source, the Google team described it in a published
paper http://research.google.com/pubs/archive/36632.pdf. Impala is even more
ambitious than Dremel in some ways; for example, the published description of Dremel says that joins
are not implemented at all, while Impala supports the same inner, outer, and semi-joins that Hive does.
Impala development is led by Marcel Kornacker, who joined Cloudera to work on Impala in 2010 after
serving as tech lead for the distributed query engine component of Google’s F1 database (http://tiny.cloudera.com/dac15b).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-36
Flume and Sqoop
▪ Flume imports data into HDFS as it is generated
─ Instead of batch-processing it later
─ For example, log files from a Web server
 
 
▪ Sqoop transfers data between RDBMSs and HDFS
─ Does this very efficiently via a Map-only MapReduce job
─ Supports JDBC, ODBC, and several specific databases
─ “Sqoop” = “SQL to Hadoop”
▪ We will investigate Flume and Sqoop later in the course

Flume and Sqoop are two totally different products with one thing in common: they both help you get data
into your Hadoop cluster.
Flume gets its name from “Log Flume” (which many will know as a popular water ride at amusement parks).
It lets you ingest data directly into your cluster in real time, instead of generating it to files and importing
those files into your cluster later. As the name implies, it is most often used for server logs (Web server,
e-mail server, UNIX syslog, etc.) but can be adapted to read data from lots of other sources, as will be
discussed later in the course.
Sqoop can be remembered as a contraction of “SQL-to-Hadoop” and is a tool for bringing data from a
relational database into Hadoop for analysis or for exporting data you’ve already got in Hadoop back to an
external database for further processing or analysis.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-37
Oozie
▪ Oozie
─ Workflow engine for MapReduce jobs
─ Defines dependencies between jobs
▪ The Oozie server submits the jobs to the cluster in the correct sequence
▪ We will investigate Oozie later in the course
 

As you’ll see later in this course, it’s actually a pretty common practice to have the output of one
MapReduce job feed in as input to another MapReduce job, conceptually similar to how you might use a
series of simple UNIX utilities connected by pipes to accomplish some larger, more interesting job.
Oozie is a tool that will help you to define this sort of workflow for Hadoop jobs. It takes care of running the
jobs in the correct order, specifying the location of input and output, and letting you define what happens
when the job completes successfully or when it encounters an error.
We have an entire chapter on Oozie later in this course.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-38
HBase
▪ HBase is the Hadoop database
▪ A ‘NoSQL’ datastore
▪ Can store massive amounts of data
─ Petabytes+
▪ High write throughput
─ Scales to hundreds of thousands of inserts per second
▪ Handles sparse data well
─ No wasted space for empty columns in a row
▪ Limited access model
─ Optimized for lookup of a row by key rather than full queries
─ No transactions: single row operations only
─ Only one column (the ‘row key’) is indexed

HBase was inspired by Google’s “BigTable” paper presented at OSDI in 2006 (http://research.google.com/archive/bigtable-osdi06.pdf).
We don’t cover HBase in this class, but we do offer an entire class on it (mention relevant dates and
locations). Also, one of Cloudera’s Solution Architects, Lars George, literally wrote the book on HBase
(HBase: The Definitive Guide, published by O’Reilly). Chapter 13 in TDG 3e and TDG 2e also covers HBase.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-39
HBase vs Traditional RDBMSs

                            RDBMS                           HBase
Data layout                 Row-oriented                    Column-oriented
Transactions                Yes                             Single row only
Query language              SQL                             get/put/scan (or use Hive or Impala)
Security                    Authentication/Authorization    Kerberos
Indexes                     Any column                      Row-key only
Max data size               TBs                             PB+
Read/write throughput       Thousands                       Millions
(queries per second)

Generally speaking, HBase is limited in terms of feature set when compared to an RDBMS, but very
compelling when it comes to scalability.
You might get questions about how HBase compares with NoSQL databases, for example, Cassandra:
• Cassandra is a distributed key-value store based on BigTable & Amazon’s Dynamo. HBase is a
distributed key-value store based on BigTable.
• Both use a column-family based data-model, like BigTable.
• Cassandra is eventually consistent (favors availability and partition tolerance, but with tunable consistency options). HBase is strongly consistent (favors consistency and partition tolerance; it can survive a network partition, but may sacrifice availability while it does).
• Cassandra uses a decentralized P2P (master/master) communication model, based on Gossip. HBase
uses Zookeeper to manage state.
• Cassandra relies on local storage (not HDFS) and replicates data between nodes for fault tolerance.
HBase uses HDFS for storage and relies on the replication HDFS provides for fault tolerance.
• Cassandra is a top-level Apache project open-sourced by Facebook in 2008. HBase has been a top-level
Apache project since 2010, a Hadoop sub-project since 2008.
• Cassandra is supported by DataStax, HBase by Cloudera.
An interesting link comparing NoSQL databases: http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-40
Mahout
▪ Mahout is a Machine Learning library written in Java
▪ Used for
─ Collaborative filtering (recommendations)
─ Clustering (finding naturally occurring “groupings” in data)
─ Classification (determining whether new data fits a category)
▪ Why use Hadoop for Machine Learning?
─ “It’s not who has the best algorithms that wins. It’s who has the most data.”

Clustering example: finding related news articles. Computer vision – grouping pixels that cohere into
objects.
Classification example: spam filtering. Given tumors identified as benign or malignant, classify new tumors.
Since Machine Learning benefits from having lots of data, it stands to reason that it would work nicely on
systems designed to store lots of data (like Hadoop).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-41
Chapter Topics
Hadoop Clusters and the Hadoop
Ecosystem

▪ Hadoop Cluster Overview


▪ Hadoop Jobs and Tasks
▪ Hands-On Exercise: Running a MapReduce Job
▪ Other Hadoop Ecosystem Components
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-42
Key Points
▪ HDFS daemons
─ NameNode (master node)
─ DataNode (worker nodes)
─ (Secondary NameNode if not running High Availability HDFS)
▪ Key MapReduce daemons
─ MRv1: JobTracker (master node)
─ MRv1: TaskTracker (worker nodes)
─ MRv2: ResourceManager (master node)
─ MRv2: NodeManager (worker nodes)
─ MRv2: ApplicationMaster (worker node, one per job)
▪ Hadoop Ecosystem
─ Many projects built on, and supporting, Hadoop
─ Several will be covered later in the course

Not shown is the MR Job History daemon.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-43
Writing a MapReduce Program in
Java
Chapter 6

Chapter Goal
Teach students the basic MapReduce API concepts and how to write MapReduce drivers, Mappers, and Reducers in Java, and introduce the differences between the old and new MapReduce APIs.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-2
Writing a MapReduce Program in Java
In this chapter, you will learn
▪ Basic MapReduce API concepts
▪ How to write MapReduce drivers, Mappers, and Reducers in Java
▪ The differences between the old and new MapReduce APIs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-3
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-4
Review: The MapReduce Flow

In this chapter we will be focusing on the Mapper and Reducer parts. The others are optional – Hadoop
includes pre-configured components for most common situations.
Almost every job you write for Hadoop will have these three parts (Mapper, Reducer and Driver). These
three things are required to configure and execute the Map and Reduce code.
You may have additional parts too, mainly intended to optimize or test your code, or to handle custom file
formats, and we’ll cover those later in class.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-5
A Sample MapReduce Program: WordCount
▪ In an earlier chapter, you ran a sample MapReduce program
─ WordCount, which counted the number of occurrences of each unique word
in a set of files
▪ In this chapter, we will examine the code for WordCount
─ This will demonstrate the Hadoop API

   
Input:
the cat sat on the mat
the aardvark sat on the sofa

Output:
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4

In this chapter, we’re actually going to look at the code -- not just the concepts – behind a MapReduce
program for Hadoop.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-6
Our MapReduce Program: WordCount
▪ To investigate the API, we will dissect the WordCount program we covered in
the previous chapter
▪ This consists of three portions
─ The driver code
─ Code that runs on the client to configure and submit the job
─ The Mapper
─ The Reducer
▪ Before we look at the code, we need to cover some basic Hadoop API concepts

You already understand the concept of how WordCount works in MapReduce from the previous chapter.
You’ve even run it in Hadoop during the last lab exercise.
Now we’re going to look at the Java code to see specifically how it’s implemented in Hadoop.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-7
Getting Data to the Mapper
▪ The data passed to the Mapper is specified by an InputFormat
─ Specified in the driver code
─ Defines the location of the input data
─ Typically a file or directory
─ Determines how to split the input data into input splits
─ Each Mapper deals with a single input split
─ Creates a RecordReader object
─ RecordReader parses the input data into key/value pairs to pass to the
Mapper

We will cover input formats much more later. Key point to focus on here is that the choice of input format
determines the format of the records handed to the mapper.
InputFormat is an interface in Hadoop and there are many implementations in Hadoop (such as
TextInputFormat, which is the default, and many others described on an upcoming slide). Per the API
documentation, Hadoop relies on the InputFormat for a given job to:
1. Validate the input-specification of the job. This means that the input format will do checks such as
making sure the input path exists and is readable.
2. Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
This means that you will have one Mapper for each InputSplit (and since the size of an input split typically corresponds to the size of a data block in HDFS, a 640 MB input file will most likely be divided into 10 blocks of 64 MB each, assuming the default settings, so the job will process it using 10 Map tasks).
3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit
for processing by the Mapper. This means that the record reader knows how to break raw data from the
InputSplit into a series of key-value pairs that are passed to the Mapper’s Map method, one pair at a time.
More information on InputFormat can be found in TDG 3e on page 234 (TDG 2e, 198).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-8
Example: TextInputFormat
▪ TextInputFormat
─ The default
─ Creates LineRecordReader objects
─ Treats each \n-terminated line of a file as a value
─ Key is the byte offset of that line within the file

Example input:
the cat sat on the mat\n
the aardvark sat on the sofa\n

key   value
0     the cat sat on the mat
23    the aardvark sat on the sofa
52    …

The “\n” is the escape sequence in Java (and perl, C, C++, UNIX shell, etc.) for a newline character (ASCII
code 10). You should pronounce it in class as “newline” and not “backslash N” or anything else.
TextInputFormat: As was the case in WordCount, you often don’t care about the byte offset and ignore it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-9
Other Standard InputFormats
▪ FileInputFormat
─ Abstract base class used for all file-based InputFormats
▪ KeyValueTextInputFormat
─ Maps \n-terminated lines as ‘key [separator] value’
─ By default, [separator] is a tab
▪ SequenceFileInputFormat
─ Binary file of (key, value) pairs with some additional metadata
▪ SequenceFileAsTextInputFormat
─ Similar, but maps (key.toString(), value.toString())

FileInputFormat is not a class you will use directly; it’s an abstract base class (i.e. it’s there to simplify
creating file-based input formats by doing things like input path validation that are common to all
subclasses). The other four listed on this slide are specific implementations you can use directly. The first
two are text-based while the other two relate to a Hadoop-specific file format called Sequence File that is
discussed in depth later in class. This is by no means a complete list of input formats. For more examples, as
well as explanation for those listed here, read pages 237 and 245-251 in TDG 3e (TDG 2e, 201 and 209-215).
KeyValueTextInputFormat: Use the key.value.separator.in.input.line property to specify a
different separator (such as a comma for CSV files). See TDG 3e page 247 (TDG 2e, 211) for more info.
The purpose of the key.toString() and value.toString() methods in the last item is that it’s converting the
objects into their text representation, which can be useful in some cases.
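If students ask how the separator is set in practice, the following driver fragment is a minimal sketch assuming simple comma-separated input; depending on the release, the property may instead be named mapreduce.input.keyvaluelinerecordreader.key.value.separator.

// Minimal sketch: KeyValueTextInputFormat with a comma as the separator.
// The property name is the one referenced above; newer releases also accept
// mapreduce.input.keyvaluelinerecordreader.key.value.separator.
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);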

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-10
Keys and Values are Objects
▪ Keys and values in Hadoop are Java Objects
─ Not primitives
▪ Values are objects which implement Writable
▪ Keys are objects which implement WritableComparable

Point #1: Both keys and values are Java objects. They are not Java primitives. In other words, you cannot
use things like int or float as keys or values in Hadoop. Instead, you use the corresponding Writable subclass
(such as IntWritable or FloatWritable) to “wrap” these values. The Writable interface is used in Hadoop’s
serialization process, allowing these values to be efficiently stored to or read from files, as well as be passed
across the network.
Point #2: Recall that Hadoop sorts the keys passed to the Reducer (values are in no particular order). Thus,
it must therefore be possible to sort the keys, and this is accomplished by making them Comparable as
well as Writable. Hadoop defines an interface called WritableComparable for this purpose. Objects used
as values (not keys) may also be WritableComparable, but since the values are not sorted, they are only
required to be Writable.
NOTE: Although Writable is Hadoop-specific, Comparable is an interface that is part of core Java
(java.lang.Comparable). Students who are experienced Java programmers will likely already be familiar with
Comparable since it’s used throughout Java for sorting things.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-11
What is Writable?
▪ The Writable interface makes serialization quick and easy for Hadoop
▪ Any value’s type must implement the Writable interface
▪ Hadoop defines its own ‘box classes’ for strings, integers, and so on
─ IntWritable for ints
─ LongWritable for longs
─ FloatWritable for floats
─ DoubleWritable for doubles
─ Text for strings
─ Etc.

“box classes” is a reference to the “autoboxing” feature introduced in Java 1.5, which converts primitives
(like int or float) to and from their object wrapper types (like java.lang.Integer or java.lang.Float).
However, unlike with Java’s wrapper types, this conversion is not automatic in Hadoop. You create a
Writable instance by supplying a primitive in the constructor (e.g. IntWritable key = new IntWritable(5);) or
by calling the set(int) method on an existing instance. Similarly, you get the primitive value back again by
calling a method on that object (e.g. int value = key.get();).
NOTE: The map and reduce methods both require that the record you emit have both a key and value. In
some cases, either the key or value is unimportant to your algorithm, but you are still required to have one.
You should consider using NullWritable in these cases, as it’s basically a singleton placeholder object which
doesn’t actually store any data (and therefore it conserves storage space). Unlike the others, you do not
create it via the constructor, but rather call the static NullWritable.get() method to acquire it. See TDG 3e
page 102 (TDG 2e, 95) for more information.
Common question: Why doesn’t Hadoop use the normal Java serialization mechanism?
Answer: Because Hadoop requires efficiency much more than it requires a general purpose solution like
Java provides. This is described in greater detail on page 108 of TDG 3e (TDG 2e. 102), including quotes
from Hadoop creator Doug Cutting.
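A minimal sketch of working with the box classes may help here; unlike Java autoboxing, the conversion between primitives and Writables is explicit:

IntWritable count = new IntWritable(5);   // wrap an int in a Writable
int n = count.get();                      // get the primitive back out
count.set(n + 1);                         // reuse the same object with a new value

Text word = new Text("aardvark");         // Hadoop's wrapper for strings
String s = word.toString();               // back to a java.lang.String

NullWritable none = NullWritable.get();   // singleton placeholder for an unused key or value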

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-12
What is WritableComparable?
▪ A WritableComparable is a Writable which is also Comparable
─ Two WritableComparables can be compared against each other to
determine their ‘order’
─ Keys must be WritableComparables because they are passed to the
Reducer in sorted order
─ We will talk more about WritableComparables later
▪ Note that despite their names, all Hadoop box classes implement both
Writable and WritableComparable
─ For example, IntWritable is actually a WritableComparable

The first item is saying that interface WritableComparable extends the Writable interface by adding the
Comparable interface. Therefore, a class which implements WritableComparable (such as those mentioned
on the previous slide) is both Writable and Comparable. As such, it could be used to represent either keys
or values.
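If students want a preview of what a custom key looks like, the following is a minimal sketch (not part of the course code) of a WritableComparable holding a year/month pair; the class name and fields are made up for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonth implements WritableComparable<YearMonth> {
  private int year;
  private int month;

  public void write(DataOutput out) throws IOException {    // serialize the key
    out.writeInt(year);
    out.writeInt(month);
  }

  public void readFields(DataInput in) throws IOException { // deserialize the key
    year = in.readInt();
    month = in.readInt();
  }

  public int compareTo(YearMonth other) {                   // defines the sort order of keys
    if (year != other.year) {
      return year - other.year;
    }
    return month - other.month;
  }
  // A production key class would normally also override hashCode() and equals().
}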

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-13
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-14
The Driver Code: Introduction
▪ The driver code runs on the client machine
▪ It configures the job, then submits it to the cluster

How deep your explanation is in the next few slides should be related to how much Java development
experience those in your class have.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-15
The Driver: Complete Code (1)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {

public static void main(String[] args) throws Exception {

if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}

Job job = new Job();


job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
file continues on the next slide

Although the driver code is relatively simple, it won’t all fit on one screen. Here’s the first half. We’ll see the
rest on the next screen, then we’ll go over it one section at a time over the slides that follow that one.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-16
The Driver: Complete Code (2)

FileInputFormat.setInputPaths(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

boolean success = job.waitForCompletion(true);


System.exit(success ? 0 : 1);
}
}

Here’s the remainder of the driver code from the previous screen. Now that you’ve seen it, we’ll look at
each part…

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-17
The Driver: Import Statements
 1 import org.apache.hadoop.fs.Path;
 2 import org.apache.hadoop.io.IntWritable;
 3 import org.apache.hadoop.io.Text;
 4 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 5 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 6 import org.apache.hadoop.mapreduce.Job; 1
 7
 8 public class WordCount {
 9
10 public static void main(String[] args) throws Exception {
11
12 if (args.length != 2) {
13 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
14 System.exit(-1);
15 }
16
17 Job job = new Job();
18 job.setJarByClass(WordCount.class);
19 job.setJobName("Word Count");

1 You will typically import these classes into every MapReduce job you write.
We will omit the import statements in future slides for brevity.

The “import” statements are how you tell Java which classes your program is going to use. Experienced (or
even novice) Java programmers already know about this, but if you have people that know C, this is vaguely
similar to an “include” statement. If you have perl programmers, it’s quite similar to a “use” (pronounced
like the verb, not the noun) or “require” statement.
At any rate, they can think of it as basically boilerplate code that will be found at the top of every program,
although the exact import lines will vary somewhat (e.g. you might import FloatWritable instead of
IntWritable). Your IDE (Eclipse, NetBeans, IntelliJ IDEA, etc.) can generate these import statements for you
automatically, so you need not memorize these lines.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-18
The Driver: Main Code
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);}
13 Job job = new Job();
14 job.setJarByClass(WordCount.class);
15 job.setJobName("Word Count");
16
17 FileInputFormat.setInputPaths(job, new Path(args[0]));
18 FileOutputFormat.setOutputPath(job, new Path(args[1]));
19
20 job.setMapperClass(WordMapper.class);
21 job.setReducerClass(SumReducer.class);
22
23 job.setMapOutputKeyClass(Text.class);
24 job.setMapOutputValueClass(IntWritable.class);
25
26 job.setOutputKeyClass(Text.class);
27 job.setOutputValueClass(IntWritable.class);
28
29 boolean success = job.waitForCompletion(true);
30 System.exit(success ? 0 : 1);}
31 }

Now that we’ve removed the import statements for the sake of brevity, the entire Driver class now fits on
one screen. Let’s go over each part…

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-19
The Driver Class: main Method
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception { 1
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);
13 }
14 Job job = new Job();
15 job.setJarByClass(WordCount.class);
16 job.setJobName("Word Count");
17
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);

1 The main method accepts two command-line arguments: the input and
output directories.

Java programmers (and C and C++ programmers) will know that the “main” method is a special method
which is executed when you run the class from the command line. Those without this background likely
won’t see it as anything special, so you should point this out by saying something like, “when you type ‘java
WordCount’ on the command line, Java will look for the main method and execute whatever code it finds in
that method.”
The best practice is to extend the Configured class, implement the Tool interface, and use ToolRunner.
ToolRunner makes it possible to specify or override settings from the command line. This is more flexible
and generally preferred, since you don’t have to check out/modify/recompile your code. Getting started
without ToolRunner is slightly easier, but we will use ToolRunner starting in the “Delving Deeper Into the
Hadoop API” chapter.
The only reason you might not want to use ToolRunner is if you want to actually prevent others from
changing these settings (i.e. “locking it down” as a sysadmin might wish to do). In that case, just leave the
“extends Configured implements Tool” part out of the class definition and then put everything in the main
method of this class.
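If you want to show the pattern briefly, the following is a minimal sketch of a ToolRunner-based driver, assuming a hypothetical WordCountTool class; the full treatment comes in the “Delving Deeper Into the Hadoop API” chapter.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    Job job = new Job(getConf());            // picks up -D options parsed by ToolRunner
    job.setJarByClass(WordCountTool.class);
    job.setJobName("Word Count");
    // ...the same configuration calls shown in this chapter's driver go here...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountTool(), args));
  }
}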

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-20
Sanity Checking The Job’s Invocation
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);
13 }1
14 Job job = new Job();
15 job.setJarByClass(WordCount.class);
16 job.setJobName("Word Count");
17
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);

1 The first step is to ensure we have been given two command-line arguments.
If not, print a help message and exit.

Here we just do some ‘sanity checking’ to make sure the person who runs the driver has provided us with
two arguments: an input and an output path. If they haven’t, we just print out a message that explains how
to properly invoke this program and then exit with a non-zero return value (which indicates to the UNIX
shell that the program didn’t complete successfully).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-21
Configuring The Job With the Job Object
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);
13 }
14 Job job = new Job();
15 job.setJarByClass(WordCount.class); 1
16 job.setJobName("Word Count");
17
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);

1 To configure the job, create a new Job object. Identify the Jar which contains
the Mapper and Reducer by specifying a class in that Jar.

The Job object collects all the settings that tell Hadoop how the MapReduce job will execute. Most of what
we do in the driver class relates to configuring the Job. We’ll see some typical configuration settings on the
next few slides.
In addition, we’ll use the Job object to submit the job to Hadoop.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-22
Creating a New Job Object
▪ The Job class allows you to set configuration options for your MapReduce job
─ The classes to be used for your Mapper and Reducer
─ The input and output directories
─ Many other options
▪ Any options not explicitly set in your driver code will be read from your
Hadoop configuration files
─ Usually located in /etc/hadoop/conf
▪ Any options not specified in your configuration files will use Hadoop’s default
values
▪ You can also use the Job object to submit the job, control its execution, and
query its state
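A minimal sketch of how these settings layer, with an illustrative reducer count; anything not set explicitly in the driver falls back to the configuration files and then to Hadoop’s defaults:

Configuration conf = new Configuration();  // reads core-site.xml, mapred-site.xml, and so on
Job job = new Job(conf, "Word Count");
job.setNumReduceTasks(2);                  // explicit driver setting; overrides files and defaults
// Options not set here (for example, scheduler or I/O settings) come from
// /etc/hadoop/conf or, failing that, from Hadoop's built-in defaults.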

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-23
Configuring the Job: Setting the Name
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);
13 }
14 Job job = new Job();
15 job.setJarByClass(WordCount.class);
16 job.setJobName("Word Count"); 1
17
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);

1 Give the job a meaningful name.

The job name is just a human-readable display name that describes your job. Here we just have a line
of code that sets the job name to a string that contains the current class name (“WordCount”). In a real
application running on a multi-user cluster, you will probably want to give it a more meaningful name (like
“Third Quarter Sales Report Generator” or “Web Site Session Analysis Job”) which will better help others
identify it from among all the other jobs running on the cluster.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-24
Configuring the Job: Specifying Input and Output Directories
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);
13 }
14 Job job = new Job();
15 job.setJarByClass(WordCount.class);
16 job.setJobName("Word Count");
17
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1])); 1

20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);
26

1 Next, specify the input directory from which data will be read, and the output
directory to which final output will be written.

We almost always deal with file-based input and output in Hadoop. We can specify any number of
input files or directories as input to the job. Hadoop gives us a lot of flexibility over how to specify input
data and we’ll look at this in more detail momentarily. The output path is a single directory (notice that
“setInputPaths” is plural, but “setOutputPath” is singular), because Hadoop takes care of writing one or
more output files in that directory based on other aspects of job configuration.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-25
Configuring the Job: Specifying the InputFormat
▪ The default InputFormat (TextInputFormat) will be used unless you
specify otherwise
▪ To use an InputFormat other than the default, use e.g.

job.setInputFormatClass(KeyValueTextInputFormat.class)

Recall that TextInputFormat was the input format in which the value was a line of text and the key was the
byte offset at which that line began in the file. KeyValueTextInputFormat was the “tab-delimited” format
mentioned earlier.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-26
Configuring the Job: Determining Which Files To Read
▪ By default, FileInputFormat.setInputPaths() will read all files from
a specified directory and send them to Mappers
─ Exceptions: items whose names begin with a period (.) or underscore (_)
─ Globs can be specified to restrict input
─ For example, /2010/*/01/*
▪ Alternatively, FileInputFormat.addInputPath() can be called
multiple times, specifying a single file or directory each time
▪ More advanced filtering can be performed by implementing a PathFilter
─ Interface with a method named accept
─ Takes a path to a file, returns true or false depending on whether or
not the file should be processed

Those familiar with UNIX will understand that files whose names begin with a dot are treated specially (they
are not shown by the ‘ls’ command in UNIX, by default). Files whose names begin with an underscore are
used as “flags” in Hadoop (e.g. there is a _SUCCESS file created when a job completes without error).
Hadoop’s support for globs (matching patterns) are fairly extensive and basically mirror those available in a
modern UNIX shell like bash. See TDG 3e pages 65-66 (TDG 2e, 60-61) for more explanation and examples.
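A minimal sketch of the two approaches on this slide, with made-up class and path names:

// Add several specific input paths, one call per path:
FileInputFormat.addInputPath(job, new Path("/data/2014/01"));
FileInputFormat.addInputPath(job, new Path("/data/2014/02"));

// Or filter which files in the input directories get processed:
public class CsvOnlyFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".csv");   // skip files that are not .csv
  }
}
// ...then, in the driver:
FileInputFormat.setInputPathFilter(job, CsvOnlyFilter.class);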

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-27
Configuring the Job: Specifying Final Output With OutputFormat
▪ FileOutputFormat.setOutputPath() specifies the directory to which
the Reducers will write their final output
▪ The driver can also specify the format of the output data
─ Default is a plain text file
─ Could be explicitly written as
job.setOutputFormatClass(TextOutputFormat.class)
▪ We will discuss OutputFormats in more depth in a later chapter

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-28
Configuring the Job: Specifying the Mapper and Reducer Classes
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);
13 }
14 Job job = new Job();
15 job.setJarByClass(WordCount.class);
16 job.setJobName("Word Count");
17
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class); 1

23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);

1 Give the Job object information about which classes are to be instantiated
as the Mapper and Reducer.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-29
Default Mapper and Reducer Classes
▪ Setting the Mapper and Reducer classes is optional
▪ If not set in your driver code, Hadoop uses its defaults
─ IdentityMapper
  Input:   mahout  an elephant driver
  Output:  mahout  an elephant driver

─ IdentityReducer
  Input:   bow  [a knot with two loops and two loose ends,
                 a weapon for shooting arrows,
                 a bending of the head or body in respect]
  Output:  bow  a knot with two loops and two loose ends
           bow  a weapon for shooting arrows
           bow  a bending of the head or body in respect

These were mentioned earlier, but important to note that these are provided classes that are used by
default.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-30
Configuring the Job: Specifying the Intermediate Data Types
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class); 1

26
27 job.setOutputKeyClass(Text.class);
28 job.setOutputValueClass(IntWritable.class);
29
30 boolean success = job.waitForCompletion(true);
31 System.exit(success ? 0 : 1);
32 }
33 }

1 Specify the types for the intermediate output keys and values produced by
the Mapper.

Careful: many of the methods available in Job have confusingly similar names. Since they often take the
same parameters, it can be easy to call the wrong one by mistake, leading to errors that could be difficult
to track down. As an example, these two methods differ only in that one says “key” and the other says
“value”; both take a single Class object as a parameter.
Note that if the classes of the intermediate output key and value are identical to the reducer’s output
key and value, you need not call the setMapOutputKeyClass and setMapOutputValueClass
methods. These methods are not called in the sample solutions for the word count example, and comments
in the code explain why the code omits the calls.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-31
Configuring the Job: Specifying the Final Output Data Types
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);
26
27 job.setOutputKeyClass(Text.class);
28 job.setOutputValueClass(IntWritable.class); 1

29
30 boolean success = job.waitForCompletion(true);
31 System.exit(success ? 0 : 1);
32 }
33 }

1 Specify the types for the Reducer’s output keys and values.

As another example of how easy it is to get the method names confused, note the similarity between
“setMapOutputKeyClass” (specifies type of Mapper output key) and “setOutputKeyClass” (specifies type
of Reducer output key). The way to keep these straight is to think about the Reducer as the class that will
write out the final output, hence the method name doesn’t have the word “reduce” in it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-32
Running The Job (1)
18 FileInputFormat.setInputPaths(job, new Path(args[0]));
19 FileOutputFormat.setOutputPath(job, new Path(args[1]));
20
21 job.setMapperClass(WordMapper.class);
22 job.setReducerClass(SumReducer.class);
23
24 job.setMapOutputKeyClass(Text.class);
25 job.setMapOutputValueClass(IntWritable.class);
26
27 job.setOutputKeyClass(Text.class);
28 job.setOutputValueClass(IntWritable.class);
29
30 boolean success = job.waitForCompletion(true);
31 System.exit(success ? 0 : 1); 1
32 }
33 }

1 Start the job and wait for it to complete. Parameter is a Boolean, specifying
verbosity: if true, display progress to the user. Finally, exit with a return
code.

waitForCompletion is a method on the Job class that runs the job based on the configuration we have
supplied. It is synchronous, meaning that control will not move to the next line of code until the job is
done running. It polls the JobTracker for progress information and prints this to the console until the job is
complete.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-33
Running The Job (2)
▪ There are two ways to run your MapReduce job:
─ job.waitForCompletion()
─ Blocks (waits for the job to complete before continuing)
─ job.submit()
─ Does not block (driver code continues as the job is running)
▪ The client determines the proper division of input data into InputSplits, and
then sends the job information to the JobTracker daemon on the cluster

The submit method is asynchronous; it does not wait for the job to complete before executing the next
line of code. You will therefore either call waitForCompletion or submit based on your needs.
For more information on the other points, see page 190 of TDG 3e (TDG 2e, 168).
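If a student asks what the asynchronous version looks like, a sketch such as the one below works. The helper class
and the five-second polling interval are ours, not part of the exercises; the Job is assumed to be fully configured,
as in the WordCount driver.

import org.apache.hadoop.mapreduce.Job;

public class AsyncSubmit {
  // Assumes 'job' has already been fully configured in the driver.
  public static void runWithoutBlocking(Job job) throws Exception {
    job.submit();                    // returns immediately; does not block
    while (!job.isComplete()) {      // the driver could do other work here
      Thread.sleep(5000);            // poll the job's status every five seconds
    }
    System.exit(job.isSuccessful() ? 0 : 1);
  }
}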

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-34
Reprise: Driver Code
 8 public class WordCount {
 9 public static void main(String[] args) throws Exception {
10 if (args.length != 2) {
11 System.out.printf("Usage: WordCount <input dir> <output dir>\n");
12 System.exit(-1);}
13 Job job = new Job();
14 job.setJarByClass(WordCount.class);
15 job.setJobName("Word Count");
16
17 FileInputFormat.setInputPaths(job, new Path(args[0]));
18 FileOutputFormat.setOutputPath(job, new Path(args[1]));
19
20 job.setMapperClass(WordMapper.class);
21 job.setReducerClass(SumReducer.class);
22
23 job.setMapOutputKeyClass(Text.class);
24 job.setMapOutputValueClass(IntWritable.class);
25
26 job.setOutputKeyClass(Text.class);
27 job.setOutputValueClass(IntWritable.class);
28
29 boolean success = job.waitForCompletion(true);
30 System.exit(success ? 0 : 1);}
31 }

Here is the whole driver class again. Now that we've gone over each piece, you should be able to
understand the whole thing. Other drivers are usually just a variation on this, typically with a few more
configuration options. We'll be talking about many common ones throughout the class.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-35
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-36
WordCount Mapper Review

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-37
The Mapper: Complete Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();

    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

Here’s the entire Mapper class. As with the driver, we will go over each part in the next few slides.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-38
The Mapper: import Statements
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper; 1

public class WordMapper extends Mapper<LongWritable, Text, Text,


IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


file edited for space
1 You will typically import java.io.IOException, and the
org.apache.hadoop classes shown, in every Mapper you write. We will omit
the import statements in future slides for brevity.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-39
The Mapper: Main Code

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


}
}
}
}

Once we’ve moved the import statements off the slide for brevity, we see that our Mapper is under ten
lines of code, not including the curly braces.
Before we examine it line-by-line in the next few slides, point out that there is absolutely no code for fault
tolerance or even file I/O here – Hadoop takes care of all these things for you.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-40
The Mapper: Class Declaration (1)

public class WordMapper extends Mapper 1 <LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


}
}
}
}

1 Mapper classes extend the Mapper base class.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-41
The Mapper: Class Declaration (2)

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> 1

{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


}
}
}
}

1 Specify generic types that declare four type parameters: the input key and
value types, and the output (intermediate) key and value types. Keys must be
WritableComparable, and values must be Writable.

You will remember from the pseudocode that the Mapper takes a key/value pair as input and outputs
some number of key/value pairs. Here we must specify the data types of those key/value pairs. It's easy to
remember when you break it down: the first two are the input and the other two are the output. Within
each pair, the first is the key and the second is the value.
As a programmer, you get to choose what the output key and value types are (e.g. use whatever makes
sense for what you need to do), but the input key and value are determined by whatever InputFormat you
specified. Since we didn’t specify the InputFormat in our Driver class, we get the default (TextInputFormat),
which gives us the byte offset (LongWritable) as the key and the line of text (Text) as the value.
NOTE: The ability to specify type parameters in Java using this angle bracket syntax is relatively new. It was
one of the major changes introduced in Java 1.5 (first released to production in late 2004). In Java, this is
called “generics” but in case you have C++ programmers, it’s basically Java’s version of what they know as
“templates” in C++.
This class doesn’t explicitly explain how Generics work, but if your whole class is new to the idea, you may
wish to take a moment to explain.
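One way to make the InputFormat point concrete is to show the default being named explicitly in the driver. This
fragment is illustrative only (the class and method names are ours) and is not in the exercise code.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ExplicitInputFormat {
  public static void configureInput(Job job) {
    // TextInputFormat is already the default; naming it explicitly shows where
    // the Mapper's input types come from: keys are LongWritable byte offsets,
    // values are Text lines, hence Mapper<LongWritable, Text, ...>.
    job.setInputFormatClass(TextInputFormat.class);
  }
}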

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-42
The map Method

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException 1 {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


}
}
}
}

1 The map method is passed a key, a value, and a Context object. The
Context is used to write the intermediate data. It also contains information
about the job’s configuration.

On this slide, the LongWritable parameter represents the input key and the Text parameter next to it
represents the input value. The Context gathers the output key/value pairs, so the first type specified for it
is the output key type while the second is the output value type.
This is a good time to point out that the parameterized types are specified in several places and
they have to match up. The input key and value are specified in both the class signature (“extends
Mapper<LongWritable, Text” and the map method’s signature “map(LongWritable key, Text value” –
point to these on the slide and emphasize that they need to match up (you’ll get a compiler error if they
don’t match). The output key and value are specified in the class signature only. Furthermore, the output
key and value type are also specified in the Driver class – those must match too! If students think that’s
unnecessary duplication, they’re right but it’s a function of how generics were implemented in Java (to
maintain backwards compatibility, using “erasure”): the type information is thrown away at compile time
and is therefore unavailable to the Driver class at runtime. But since it’s needed at runtime, you must
specify it explicitly. This is a shortcoming of Java’s design, not Hadoop’s.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-43
The map Method: Processing The Line (1)

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString(); 1

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


}
}
}
}

1 value is a Text object, so retrieve the string it contains.

A Text object is not a string, it’s just a Writable object that holds a string. Per the API documentation, a Text
object’s toString() method will return the string it contains.
NOTE: The reference to “the API documentation” is intentional. All objects have a toString method in
Java (because this method is defined in java.lang.Object, from which all objects descend). Generally
speaking, an object’s toString() method just provides an arbitrary string representation that is useful
for debugging purposes, but that value isn't necessarily guaranteed to be consistent, and it's bad form to
assume otherwise. The API documentation for the Text object, however, specifically says that the method
will "convert text back to string," so it is safe to rely on.
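A two-line illustration you could sketch if needed (standalone, not part of the course code):

import org.apache.hadoop.io.Text;

public class TextRoundTrip {
  public static void main(String[] args) {
    Text boxed = new Text("mahout");               // a Writable wrapper around a string
    String unboxed = boxed.toString();             // documented to return the wrapped string
    System.out.println(unboxed.equals("mahout"));  // prints: true
  }
}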

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-44
The map Method: Processing The Line (2)

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) { 1

context.write(new Text(word), new IntWritable(1));


}
}
}
}

1 Split the string up into words using a regular expression with non-
alphanumeric characters as the delimiter, and then loop through the words.

These two lines and the one just after it are really the core business logic of your Map function. The rest is
Java’s verbosity and a bunch of curly braces.
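If students are unsure what the regular expression actually does, a plain-Java demonstration (independent of
Hadoop; the sample sentence is ours) makes it obvious, and also shows why the length check is needed:

public class SplitDemo {
  public static void main(String[] args) {
    String line = "The cat sat on the mat, didn't it?";
    // \W+ matches one or more non-word characters, so spaces and punctuation
    // both act as delimiters; a leading delimiter can produce one empty token.
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        System.out.println(word);
      }
    }
    // prints: The, cat, sat, on, the, mat, didn, t, it (one per line)
  }
}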

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-45
The map Method: Outputting Intermediate Data

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1)); 1

}
}
}
}

1 To emit a (key, value) pair, call the write method of the Context object.
The key will be the word itself, the value will be the number 1. Recall that
the output key must be a WritableComparable, and the value must be a
Writable.

Once again, note that there’s no I/O code here. We just call a method to write our key/value pairs and
Hadoop takes care of the rest.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-46
Reprise: The map Method

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {

context.write(new Text(word), new IntWritable(1));


}
}
}
}

And that’s it – the rest is just curly braces.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-47
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-48
WordCount Review: SumReducer
 
 
 

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-49
The Reducer: Complete Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount = 0;

    for (IntWritable value : values) {
      wordCount += value.get();
    }

    context.write(key, new IntWritable(wordCount));
  }
}

Like the Mapper, the Reducer is relatively short and fits on one screen even with all the import statements.
We will go over each part in the next few slides…

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-50
The Reducer: Import Statements
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer; 1

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>


{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


file edited for space
1 As with the Mapper, you will typically import java.io.IOException, and
the org.apache.hadoop classes shown, in every Reducer you write. We will
omit the import statements in future slides for brevity.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-51
The Reducer: Main Code
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


}
}

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-52
The Reducer: Class Declaration

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> 1

{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


}
}

1 Reducer classes extend the Reducer base class. The four generic type
parameters are: input (intermediate) key and value types, and final output
key and value types.

This is quite similar to the class definition for the Mapper: we have a pair of types for the input key and
value, and another pair of types for the output key and value. In fact, the only difference is that we
extend Reducer rather than Mapper.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-53
The reduce Method
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException { 1

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


}
}

1 The reduce method receives a key and an Iterable collection of objects


(which are the values emitted from the Mappers for that key); it also receives
a Context object.

And this is similar to the Mapper too. The main difference is that instead of receiving a single key/value
pair as input like the Mapper does, we receive a single key and an Iterable that lets us step through
each value associated with that key.
NOTE: point out that the types in the class definition and the reduce method signature must match up,
as you did with the Mapper. Also mention that, as with the Mapper, the output key/value types are also
specified in the Driver class so those must match too.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-54
The reduce Method: Processing The Values
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get(); 1
}

context.write(key, new IntWritable(wordCount));


}
}

1 We use the Java for-each syntax to step through all the elements in the
collection. In our example, we are merely adding all the values together. We
use value.get() to retrieve the actual numeric value each time.

These few lines of code make up the business logic of your reducer.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-55
The reduce Method: Writing The Final Output
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount)); 1

}
}

1 Finally, we write the output key-value pair to HDFS using the write method
of our Context object.

This is the same thing we did in the mapper to gather up our output key/value pairs.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-56
Reprise: The Reducer Main Code
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


}
}

Here’s the whole Reducer again, without import statements.


Common question: Can you combine the Driver, Mapper, and Reducer in one Java class (or one
source file)?
Answer: Yes, but unless you have a lot of Java programming experience, I don't recommend it.
Follow-up: OK, I have a lot of Java experience. How do I do it?
Answer: The typical way is for your Mapper and Reducer to be separate static inner classes within your
outer Driver class. It is absolutely essential that the inner classes are declared ‘static’ – if you don’t do this,
you’ll still be able to compile without any warning, but the job will fail with an error that does not make the
root cause very obvious to you. For this reason, we really don’t recommend doing this in class.
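If a student presses for details anyway, a sketch like the one below shows the layout. All class and member names
here are invented for illustration; this is not one of the course solutions.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AllInOneWordCount {

  // Must be static: Hadoop instantiates it by class name, with no enclosing instance
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\W+")) {
        if (word.length() > 0) {
          context.write(new Text(word), new IntWritable(1));
        }
      }
    }
  }

  // Must also be static, for the same reason
  public static class CountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(AllInOneWordCount.class);
    job.setJobName("All-in-one Word Count");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(CountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}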

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-57
Ensure Types Match (1)
▪ Mappers and Reducers declare input and output type parameters
▪ These must match the types used in the class

This pattern is true for both mapper and reducer, though only mapper is shown.
Note that an error here will be caught by the compiler (which is the advantage of using Generics.) If you
tried to call write with, say, Text/LongWritable, the compiler would complain.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-58
Ensure Types Match (2)
▪ Output types must also match those set in the driver

Of course, the mapper output types must match the reducer input types, too.
A mismatch of type here would NOT be caught by the compiler. This would be detected at run time and
cause your Hadoop job to throw an error. For instance, if the output value class were set to LongWritable,
but the actual mapper sent an IntWritable, the error would be
java.io.IOException: Type mismatch in value from map:
expected org.apache.hadoop.io.LongWritable, received
org.apache.hadoop.io.IntWritable.
So be careful.
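A hypothetical driver fragment can make the point: only the commented-out line differs from the correct
configuration, and using it instead would produce exactly the runtime error quoted above. The helper class and
method are ours, not part of the exercises.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class OutputTypeConfig {
  // Illustrative helper, not part of the course code.
  public static void setTypes(Job job) {
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);     // must match what the Mapper writes
    // job.setMapOutputValueClass(LongWritable.class); // compiles, but fails at runtime with
    //   "Type mismatch in value from map" because the Mapper actually emits IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);        // must match what the Reducer writes
  }
}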

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-59
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-60
Integrated Development Environments
▪ There are many Integrated Development Environments (IDEs) available
▪ Eclipse is one such IDE
─ Open source
─ Very popular among Java developers
─ Has plug-ins to speed development in several different languages
▪ Our Hands-On Exercises support the use of Eclipse
─ Eclipse on the VM is preconfigured for Java and Hadoop
─ An Eclipse project is provided for each Java API exercise
─ Hands-on Exercise manual has instructions for using Eclipse in class
▪ If you would prefer to write your code this week using a terminal-based editor
such as vi, we certainly won’t stop you!
─ But using Eclipse can dramatically speed up your development

Eclipse is the IDE with the greatest market share for Java developers, followed by NetBeans (open source
project sponsored by Oracle, and formerly Sun Microsystems) and IntelliJ IDEA (pronounced as “in-TELL-ih
JAY followed by the English word ‘idea’”; this comes from a European company called JetBrains).
The terminal-based editor ‘vi’ (pronounced as two separate letters “vee eye”) is found on every version of
UNIX and Linux.
Eclipse is installed and configured on the VMs used in class, but in case a student asks about setting it up
on their own computer (e.g. back at the office), it can be downloaded at http://www.eclipse.org/downloads/.
The number of distinct versions of Eclipse available at the download site is confusing to
many users, but the version they need is called “Eclipse IDE for Java Developers” (that is verbatim; don’t be
confused by items with very similar names on that page).
There is a plug-in for Hadoop development but it is problematic and at this time doesn’t work with CDH.
The plug-in is not installed on student machines. You might want to mention that we are not using it in this
course.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-61
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-62
Hands-On Exercise: Writing a MapReduce Program in Java
▪ In this exercise, you will write a MapReduce job that reads any text input and
computes the average length of all words that start with each character.
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-63
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-64
Hands-On Exercise: More Practice with MapReduce Java
Programs
▪ In this exercise, you will analyze a log file from a web server to count the
number of hits made from each unique IP address.
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-65
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-66
What Is The Old API?
▪ When Hadoop 0.20 was released, a ‘New API’ was introduced
─ Designed to make the API easier to evolve in the future
─ Favors abstract classes over interfaces
▪ Some developers still use the Old API
─ Until CDH4, the New API was not absolutely feature-complete
▪ All the code examples in this course use the New API

This topic is discussed in further detail in TDG 3e on pages 27-30 (TDG 2e, 25-27).
NOTE: The New API / Old API is completely unrelated to MRv1 (MapReduce in CDH3 and earlier) / MRv2
(next-generation MapReduce, which uses YARN, which will be available along with MRv1 starting in CDH4).
Instructors are advised to avoid confusion by not mentioning MRv2 during this section of class, and if asked
about it, to simply say that it’s unrelated to the old/new API and defer further discussion until later.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-67
New API vs. Old API: Some Key Differences (1)

New API:

  import org.apache.hadoop.mapreduce.*

  Driver code:

  Configuration conf = new Configuration();
  Job job = new Job(conf);
  job.setJarByClass(Driver.class);
  job.setSomeProperty(...);
  ...
  job.waitForCompletion(true);

  Mapper:

  public class MyMapper extends Mapper {
    public void map(Keytype k, Valuetype v, Context c) {
      ...
      c.write(key, val);
    }
  }

Old API:

  import org.apache.hadoop.mapred.*

  Driver code:

  JobConf conf = new JobConf(Driver.class);
  conf.setSomeProperty(...);
  ...
  JobClient.runJob(conf);

  Mapper:

  public class MyMapper extends MapReduceBase implements Mapper {
    public void map(Keytype k, Valuetype v, OutputCollector o, Reporter r) {
      ...
      o.collect(key, val);
    }
  }

Emphasize that they should look for classes in mapreduce package, not mapred. This is particularly
important when using Eclipse: if you use a class name that hasn’t been imported, it will offer to import
it for you, and give you a choice between mapreduce and mapred. If you choose mapred for some and
mapreduce for others, this will cause errors. Your program must be either all new API or all old API.
On this slide, you should point out the similarities as well as the differences between the two APIs. You
should emphasize that they are both doing the same thing and that there are just a few differences in how
they go about it.
You can tell whether a class belongs to the “Old API” or the “New API” based on the package name. The old
API contains “mapred” while the new API contains “mapreduce” instead. This is the most important thing
to keep in mind, because some classes/interfaces have the same name in both APIs. Consequently, when
you are writing your import statements (or generating them with the IDE), you will want to be cautious and
use the one that corresponds whichever API you are using to write your code.
The functions of the OutputCollector and Reporter object have been consolidated into a single Context
object. For this reason, the new API is sometimes called the “Context Objects” API (TDG 3e, page 27 or TDG
2e, page 25).
NOTE: The “Keytype” and “Valuetype” shown in the map method signature aren’t actual classes defined in
the Hadoop API. They are just placeholders for whatever type you use for key and value (e.g. IntWritable and
Text). Also, the generics for the keys and values are not shown in the class definition for the sake of brevity,
but they are used in the new API just as they are in the old API.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-68
New API vs. Old API: Some Key Differences (2)

New API:

  Reducer:

  public class MyReducer extends Reducer {
    public void reduce(Keytype k, Iterable<Valuetype> v, Context c) {
      for (Valuetype eachval : v) {
        // process eachval
        c.write(key, val);
      }
    }
  }

  setup(Context c) (See later)
  cleanup(Context c) (See later)

Old API:

  Reducer:

  public class MyReducer extends MapReduceBase implements Reducer {
    public void reduce(Keytype k, Iterator<Valuetype> v,
                       OutputCollector o, Reporter r) {
      while (v.hasNext()) {
        // process v.next()
        o.collect(key, val);
      }
    }
  }

  configure(JobConf job)
  close()

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-69
MRv1 vs MRv2, Old API vs New API
▪ There is a lot of confusion about the New and Old APIs, and MapReduce
version 1 and MapReduce version 2
▪ The chart below should clarify what is available with each version of
MapReduce

Old API New API

MapReduce v1 ✔ ✔
MapReduce v2 ✔ ✔
▪ Summary: Code using either the Old API or the New API will run under MRv1
and MRv2

Key point: The choice between MR1 and MR2 is determined when the cluster is installed and configured, not by
the developer. Developers choose between the old API and the new API.
Because this is a developer course and not an administrator course, 99% of what we cover in this class
involves the API (in which we will focus exclusively on the new API), and will run equally well no matter
what version of MR the code is deployed on. (the 1% is in occasional references to how job tracking is done,
which is offered for informational/contextual purposes only.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-70
Chapter Topics
Writing a MapReduce Program in
Java

▪ Basic MapReduce API Concepts


▪ Writing MapReduce Applications in Java: The Driver
▪ Writing MapReduce Applications in Java: The Mapper
▪ Writing MapReduce Applications in Java: The Reducer
▪ Speeding up Hadoop Development by Using Eclipse
▪ Hands-On Exercise: Writing a MapReduce Program in Java
▪ Hands-On Exercise: More Practice With MapReduce Java Programs
▪ Differences Between the Old and New MapReduce APIs
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-71
Key Points
▪ InputFormat
─ Parses input files into key/value pairs
▪ WritableComparable, Writable classes
─ “Box” or “wrapper” classes to pass keys and values
▪ Driver
─ Sets InputFormat and input and output types
─ Specifies classes for the Mapper and Reducer
▪ Mapper
─ map() method takes a key/value pair
─ Call Context.write() to output intermediate key/value pairs
▪ Reducer
─ reduce() method takes a key and iterable list of values
─ Call Context.write() to output final key/value pairs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-72
Writing a MapReduce Program
Using Streaming
Chapter 7

Chapter Goal
This chapter needs a goal.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-2
Writing a MapReduce Program Using Streaming
In this chapter, you will learn
▪ How to write MapReduce programs using Hadoop Streaming

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-3
Chapter Topics
Writing a MapReduce Program Using
Streaming

▪ Writing Mappers and Reducers with the Streaming API


▪ Optional Hands-On Exercise: Writing a MapReduce Streaming Program
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-4
The Streaming API: Motivation
▪ Many organizations have developers skilled in languages other than Java,
such as
─ Ruby
─ Python
─ Perl
▪ The Streaming API allows developers to use any language they wish to write
Mappers and Reducers
─ As long as the language can read from standard input and write to standard
output

Input for your Mappers and Reducers comes from the standard input stream (STDIN) and you send the
output to the standard output stream (STDOUT). These concepts should be familiar to UNIX (and even DOS)
users, as well as most programmers (the likely exceptions being those who know only SQL, Visual Basic, or
JavaScript). Hadoop Streaming gets its name from the fact that you’re dealing with “streams” of input and output.
Status reporting and other features generally accessed via the Reporter object in the Hadoop Java API are
generally done by writing specially-constructed messages to the standard error stream (STDERR).
There is some coverage of Hadoop Streaming in TDG 3e (pages 36-40) (TDG 2e, 33-37), but there is much
more in Chuck Lam’s Hadoop in Action book. Additionally, the Hadoop Streaming guide at the Apache site
(http://hadoop.apache.org/common/docs/r0.20.203.0/streaming.html) provides a
lot of detail.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-5
The Streaming API: Advantages and Disadvantages
▪ Advantages of the Streaming API:
─ No need for non-Java coders to learn Java
─ Fast development time
─ Ability to use existing code libraries
▪ Disadvantages of the Streaming API:
─ Performance
─ Primarily suited for handling data that can be represented as text
─ Streaming jobs can use excessive amounts of RAM or fork excessive numbers
of processes
─ Although Mappers and Reducers can be written using the Streaming API,
Partitioners, InputFormats etc. must still be written in Java

Fast development time: this is for two reasons. The first (and more important) is that languages like perl,
Python and Ruby don’t require the developer to compile them before execution (you simply run the source
code). Therefore, this saves a lot of time as compared to the Java development cycle of checkout/modify/
compile/package/deploy when you need to change something. The second reason is that these languages
are often “higher level” and less verbose than Java, so they often take less time to develop.
Ability to use existing code libraries: You may already have a language-specific library you want to use in
your MapReduce code (such as a statistics library in Python or a text-processing library in perl). You don’t
want to have to find Java equivalents, or worse yet, rewrite those libraries in Java. With Hadoop Streaming,
you don’t have to.
Hadoop Streaming does have disadvantages. The biggest is performance: Streaming jobs run about 15%–25%
slower than the equivalent Java code (one reason for this is the cost of sending all the data through
streams). References: http://code.google.com/p/hadoop-stream-mapreduce/wiki/Performance and
http://stackoverflow.com/questions/1482282/java-vs-python-on-hadoop.
Another problem with Streaming is that there are some things (such as creating a custom Partitioner) that
simply cannot be done with Streaming (you have to write the Partitioner in Java). It’s also only suited for
handling data that can be represented as text. Finally, since the Mapper and Reducer code run outside of
the JVM, they are not subject to the resource limitations of the JVM. As such, it’s possible (and fairly easy)
to write a streaming job that spins out of control by either creating a lot of processes (i.e. a “fork bomb”) or
by using an excessive amount of RAM.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-6
How Streaming Works
▪ To implement streaming, write separate Mapper and Reducer programs in the
language(s) of your choice
─ They will receive input via stdin
─ They should write their output to stdout
▪ If TextInputFormat (the default) is used, the streaming Mapper just
receives each line from the file on stdin
─ No key is passed
▪ Mapper and Reducer output should be sent to stdout as
─ key [tab] value [newline]
▪ Separators other than tab can be specified

The third point is saying that output from both the Mapper and the Reducer should be sent to STDOUT as
key/value pairs separated by a tab character, and that you write a newline after each such record.
The way to specify a different separator character is by setting the “stream.map.output.field.separator”
property with the desired character as the value. For an example, see the “Customizing How Lines are
Split into Key/Value Pairs” section in the Hadoop Streaming Guide (http://hadoop.apache.org/
common/docs/r0.20.203.0/streaming.html#Hadoop+Comparator+Class).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-7
Streaming: Example Mapper
▪ Example streaming wordcount Mapper:

#!/usr/bin/env perl
while (<>) { # Read lines from stdin
chomp; # Get rid of the trailing newline
(@words) = split /\W+/; # Create an array of words
foreach $w (@words) { # Loop through the array
print "$w\t1\n"; # Print out the key and value
}
}

Sample code is in ~training_materials/developer/exercises/wordcount/perl_solution/wcmapper.pl


This is an example of the word count Mapper in the perl programming language. As you recall from the
pseudocode and the Java implementation, this Mapper just splits the line of text into words, and then
outputs that word as the key and the literal 1 as the value.
NOTE: For instructors who don’t know perl…
The first line is called the “shebang” line (pronounced “shhh-BANG”) and tells the UNIX shell what program
to use to interpret the script (in this case, it will use perl).
The next line reads one line at a time from standard input.
The next line removes the newline character from that line of input.
The next line splits the line up into into an array of words, based on a pattern (called a “regular expression”)
that matches one or more whitespace characters.
The next line iterates over each word in the array
The next line prints out that word, then a tab character (\t), then the literal value 1, then a newline (\n)
character.
The rest is just closing up the curly braces opened earlier in the program.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-8
Streaming Reducers: Caution
▪ Recall that in Java, all the values associated with a key are passed to the
Reducer as an Iterable
▪ Using Hadoop Streaming, the Reducer receives its input as one key/value pair
per line
▪ Your code will have to keep track of the key so that it can detect when values
from a new key start

There’s no convenient way to pass a single key and a list of values on one line, so Hadoop works around this
by repeating the key for each value on STDIN. For example, there is no good way to supply data to STDIN
like this:
mykeyA: [valueA1, valueA2, valueA3]
mykeyB: [valueB1, valueB2]
So Hadoop instead writes this:
   mykeyA   valueA1
   mykeyA   valueA2
   mykeyA   valueA3
   mykeyB   valueB1
   mykeyB   valueB2
You therefore have to keep track of the current key (e.g. mykeyA) so you can detect when it changes (e.g.
to mykeyB).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-9
Streaming: Example Reducer
▪ Example streaming wordcount Reducer:

#!/usr/bin/env perl
$sum = 0;
$last = "";
while (<>) { # read lines from stdin
($key, $value) = split /\t/; # obtain the key and value
$last = $key if $last eq ""; # first time through
if ($last ne $key) { # has the key changed?
print "$last\t$sum\n"; # if so output last key/value
$last = $key; # start with the new key
$sum = 0; # reset sum for the new key
}
$sum += $value; # add value to tally sum for key
}
print "$key\t$sum\n"; # print the final pair

Sample code is in ~training_materials/developer/exercises/wordcount/perl_solution/wcreducer.pl

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-10
Launching a Streaming Job
▪ To launch a Streaming job, use e.g.:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/\


streaming/hadoop-streaming*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapScript.pl \
-reducer myReduceScript.pl \
-file mycode/myMapScript.pl \
-file mycode/myReduceScript.pl

▪ Many other command-line options are available


─ See the documentation for full details
▪ Note that system commands can be used as a Streaming Mapper or Reducer
─ For example: awk, grep, sed, or wc

The name of the hadoop-streaming JAR file contains the version number of Hadoop, so using the asterisk avoids
having to specify it explicitly.
Common question: Why do you have to specify the Mapper and Reducer scripts twice (i.e. once in the
-mapper argument and again in the -file argument)?
Answer: The -mapper argument says what command to run as the mapper for your job. The -file argument
says that this file should be copied throughout your cluster.
Follow-up: So why isn’t that the default?
Answer: You can also use system binaries like ‘grep’ or ‘sort’ or ‘wc’ for your Mapper or Reducer, and since
those already exist on every node in the cluster (i.e. in /usr/bin), there is no need to copy them.
NOTE: The -file argument is copying things into the Distributed Cache, which is discussed in greater detail
later in class.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-11
Chapter Topics
Writing a MapReduce Program Using
Streaming

▪ Writing Mappers and Reducers with the Streaming API


▪ Optional Hands-On Exercise: Writing a MapReduce Streaming Program
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-12
Hands-On Exercise: Writing a MapReduce Streaming Program
▪ In this exercise, you will implement the Average Word Length program in a
scripting language of your choice (e.g. Perl, Python, etc.)
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-13
Chapter Topics
Writing a MapReduce Program Using
Streaming

▪ Writing Mappers and Reducers with the Streaming API


▪ Optional Hands-On Exercise: Writing a MapReduce Streaming Program
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-14
Key Points
▪ The Hadoop Streaming API allows you to write Mappers and Reducers in any
language
─ Data is read from stdin and written to stdout
▪ Other Hadoop components (InputFormats, Partitioners, etc.) still require Java

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-15
Unit Testing MapReduce
Programs
Chapter 8

Chapter Goal
This chapter needs a goal.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-2
Unit Testing MapReduce Programs
In this chapter, you will learn
▪ What unit testing is, and why you should write unit tests
▪ What the JUnit testing framework is, and how MRUnit builds on the JUnit
framework
▪ How to write unit tests with MRUnit
▪ How to run unit tests

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-3
Chapter Topics
Unit Testing MapReduce Programs

▪ Unit Testing
▪ The JUnit and MRUnit testing frameworks
▪ Writing unit tests with MRUnit
▪ Running Unit Tests
▪ Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-4
An Introduction to Unit Testing
▪ A ‘unit’ is a small piece of your code
─ A small piece of functionality
▪ A unit test verifies the correctness of that unit of code
─ A purist might say that in a well-written unit test, only a single ‘thing’ should
be able to fail
─ Generally accepted rule-of-thumb: a unit test should take less than a second
to complete

There are various kinds of testing, including unit testing, performance testing, regression testing (verifying
that no previously fixed bugs were re-introduced) and integration testing (verifying that a set of software
components work together). Unit testing is the foundation of all of these.
A good way to explain the concept is to ask the students to imagine building a deck (or porch) for their
house. In order for it to be sturdy, all the boards obviously have to fit together and be properly attached
(integration testing analogy). But even the most well-built structure is only as good as the weakest of its
individual parts, so it’s essential to prove that your boards aren’t rotten and that your nails aren’t rusted
(unit testing analogy). Similarly, a Java program is primarily composed of a series of method calls, so before
we can verify that they work together as expected, we must ensure that each works independently as
expected. That’s what unit testing does.
Unit testing generally works by asserting that, given some known input, you will get a certain expected
output (or perhaps an expected error). For a simple example, imagine that you have a method that takes
an array of numbers and returns their sum. You would likely have a unit test which asserts that, given the
input [3, 5, 4] this method will return 12. This may seem like a trivial example (and to some degree, it is)
but there are still many things that could go wrong. Does the program omit the first or last element (the
‘off-by-one’ bug common to people who aren’t used to zero-based indexing used in Java)? Does it handle
zero or negative numbers properly? How does it handle numbers defined in scientific notation? What if
the method is defined to return an int value, but you pass in multiple elements of Integer.MAX_VALUE,
thus overflowing the value an int can hold? You would likely want to have unit tests to verify that your code
works as expected in all of these cases. Given this simple example, it should be clear that unit testing is
even more helpful for less trivial code.
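If you want to make the summing example concrete, a JUnit 4 sketch like the following works. The Summer class and
its add method are invented purely for illustration; they are not part of the course materials.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class SummerTest {

  // Hypothetical class under test
  static class Summer {
    int add(int[] numbers) {
      int total = 0;
      for (int n : numbers) {
        total += n;
      }
      return total;
    }
  }

  @Test
  public void sumsThreePositiveNumbers() {
    assertEquals(12, new Summer().add(new int[] {3, 5, 4}));
  }

  @Test
  public void sumsEmptyArrayToZero() {
    assertEquals(0, new Summer().add(new int[] {}));
  }
}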

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-5
Why Write Unit Tests?
▪ Unit testing provides verification that your code is functioning correctly
▪ Much faster than testing your entire program each time you modify the code
─ Fastest MapReduce job on a cluster will take many seconds
─ Even in pseudo-distributed mode
─ Even running in LocalJobRunner mode will take several seconds
─ LocalJobRunner mode is discussed later in the course
─ Unit tests help you iterate faster in your code development

Unit testing is unfortunately a controversial topic among some Java developers. Although it seems
counterintuitive that writing more code (the program plus its unit tests) will save you time versus writing
just the program itself, most people with unit testing experience will tell you that it really does work out
that way. Although the development process takes longer – you have to write more code, after all – this
is a one-time cost. The debugging process will take significantly less time when you have unit tests, and
debugging is usually an ongoing cost because bugs aren’t just found before you release your program, they
are also found periodically in production systems.
And once you invest in your code by creating unit tests, you will benefit from them later. They often help
you to detect unintended consequences which arise from a change in one section of code that wind up
breaking another section of code – exactly the type of thing that can be hard to diagnose without unit tests.
Perhaps most valuable of all is the ability to set up a Continuous Integration server like Jenkins, Hudson or
CruiseControl which monitors your source control system and runs all the unit tests every time a developer
commits code. This helps detect errors as soon as they’re introduced, when they are easier (and therefore
less expensive) to fix.
As valuable as unit tests are in regular Java development, they are even more valuable in Hadoop because
they allow you a very quick turnaround time for verifying that your code works as expected. Submitting
even the simplest job to your cluster will take several seconds to complete, while it’s easy to run a unit test
right from your IDE in a fraction of a second.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-6
Chapter Topics
Unit Testing MapReduce Programs

▪ Unit Testing
▪ The JUnit and MRUnit testing frameworks
▪ Writing unit tests with MRUnit
▪ Running Unit Tests
▪ Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-7
Why MRUnit?
▪ JUnit is a popular Java unit testing framework
▪ Problem: JUnit cannot be used directly to test Mappers or Reducers
─ Unit tests require mocking up classes in the MapReduce framework
─ A lot of work
▪ MRUnit is built on top of JUnit
─ Works with the mockito framework to provide required mock objects
▪ Allows you to test your code from within an IDE
─ Much easier to debug

JUnit is a general-purpose unit testing framework for Java -- there’s nothing specific to Hadoop about it.
The hooks for testing Hadoop code are provided by MRUnit (pronounced as “M R unit” not “mister unit”).
Mocking refers to the process of creating code to simulate something you’d need to interact with in your
test. If you were trying to test a Mapper, for example, you would need to mock up an InputSplit to handle
input so the Mapper produced the correct results. While it’s certainly possible to do that, it would be extra
work. MRUnit provides mock objects so you don’t need to write them yourself.
JUnit is not the only unit testing framework for Java, but it is certainly the most well-known.
A good introduction to JUnit for instructors who are not yet familiar with it is here (http://
pub.admc.com/howtos/junit4x/).
All major IDEs (Eclipse, NetBeans, IntelliJ IDEA) have support for running JUnit tests from within the IDE.
You can run a single test or multiple tests in one run (you can even run all the tests for an entire project in
one run). When you execute a unit test in one of these IDEs, the IDE will display a green bar if the unit test
passed or a red bar if it failed.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-8
JUnit Basics (1)
▪ @Test
─ Java annotation
─ Indicates that this method is a test which JUnit should execute
▪ @Before
─ Java annotation
─ Tells JUnit to call this method before every @Test method
─ Two @Test methods would result in the @Before method being called
twice

We’re using JUnit 4 in class; the annotation-based style shown here requires JUnit 4 or later.
Annotations are a feature introduced in Java 1.5, so developers whose experience is only with older
versions of Java might not be familiar with them. Support for annotations is a new feature of JUnit 4 (earlier
versions of JUnit identified test methods by their name, which was required to begin with ‘test’).
The @Before annotation is used for per-test setup. In a method identified by @Before, you will typically
do things like initialize variables used by multiple tests to some desired state. Likewise, there is an @After
annotation which runs following each test (so you can perform any tear-down procedure), but it is rarely
needed. As the slide explains, these methods are run for each test method (of which there are usually
several in a single Java source file for the unit test). There are also annotations (@BeforeClass and
@AfterClass) which will run once for the entire class (rather than once for each test method in that
class), but it’s somewhat rare to need to do that (and usually a sign that your tests are not independent
enough from one another).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-9
JUnit Basics (2)
▪ JUnit assertion methods:
─ assertEquals(), assertNotNull(), etc.
─ Fail if the conditions of the statement are not met
─ fail(msg)
─ Explicitly fails the test with the given error message
▪ With a JUnit test open in Eclipse, run all tests in the class by going to
Run → Run
▪ Eclipse also provides functionality to run all JUnit tests in your project
▪ Other IDEs have similar functionality

JUnit defines a number of assertions. If you want to ensure that two values are equal to one another, use
the “assertEquals” method. For example, if you created a Calculator class that has a method to add two
numbers together, you might define a test like this:
@Test
public void verifyAddingTwoPositiveNumbers() {
   assertEquals(10, myCalculator.add(3, 7));
}
It is customary that the expected value is the first of these two arguments. There are two other things to
be aware of when testing equality in Java. The first is that floating point values (float or double primitives)
often cannot be represented precisely, so you have to call a three-argument version of assertEquals to compare
them (the third argument is a tolerance value; if the difference between the first two arguments exceeds
this tolerance, the test will fail). The other problem relates to how Java deals with object equality. This
is covered in detail in introductory Java texts, but the synopsis is that Java makes a distinction between
object equality (the Object.equals() method, verified in JUnit with assertEquals()) and object identity (the ==
operator, verified in JUnit with the assertSame() method). This may be a stumbling block for programmers
whose experience is in C++ instead of Java.
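To make this concrete, here is a minimal sketch (not from the course materials) showing the three-argument
assertEquals for floating-point values and the difference between assertEquals and assertSame:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertSame;

import org.junit.Test;

public class EqualityExamplesTest {

  @Test
  public void verifyFloatingPointComparison() {
    double area = Math.PI * 2.0 * 2.0;   // roughly 12.566
    // The third argument is the tolerance; the test passes if the difference
    // between the expected and actual values is no greater than 0.001
    assertEquals(12.566, area, 0.001);
  }

  @Test
  public void verifyObjectEqualityVersusIdentity() {
    String a = new String("hadoop");
    String b = new String("hadoop");
    assertEquals(a, b);   // passes: the two strings are equal
    assertSame(a, a);     // passes: same object instance
    // assertSame(a, b) would fail: equal values, but two distinct instances
  }
}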

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-10
JUnit: Example Code
import static org.junit.Assert.assertEquals;

import org.junit.Before;
import org.junit.Test;

public class JUnitHelloWorld {

  protected String s;

  @Before
  public void setup() {
    s = "HELLO WORLD";
  }

  @Test
  public void testHelloWorldSuccess() {
    s = s.toLowerCase();
    assertEquals("hello world", s);
  }

  // following will fail even if testHelloWorldSuccess is run first
  @Test
  public void testHelloWorldFail() {
    assertEquals("hello world", s);
  }
}

Note that the first import line contains the ‘static’ keyword. This is a new feature added in Java 1.5 which
allows us to import static methods from a class so we can use them as if they were defined locally (this is
why we can call assertEquals in the tests instead of having to call Assert.assertEquals).
Point out that this class contains one setup method (identified by the @Before annotation) and two tests
(identified by the @Test annotation). Also, this is just basic JUnit stuff – we haven’t introduced MRUnit or
anything specific to Hadoop yet.
Unlike in JUnit 3.x and earlier versions, our test case does not have to implement or extend anything from
JUnit – it’s just a regular Java class.
testHelloWorldFail will fail because the @Before method is called just before the test – it is called before
each test (and thus resets the state of the string which was modified in the earlier test).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-11
Chapter Topics
Unit Testing MapReduce Programs

▪ Unit Testing
▪ The JUnit and MRUnit testing frameworks
▪ Writing unit tests with MRUnit
▪ Running Unit Tests
▪ Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-12
Using MRUnit to Test MapReduce Code
▪ MRUnit builds on top of JUnit
▪ Provides a mock InputSplit and other classes
▪ Can test just the Mapper, just the Reducer, or the full MapReduce flow

MRUnit was developed by Cloudera, but donated to the Apache project. You can find the MRUnit project at
the following URL: http://mrunit.apache.org.
There are some things (like a Partitioner) for which there is not yet support for testing in MRUnit. In
these cases, you will just use regular JUnit methods for testing them (i.e. instantiate a Partitioner, call its
getPartition() method by passing in a key, a value, and the number of partitions, then use assertEquals to
ensure that you get the expected value back). You should not cover this except in response to a student
question.
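If the question does come up, a plain JUnit test along those lines might look like the following sketch, which
uses the built-in HashPartitioner purely as an example:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.junit.Test;

public class TestPartitioner {

  @Test
  public void testPartitionIsStableAndInRange() {
    HashPartitioner<Text, IntWritable> partitioner =
        new HashPartitioner<Text, IntWritable>();
    int numPartitions = 4;

    int first = partitioner.getPartition(new Text("cat"), new IntWritable(1), numPartitions);
    int second = partitioner.getPartition(new Text("cat"), new IntWritable(1), numPartitions);

    // The same key must always be assigned to the same partition
    assertEquals(first, second);
    // The partition number must be within the valid range
    assertTrue(first >= 0 && first < numPartitions);
  }
}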

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-13
MRUnit: Example Code – Mapper Unit Test (1)
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCount {

  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    WordMapper mapper = new WordMapper();
    mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
    mapDriver.setMapper(mapper);
  }

  @Test
  public void testMapper() {
    mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
    mapDriver.withOutput(new Text("cat"), new IntWritable(1));
    mapDriver.withOutput(new Text("dog"), new IntWritable(1));
    mapDriver.runTest();
  }
}

This slide shows a complete unit test for the Mapper from the WordCount example. The next several slides
explain it one part at a time.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-14
MRUnit: Example Code – Mapper Unit Test (2)
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test; 1

public class TestWordCount {


MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}
@Test
public void testMapper() {
...
file edited for space
1 Import the relevant JUnit classes and the MRUnit MapDriver class as we will
be writing a unit test for our Mapper. We will omit the import statements in
future slides for brevity.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-15
MRUnit: Example Code – Mapper Unit Test (3)
public class TestWordCount {
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver; 1

@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}
}

1 MapDriver is an MRUnit class (not a user-defined driver).

Here you are declaring a variable mapDriver which will be instantiated in the setUp method (next slide)
and then used in the testMapper method. Since there is only one test method in this class, you could
have done all of this inside the testMapper method, but doing it as shown here makes it easy to add new
tests to the class. You might, for example, have one test that verifies that it parses expected input correctly
and another test that verifies it throws an exception when passed invalid input (like a null value or empty
string).
Just as when you wrote the Mapper itself, you use generics to define the input key type (LongWritable,
in this case), the input value type (Text), the output key type (Text) and the output value type
(IntWritable).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-16
MRUnit: Example Code – Mapper Unit Test (4)
public class TestWordCount {
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper); 1
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}
}

1 Set up the test. This method will be called before every test, just as with
JUnit.

The first line is creating an instance of your Mapper class (which you wrote separately).
The second line is instantiating the MapDriver that you declared above (i.e. described on the previous
slide). The MapDriver is a MRUnit class that lets us test a Mapper.
The third line tells the MapDriver which Mapper we want to test.
Again, this is all called before each test is run. If we were doing multiple tests in a single class (which is
common), then this ensures that one test isn’t able to affect the outcome of another test because we’re
resetting the state here.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-17
MRUnit: Example Code – Mapper Unit Test (5)
public class TestWordCount {
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest(); 1
}
}

1 The test itself. Note that the order in which the output is specified is
important – it must match the order in which the output will be created by the
Mapper.

This test can be summarized by saying that, given a line of text “cat dog”, the Mapper should emit two
key/value pairs as output. The first will have the key “cat” with the value 1, while the second will have the key
“dog” with the value 1. The runTest method executes the Mapper given this input and verifies we get
back the result we expect.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-18
MRUnit Drivers (1)
▪ MRUnit has a MapDriver, a ReduceDriver, and a MapReduceDriver
▪ Methods to specify test input and output:
─ withInput
─ Specifies input to the Mapper/Reducer
─ Builder method that can be chained
─ withOutput
─ Specifies expected output from the Mapper/Reducer
─ Builder method that can be chained
─ addOutput
─ Similar to withOutput but returns void

You can use the withInput method (potentially many times in succession) to specify input to your mapper.
The “builder” pattern means that each of these methods returns the driver object itself, so you can chain calls together.
The withOutput methods work similarly, except that you are specifying what you expect the Mapper to
generate as output. The order in which output is defined matters – a test will fail if the order of the values
emitted by the mapper does not match the order in which you specified them using calls to the withOutput
(or addOutput) method.
The addOutput method does the same thing as withOutput, but does not return anything, so it cannot be used
for chaining. Therefore, this statement:
mapDriver.withOutput(new Text("cat"), new IntWritable(1)).withOutput(new Text("dog"), new IntWritable(1));
does the same thing as these two statements:
mapDriver.addOutput(new Text("cat"), new IntWritable(1));
mapDriver.addOutput(new Text("dog"), new IntWritable(1));
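The slide also lists MapReduceDriver, which tests the Mapper and Reducer together; a sketch of such a test,
assuming the WordMapper and SumReducer classes used elsewhere in the course, might look like this (note that
the expected output must be listed in sorted key order, since the full flow shuffles and sorts):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCountFlow {

  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

  @Before
  public void setUp() {
    mapReduceDriver =
        new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();
    mapReduceDriver.setMapper(new WordMapper());
    mapReduceDriver.setReducer(new SumReducer());
  }

  @Test
  public void testMapReduce() {
    mapReduceDriver.withInput(new LongWritable(1), new Text("cat dog cat"));
    // Output from the full flow is grouped and sorted by key
    mapReduceDriver.withOutput(new Text("cat"), new IntWritable(2));
    mapReduceDriver.withOutput(new Text("dog"), new IntWritable(1));
    mapReduceDriver.runTest();
  }
}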

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-19
MRUnit Drivers (2)
▪ Methods to run tests:
─ runTest
─ Runs the test and verifies the output
─ run
─ Runs the test and returns the result set
─ Ignores previous withOutput and addOutput calls
▪ Drivers take a single (key, value) pair as input
▪ Can take multiple (key, value) pairs as expected output
▪ If you are calling driver.runTest() or driver.run() multiple times,
call driver.resetOutput() between each call
─ MRUnit will fail if you do not do this

Both runTest and run execute the test; the difference is that runTest also verifies results while the run
method leaves verification up to you. You’ll generally call runTest when verifying a Mapper or a Reducer,
but the run method is better when the Mapper isn’t meant to generate output (such as a unit test for the
Map-only counter lab).
The fourth point is talking about the case in which you call driver.runTest multiple times in the same test
(in other words, without the setUp method being called again to reset the state). In this case, you need to
reset the state manually using driver.resetOutput() before you call driver.runTest() again or the test will fail.
However, a better strategy is usually to split a complex test like this into multiple tests which each test one
specific part of what the original test tried to do.
This slide is oriented towards testing Mappers, which receive a single key/value pair as input and emit zero
or more key/values as output. Reducers, on the other hand, take as input a single key and a corresponding
collection of all values for that key. As with the MapDriver, there are multiple equivalent ways to specify
input for a reducer in a test, but the easiest to understand is this:
// we are simulating input with key "foo" and values [1, 1, 1]
List<IntWritable> values = new ArrayList<IntWritable>();
values.add(new IntWritable(1));
values.add(new IntWritable(1));
values.add(new IntWritable(1));
reducerDriver.withInput(new Text("foo"), values);
To test for no output, you can omit the withOutput call:
mapDriver.withInput(new LongWritable(1), new Text(""));
mapDriver.runTest();
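Building on that input pattern, a complete ReduceDriver test for the course’s SumReducer might look like the
following sketch:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class TestSumReducer {

  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

  @Before
  public void setUp() {
    reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
    reduceDriver.setReducer(new SumReducer());
  }

  @Test
  public void testReducer() {
    // Simulate the key "cat" arriving with the values [1, 1, 1]
    List<IntWritable> values = Arrays.asList(
        new IntWritable(1), new IntWritable(1), new IntWritable(1));
    reduceDriver.withInput(new Text("cat"), values);
    reduceDriver.withOutput(new Text("cat"), new IntWritable(3));
    reduceDriver.runTest();
  }
}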

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-20
MRUnit Conclusions
▪ You should write unit tests for your code!
▪ We recommend writing unit tests in Hands-On Exercises in the rest of the
course
─ This will help greatly in debugging your code

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-21
Chapter Topics
Unit Testing MapReduce Programs

▪ Unit Testing
▪ The JUnit and MRUnit testing frameworks
▪ Writing unit tests with MRUnit
▪ Running Unit Tests
▪ Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-22
Running Unit Tests From Eclipse

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-23
Compiling and Running Unit Tests From the Command Line

$ javac -classpath `hadoop classpath`:\


/home/training/lib/mrunit-0.9.0-incubating-hadoop2.jar:. *.java

$ java -cp `hadoop classpath`:/home/training/lib/\


mrunit-0.9.0-incubating-hadoop2.jar:. \
org.junit.runner.JUnitCore TestWordCount

JUnit version 4.8.2


...
Time: 0.51

OK (3 tests)

The command on the slide runs the TestWordCount application from the command line.
If you want to demo this, you must compile all three Java programs in the mrunit/sample_solution
directory. Use the following CLASSPATH:
`hadoop classpath`:/home/training/lib/mrunit-0.9.0-incubating-
hadoop2.jar:.
• `hadoop classpath` brings in the CLASSPATH required for all Hadoop compiles
• /home/training/lib/mrunit-0.9.0-incubating-hadoop2.jar has the classes
needed for MRUnit
• . is needed for TestWordCount.java to pick up the Mapper and Reducer from the current
working directory
Once the classes have been compiled, you can run the command in the slide.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-24
Chapter Topics
Unit Testing MapReduce Programs

▪ Unit Testing
▪ The JUnit and MRUnit testing frameworks
▪ Writing unit tests with MRUnit
▪ Running Unit Tests
▪ Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-25
Hands-On Exercise: Writing Unit Tests With the MRUnit
Framework
▪ In this Hands-On Exercise, you will practice creating unit tests
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-26
Chapter Topics
Unit Testing MapReduce Programs

▪ Unit Testing
▪ The JUnit and MRUnit testing frameworks
▪ Writing unit tests with MRUnit
▪ Running Unit Tests
▪ Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-27
Key Points
▪ Unit testing is important
▪ MRUnit is a framework for MapReduce programs
─ Built on JUnit
▪ You can write tests for Mappers and Reducers individually, and for both
together
▪ Run tests from the command line, Eclipse, or other IDE
▪ Best practice: always write unit tests!

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-28
Delving Deeper into the
Hadoop API
Chapter 9

Chapter Goal
Introduce features of the Hadoop API beyond basic MapReduce: the ToolRunner class, the setup and cleanup methods, Combiners, programmatic HDFS access, the Distributed Cache, and the library of predefined Mappers, Reducers, and Partitioners.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-2
Delving Deeper into the Hadoop API
In this chapter, you will learn
▪ How to use the ToolRunner class
▪ How to decrease the amount of intermediate data with Combiners
▪ How to set up and tear down Mappers and Reducers using the setup and
cleanup methods
▪ How to access HDFS programmatically
▪ How to use the distributed cache
▪ How to use the Hadoop API’s library of Mappers, Reducers, and Partitioners

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-3
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-4
Why Use ToolRunner?
▪ You can use ToolRunner in MapReduce driver classes
─ This is not required, but is a best practice
▪ ToolRunner uses the GenericOptionsParser class internally
─ Allows you to specify configuration options on the command line
─ Also allows you to specify items for the Distributed Cache on the command
line (see later)

Why would you want to be able to specify these options on the command line instead of putting the
equivalent code in your Driver class? Because it’s faster and more flexible, since you won’t have to check
out your code from source control, modify it, compile it and then build a new JAR. This is especially helpful
when you want to run the same job each time with slight variations (to test different optimizations, for
example). This also allows you to, for example, submit jobs to different clusters by specifying a different
NameNode and JobTracker.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-5
How to Implement ToolRunner: Complete Driver (1)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

public static void main(String[] args) throws Exception {


int exitCode =
ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(exitCode);
}

public int run(String[] args) throws Exception {


if (args.length != 2) {
System.out.printf(
"Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1; }
file continued on next slide

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-6
How to Implement ToolRunner: Complete Driver (2)

Job job = new Job(getConf());


job.setJarByClass(WordCount.class);
job.setJobName("Word Count");

FileInputFormat.setInputPaths(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

boolean success = job.waitForCompletion(true);


return success ? 0 : 1;
}
}

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-7
How to Implement ToolRunner: Imports

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner; 1

public class WordCount extends Configured implements Tool {

public static void main(String[] args) throws Exception {


int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(exitCode);
}
...

1 Import the relevant classes. We omit the import statements in future slides
for brevity.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-8
How to Implement ToolRunner: Driver Class Definition

public class WordCount extends Configured implements Tool { 1

public static void main(String[] args) throws Exception {


int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(exitCode);
}

public int run(String[] args) throws Exception {

if (args.length != 2) {
System.out.printf(
"Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1;}
Job job = new Job(getConf());
job.setJarByClass(WordCount.class); job.setJobName("Word Count");

FileInputFormat.setInputPaths(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));
...

1 The driver class implements the Tool interface and extends the Configured
class.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-9
How to Implement ToolRunner: Main Method
public class WordCount extends Configured implements Tool {

public static void main(String[] args) throws Exception { 1

int exitCode = ToolRunner.run(new Configuration(),


new WordCount(), args);
System.exit(exitCode);
}

public int run(String[] args) throws Exception {


if (args.length != 2) {
System.out.printf(
"Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1;
}
Job job = new Job(getConf());
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
...

1 The driver main method calls ToolRunner.run.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-10
How to Implement ToolRunner: Run Method
public class WordCount extends Configured implements Tool {

public static void main(String[] args) throws Exception {


int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(exitCode);
}

public int run(String[] args) throws Exception { 1

if (args.length != 2) {
System.out.printf(
"Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1;
}
Job job = new Job(getConf());
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");

FileInputFormat.setInputPaths(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));
...

1 The driver run method creates, configures, and submits the job.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-11
ToolRunner Command Line Options
▪ ToolRunner allows the user to specify configuration options on the command
line
▪ Commonly used to specify Hadoop properties using the -D flag
─ Will override any default or site properties in the configuration
─ But will not override those set in the driver code

$ hadoop jar myjar.jar MyDriver \


-D mapred.reduce.tasks=10 myinputdir myoutputdir

▪ Note that -D options must appear before any additional program arguments
▪ Can specify an XML configuration file with -conf
▪ Can specify the default filesystem with -fs uri
─ Shortcut for -D fs.default.name=uri

According to TDG 2e page 137, the space between the -D and the property name is required when
specifying a Hadoop property. TDG 3e page 153 states that this is no longer the case with recent versions
of CDH or Apache Hadoop (see HADOOP-7325 for details). However, the position in the command line at
which you specify the -D is significant. Consider these two examples:
# This one sets a Hadoop property named ‘zipcode’ because the -D option follows the Hadoop command.
# The ‘args’ array inside the program will have two elements [‘foo’, ‘bar’]
$ hadoop MyProgram -D zipcode=90210 foo bar
# This one does not set a Hadoop property named ‘zipcode’, because the -D option follows the program
arguments,
# so this information is interpreted as additional program arguments.
# The ‘args’ array inside the program will have four elements [‘foo’, ‘bar’, ‘-D’, ‘zipcode=90210’]
$ hadoop MyProgram foo bar -D zipcode=90210
DEPRECATED CONFIGURATION OPTIONS: CDH4 uses MR1…therefore it uses old/deprecated configuration
names, contrary to what prior versions of this class said. This applies only to MapReduce configuration
settings (e.g. mapred.reduce.tasks works, mapreduce.job.reduces does not). HDFS configuration settings
work either way (e.g. dfs.block.size and dfs.blocksize both work)
For a list of the properties deprecated in CDH 4, refer to http://hadoop.apache.org/docs/
current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html or to
Hadoop Operations by Eric Sammer, Appendix A.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-12
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-13
The setup Method
▪ It is common to want your Mapper or Reducer to execute some code before
the map or reduce method is called for the first time
─ Initialize data structures
─ Read data from an external file
─ Set parameters
▪ The setup method is run before the map or reduce method is called for the
first time

public void setup(Context context)

The setup method is a lifecycle method available in both the Mapper or Reducer that lets you run
arbitrary code after your Mapper or Reducer has been created but before it processes the first record. If
your class has people with Java-based Web development experience, you might mention that this is similar
in concept to the Servlet’s init method.
Later in class, we will discuss how to join data sets together at which point the need for the setup method
will be more clear.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-14
The cleanup Method
▪ Similarly, you may wish to perform some action(s) after all the records have
been processed by your Mapper or Reducer
▪ The cleanup method is called before the Mapper or Reducer terminates

public void cleanup(Context context) throws


IOException, InterruptedException

The cleanup method is the counterpart to setup. It’s called at the end of your Mapper’s or Reducer’s
lifecycle, after it has processed the last record. You can use it for closing any resources you might have
opened in your setup method.
In case it’s not clear at this point in the class, you should emphasize that “lifecycle of a Mapper” means the
lifecycle of a single Mapper, not of the entire Map phase collectively. Since a Mapper processes a single
InputSplit (and since an InputSplit generally corresponds to one block in HDFS), the Task Tracker
spawns a Mapper to process ~ 64 MB of data (assuming the default HDFS block size). The setup method
will be called before the first record is processed. The map method in that Mapper is called once for each
record in that 64 MB split. Once all records in that split have been processed, the cleanup method is
called and this specific Mapper instance exits.
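As an illustration of the lifecycle, here is a minimal sketch (not one of the course exercises) of a Mapper
that initializes a counter in setup, updates it in map, and emits a single total from cleanup:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: count the records seen by this Mapper instance and emit one total.
public class RecordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long recordCount;

  @Override
  public void setup(Context context) {
    // Called once, before the first call to map
    recordCount = 0;
  }

  @Override
  public void map(LongWritable key, Text value, Context context) {
    // Called once per record in this Mapper's InputSplit
    recordCount++;
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    // Called once, after the last record in the split has been processed
    context.write(new Text("records"), new LongWritable(recordCount));
  }
}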

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-15
Passing Parameters
public class MyDriverClass {
public int main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.setInt ("paramname",value);
Job job = new Job(conf);
...
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}

public class MyMapper extends Mapper {

public void setup(Context context) {


Configuration conf = context.getConfiguration();
int myParam = conf.getInt("paramname", 0);
...
}
public void map...
}

You can use a Hadoop Configuration object to pass values from the driver to the mapper or reducer.
In this example, the configuration is instantiated in the driver, then passed as a parameter when the Job
is instantiated. Then, in the Mapper’s setup method, the Configuration is accessed through the
Context, and the parameter is retrieved by using the Configuration.getInt method.
Setting and retrieving parameters through the Configuration object is similar in concept to the Java
Preferences API. You should use this for setting small (“lightweight”) values like numbers and relatively
short strings. The corresponding “get” method has a default value (0 in this example) which is used if the
parameter was never set (once again, just as the Java Preferences API does).
There are methods for setting various types of values, including int, long, float, boolean, enums and Strings.
Why might you want to pass a parameter? Because the same Mapper or Reducer may be used in multiple
situations. A parameter can allow the driver code to determine how the mapper works. Example: A
DateFilter mapper that filters out log data that occurred before a certain date. The date would be specified
as a parameter.
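A sketch of that DateFilter idea is shown below; the parameter name and the assumption that each log line
begins with an epoch timestamp are illustrative choices, not part of the course exercises:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example: assumes each input line starts with an epoch timestamp
// followed by a space. The cutoff is supplied by the driver, e.g.
// conf.setLong("datefilter.cutoff", someEpochMillis);
public class DateFilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private long cutoff;

  @Override
  public void setup(Context context) {
    // 0 is the default if the driver never set the parameter
    cutoff = context.getConfiguration().getLong("datefilter.cutoff", 0L);
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int space = line.indexOf(' ');
    if (space > 0) {
      long timestamp = Long.parseLong(line.substring(0, space));
      if (timestamp >= cutoff) {
        context.write(key, value);   // keep records at or after the cutoff
      }
    }
  }
}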

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-16
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-17
Hands-On Exercise: Using ToolRunner and Passing
Parameters
▪ In this Hands-On Exercise, you will practice
─ Using ToolRunner to implement a driver
─ Passing a configuration parameter to a Mapper
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-18
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-19
The Combiner
▪ Often, Mappers produce large amounts of intermediate data
─ That data must be passed to the Reducers
─ This can result in a lot of network traffic
▪ It is often possible to specify a Combiner
─ Like a ‘mini-Reducer’
─ Runs locally on a single Mapper’s output
─ Output from the Combiner is sent to the Reducers
▪ Combiner and Reducer code are often identical
─ Technically, this is possible if the operation performed is commutative and
associative
─ Input and output data types for the Combiner/Reducer must be identical

Data locality is for Mappers only – it doesn’t apply to Reducers. Since every Mapper can potentially
generate records with any key, every Reducer must generally pull data from every Mapper.
Combiners reduce not only network traffic, but also disk I/O.
Commutative means that “order doesn’t matter” (for example, 3 + 6 yields the same result as 6 + 3 because
addition is commutative. Conversely, division is not commutative, so 6 / 3 yields a different result than 3 /
6).
Associative means that “grouping doesn’t matter” (for example, (3 + 6 + 8) + (5 + 4) yields the same result
as (3 + 6) + (8 + 5 + 4) because addition is associative).
You can re-use your Reducer as a Combiner for WordCount because addition is both commutative and
associative. A good example of where you cannot do this is when your Reducer is calculating an average,
because:
average(3, 6, 8, 5, 4) = 5.2
While
average( average(3, 6, 8), average (5, 4) ) = average (5.67, 4.5) = 5.085

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-20
The Combiner
▪ Combiners run as
part of the Map
phase
▪ Output from the
Combiners is passed
to the Reducers

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-21
WordCount Revisited

Recap: the Mapper is parsing a line of text into individual words and emitting each word along with a literal
value 1. The data from each mapper is partitioned, shuffled and sorted and passed to reducers.
The Reducer is iterating over each value and summing them all up.
On the next slide, we’ll recap what the data looks like as it flows through the Mapper and the Reducer.
After that, we’ll see how a Combiner can help.
NOTE: Combiners are discussed in TDG 3e on pages 33-36 (TDG 2e, 30-32).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-22
WordCount With Combiner

You should emphasize that the Combiner runs locally on the Mapper, therefore reducing the amount of
data the Reducer needs to transfer (because it’s been “collapsed” to a more concise representation of, for
example, 4 instead of [1, 1, 1, 1]).
Data locality is for Mappers only – it doesn’t apply to Reducers. Since every Mapper can potentially
generate records with any key, every Reducer must generally pull data from every Mapper.
Combiners reduce not only network traffic, but also disk I/O.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-23
Writing a Combiner
▪ The Combiner uses the same signature as the Reducer
─ Takes in a key and a list of values
─ Outputs zero or more (key, value) pairs
─ The actual method called is the reduce method in the class

reduce(inter_key, [v1, v2, …]) →


(result_key, result_value)

There is no Combiner interface. A Combiner simply uses the Reducer interface (and thus, does its work in
the reduce method defined in that interface).
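For reference, the course’s SumReducer has exactly this shape; a minimal version (written here as a sketch,
not copied from the exercise code) looks like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Because addition is commutative and associative, this class can be set as
// both the Reducer and the Combiner for the WordCount job.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}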

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-24
Combiners and Reducers
▪ Some Reducers may be used as Combiners
─ If operation is associative and commutative, e.g., SumReducer

─ Some Reducers cannot be used as a Combiner, e.g., AverageReducer

A reducer may be used as a combiner if the operation being performed is associative and commutative.
Commutative means that “order doesn’t matter” (for example, 3 + 6 yields the same result as 6 + 3 because
addition is commutative. Conversely, division is not commutative, so 6 / 3 yields a different result than 3 /
6).
Associative means that “grouping doesn’t matter” (for example, (3 + 6 + 8) + (5 + 4) yields the same result
as (3 + 6) + (8 + 5 + 4) because addition is associative).
You can re-use your Reducer as a Combiner for WordCount because addition is both commutative and
associative. A good example of where you cannot do this is when your Reducer is calculating an average,
because:
average(3, 6, 8, 5, 4) = 5.2
While
average( average(3, 6, 8), average (5, 4) ) = average (5.67, 4.5) = 5.085

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-25
Specifying a Combiner
▪ Specify the Combiner class to be used in your MapReduce code in the driver
─ Use the setCombinerClass method, e.g.:

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setCombinerClass(SumReducer.class);

▪ Input and output data types for the Combiner and the Reducer for a job must
be identical
▪ VERY IMPORTANT: The Combiner may run zero times, once, or more than once
on the output from any given Mapper
─ Do not put code in the Combiner which could influence your results if it runs
more than once

There is no Combiner interface. A Combiner simply uses the Reducer interface (and thus, does its work in
the reduce method defined in that interface).
In case it’s not clear, this slide is saying “not only can your Combiner use the same logic as your Reducer,
your Combiner can literally use the same class as your Reducer. In other words (provided the calculation in
your Reducer is both associative and commutative), your Driver might look like this:
conf.setMapperClass(WordMapper.class);
conf.setReducerClass(SumReducer.class);
conf.setCombinerClass(SumReducer.class); // same exact class used for both Combiner and Reducer here

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-26
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-27
Optional Hands-On Exercise: Using a Combiner
▪ In this Hands-On Exercise, you will practice using a Combiner
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-28
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-29
Accessing HDFS Programmatically
▪ In addition to using the command-line shell, you can access HDFS
programmatically
─ Useful if your code needs to read or write ‘side data’ in addition to the
standard MapReduce inputs and outputs
─ Or for programs outside of Hadoop which need to read the results of
MapReduce jobs
▪ Beware: HDFS is not a general-purpose filesystem!
─ Files cannot be modified once they have been written, for example
▪ Hadoop provides the FileSystem abstract base class
─ Provides an API to generic file systems
─ Could be HDFS
─ Could be your local file system
─ Could even be, for example, Amazon S3

The need to access HDFS directly is somewhat rare in practice, but might be useful if you need to integrate
with some legacy system. Generally, using the “hadoop fs” command, or FuseDFS, or Hoop, or Sqoop, or
Flume is a better approach. Still, it’s helpful to know this low-level access is possible and to have an idea of
how it works.
A table describing available filesystems in Hadoop is on pages 52-53 of TDG 3e (TDG 2e, 48).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-30
The FileSystem API (1)
▪ In order to use the FileSystem API, retrieve an instance of it

Configuration conf = new Configuration();


FileSystem fs = FileSystem.get(conf);

▪ The conf object has read in the Hadoop configuration files, and therefore
knows the address of the NameNode
▪ A file in HDFS is represented by a Path object

Path p = new Path("/path/to/my/file");

Actually, the Path object can represent either a file or a directory in HDFS (the FileStatus object
corresponding to that path has an isDir() method that can be used to tell them apart, if needed).
Like UNIX, files are not required to have a file extension (hence the file is this example is just ‘file’ rather
than ‘file.txt’), although they can have a file extension if desired. HDFS paths use UNIX-style addressing
conventions (i.e. forward slashes instead of backslashes).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-31
The FileSystem API (2)
▪ Some useful API methods:
─ FSDataOutputStream create(...)
─ Extends java.io.DataOutputStream
─ Provides methods for writing primitives, raw bytes etc
─ FSDataInputStream open(...)
─ Extends java.io.DataInputStream
─ Provides methods for reading primitives, raw bytes etc
─ boolean delete(...)
─ boolean mkdirs(...)
─ void copyFromLocalFile(...)
─ void copyToLocalFile(...)
─ FileStatus[] listStatus(...)

The use of OutputStreams for writing to files and InputStreams for reading from files should be familiar
to any Java programmer. As with java.io.File, the delete and mkdirs methods return a boolean to denote
success. Unlike with Java, the delete method accepts a boolean argument to denote whether the delete
should be recursive (i.e. delete all files and subdirectories of the directory referenced in the delete
operation). Java programmers will find this a welcome surprise, because implementing recursive deletion
for directories in Java is tedious.
The listStatus method returns an array of FileStatus objects. If called with a path that represents a file, the
array will have one element (i.e. for just that file). If called with a path that represents a directory, there will
be one FileStatus object in the array for each item in that directory. The FileStatus object gives access to
metadata such as user/group ownership, permissions, file size, replication factor and timestamps.
Coverage of the HDFS API can be found in TDG 3e on pages 55-67 (TDG 2e, 51-62).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-32
The FileSystem API: Directory Listing
▪ Get a directory listing:

Path p = new Path("/my/path");

Configuration conf = new Configuration();


FileSystem fs = FileSystem.get(conf);
FileStatus[] fileStats = fs.listStatus(p);

for (int i = 0; i < fileStats.length; i++) {


Path f = fileStats[i].getPath();

// do something interesting
}

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-33
The FileSystem API: Writing Data
▪ Write data to a file

Configuration conf = new Configuration();


FileSystem fs = FileSystem.get(conf);

Path p = new Path("/my/path/foo");

FSDataOutputStream out = fs.create(p, false);

// write some raw bytes


out.write(getBytes());

// write an int
out.writeInt(getInt());

...

out.close();

The boolean argument to fs.create is specifying whether or not to overwrite an existing file.
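Reading is symmetrical: a minimal sketch (not shown on the slide) that opens an HDFS file and reads it line
by line might look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: read a text file from HDFS one line at a time.
public class ReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/my/path/foo");
    FSDataInputStream in = fs.open(p);
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      reader.close();   // also closes the underlying HDFS stream
    }
  }
}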

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-34
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-35
The Distributed Cache: Motivation
▪ A common requirement is for a Mapper or Reducer to need access to some
‘side data’
─ Lookup tables
─ Dictionaries
─ Standard configuration values
▪ One option: read directly from HDFS in the setup method
─ Using the API seen in the previous section
─ Works, but is not scalable
▪ The Distributed Cache provides an API to push data to all slave nodes
─ Transfer happens behind the scenes before any task is executed
─ Data is only transferred once to each node, rather than once per task
─ Note: Distributed Cache is read-only
─ Files in the Distributed Cache are automatically deleted from slave nodes
when the job finishes

Using the DistributedCache or using Configuration parameters (like conf.setInt(“param”, 5) as discussed earlier) are better approaches than using the HDFS API for reading side data.
Reading data from HDFS in every Mapper is not scalable because the data will be replicated to three
machines (by default) while your Mapper might be running on hundreds of nodes. Thus, it will require many
network transfers to read this data (i.e. not just once per machine, but once per InputSplit). Conversely,
using the DistributedCache will cause the TaskTracker to copy the data locally before it starts the Map task
so this transfer happens just once per machine.
NOTE: DistributedCache can also be used for Reducers, but the scalability limitations of reading side data
are more apparent (and thus more easily described) with Mappers.
This topic is discussed in greater detail on pages 289-295 of TDG 3e (TDG 2e, 253-257).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-36
Using the Distributed Cache: The Difficult Way
▪ Place the files into HDFS
▪ Configure the Distributed Cache in your driver code

Configuration conf = new Configuration();


DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"),conf);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"),conf);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz",conf));

─ .jar files added with addFileToClassPath will be added to your


Mapper or Reducer’s classpath
─ Files added with addCacheArchive will automatically be
dearchived/decompressed

The main point here is that Distributed Cache can handle plain files (e.g. the .dat file) as well as archive
files in various formats, including ZIP, JAR (“Java Archive”), UNIX “tar” (“tape archive”) and UNIX “tar gzip”
files (tar files which have been compressed using the gzip command). The block in blue illustrates the code
you’d call (i.e. in your Driver) to copy files of various types into the distributed cache. Note that the method
name varies depending on what type of file you’re copying and how it should be handled (plain files use
‘addCacheFile’, JAR files use ‘addFileToClassPath’ and other archive files use ‘addCacheArchive’).
Note that the calls here differ from the examples in TDG 3e and from the Javadoc for the Job class. If you
are using new API with MR 1 (which we are in this version of this course), you cannot call methods like
addCacheFile on a Job object. This is the single example in this entire course where new API is not
supported and we are forced to show a deprecated API.
NOTE: JAR files are a special case – they generally represent code rather than data, so they get added to
the classpath for your job. If your Mapper or Reducer is dependent on some external JAR library at runtime
(such as a database driver or numerical analysis package), you can use this to make it available when
your job runs. However, the -libjars command-line option described on the next screen is a much easier
equivalent to hardcoding this in your Driver code.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-37
Using the DistributedCache: The Easy Way
▪ If you are using ToolRunner, you can add files to the Distributed Cache directly
from the command line when you run the job
─ No need to copy the files to HDFS first
▪ Use the -files option to add files

hadoop jar myjar.jar MyDriver -files file1,file2,file3,...

▪ The -archives flag adds archived files, and automatically unarchives them
on the destination machines
▪ The -libjars flag adds jar files to the classpath

This is yet another good reason to use ToolRunner…

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-38
Accessing Files in the Distributed Cache
▪ Files added to the Distributed Cache are made available in your task’s local
working directory
─ Access them from your Mapper or Reducer the way you would read any
ordinary local file

File f = new File("file_name_here");

NOTE: This file is a normal Java file (java.io.File), not anything Hadoop-specific. You’d access it just as you
would any other file in normal Java code.
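For example, a Mapper might load a lookup file that was distributed with -files in its setup method. In the
sketch below, the file name lookup.dat and its one-entry-per-line format are assumptions made for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: assumes the job was submitted with -files lookup.dat, so the file
// appears in the task's local working directory under that name.
public class LookupMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private Set<String> lookup = new HashSet<String>();

  @Override
  public void setup(Context context) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        lookup.add(line.trim());   // one lookup entry per line (assumed format)
      }
    } finally {
      reader.close();
    }
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit only the records whose value appears in the lookup set
    if (lookup.contains(value.toString())) {
      context.write(value, new LongWritable(1));
    }
  }
}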

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-39
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-40
Reusable Classes for the New API
▪ The org.apache.hadoop.mapreduce.lib.*/* packages contain a
library of Mappers, Reducers, and Partitioners supporting the new API
▪ Example classes:
─ InverseMapper – Swaps keys and values
─ RegexMapper – Extracts text based on a regular expression
─ IntSumReducer, LongSumReducer – Add up all values for a key
─ TotalOrderPartitioner – Reads a previously-created partition file
and partitions based on the data from that file
─ Sample the data first to create the partition file
─ Allows you to partition your data into n partitions without hard-coding
the partitioning information
▪ Refer to the Javadoc for classes available in your version of CDH
─ Available classes vary greatly from version to version
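As one illustration, InverseMapper can be combined with a map-only configuration to swap keys and values
without writing any Mapper code. The sketch below is illustrative only; the class name and the use of
KeyValueTextInputFormat are assumptions, not part of the course exercises:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: a map-only job that swaps tab-separated keys and values using the
// library InverseMapper, with no user-written Mapper class at all.
public class SwapKeysAndValues {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration());
    job.setJarByClass(SwapKeysAndValues.class);
    job.setJobName("Swap Keys and Values");

    job.setInputFormatClass(KeyValueTextInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(InverseMapper.class);
    job.setNumReduceTasks(0);               // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}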

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-41
Chapter Topics
Delving Deeper into the Hadoop API

▪ Using the ToolRunner Class


▪ Setting Up and Tearing Down Mappers and Reducers
▪ Hands-On Exercise: Using ToolRunner and Passing Parameters
▪ Decreasing the Amount of Intermediate Data with Combiners
▪ Optional Hands-On Exercise: Using a Combiner
▪ Accessing HDFS Programmatically
▪ Using the Distributed Cache
▪ Using the Hadoop API’s Library of Mappers, Reducers and Partitioners
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-42
Key Points
▪ Use the ToolRunner class to build drivers
─ Parses job options and configuration variables automatically
▪ Override Mapper and Reducer setup and cleanup methods
─ Set up and tear down, e.g. reading configuration parameters
▪ Combiners are ‘mini-reducers’
─ Run locally on Mapper output to reduce data sent to Reducers
▪ The FileSystem API lets you read and write HDFS files programmatically
▪ The Distributed Cache lets you copy local files to worker nodes
─ Mappers and Reducers can access directly as regular files
▪ Hadoop includes a library of predefined Mappers, Reducers, and Partitioners

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-43
Practical Development Tips and
Techniques
Chapter 10

Chapter Goal
This chapter presents practical tips and techniques for developing, debugging, and testing MapReduce code, including LocalJobRunner, logging, counters, object reuse, and Map-only jobs.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-2
Practical Development Tips and Techniques
In this chapter, you will learn
▪ Strategies for debugging MapReduce code
▪ How to test MapReduce code locally using LocalJobRunner
▪ How to write and view log files
▪ How to retrieve job information with counters
▪ Why reusing objects is a best practice
▪ How to create Map-only MapReduce jobs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-3
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-4
Introduction to Debugging
▪ Debugging MapReduce code is difficult!
─ Each instance of a Mapper runs as a separate task
─ Often on a different machine
─ Difficult to attach a debugger to the process
─ Difficult to catch ‘edge cases’
▪ Very large volumes of data mean that unexpected input is likely to appear
─ Code which expects all data to be well-formed is likely to fail

Debugging in distributed computing is generally difficult, but it’s even harder with Hadoop. You cannot
generally predict which machine in the cluster is going to process a given piece of data, so you cannot
attach a debugger to it. Even if you examine the logs to determine which machine processed it earlier, there
is no guarantee that machine will process it when the job runs again. You may not have the access needed to
debug on the cluster anyway (e.g. a firewall may prevent you from connecting to the debugger port from
your workstation).
Data is often corrupted (e.g. because of bad disks or because binary data is transferred via FTP in ASCII
mode). It's also often sloppy – not all data fits neatly into a predefined format, and sometimes people make
mistakes when entering it. Your code needs to be flexible enough to handle bad data (for example, by identifying
non-conforming data and skipping it).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-5
Common-Sense Debugging Tips
▪ Code defensively
─ Ensure that input data is in the expected format
─ Expect things to go wrong
─ Catch exceptions
▪ Start small, build incrementally
▪ Make as much of your code as possible Hadoop-agnostic
─ Makes it easier to test
▪ Write unit tests
▪ Test locally whenever possible
─ With small amounts of data
▪ Then test in pseudo-distributed mode
▪ Finally, test on the cluster

Much of this is good advice for programming in general, but even more relevant when dealing with
Hadoop.
The point about starting small and building incrementally is especially important with processing data. You
should not write a MapReduce job and test it first on 50GB of data – if you do, you may find (hours or days
later, once it has finished) that there’s a small bug. It’s better to test on a small subset of that data (perhaps
50 MB) first, then try it at scale when you are sure it works as expected.
The point about making your code Hadoop-agnostic is that very little of your code should be tied to
Hadoop, thus you can test it easily without Hadoop dependencies. For example, if your mapper is going to
parse individual IP addresses from lines in Web server log files, you could write a utility method that does
this (e.g. with a method signature like “public String parseAddress(String logLine)”), as you can easily test
this with JUnit. Your mapper would simply take the value it was passed (i.e. the line from the log file) and
invoke your utility method to parse the IP address, thereby separating the actual parsing technique from
your mapper and making it easier to test.
One reason to use local job runner mode in development is that it gives you a faster turnaround cycle (you
can test things more quickly that way, since you are dealing with local data and it's easy to run jobs directly
from the IDE). However, this won't catch certain types of errors (such as setting static global values, which
don't transcend JVMs, as described earlier). Therefore, before submitting the job to the real cluster, you
should submit it in pseudo-distributed mode. Running it both ways will help you to isolate a problem with
your code from a problem with your cluster.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-6
Testing Strategies
▪ When testing in pseudo-distributed mode, ensure that you are testing with a
similar environment to that on the real cluster
─ Same amount of RAM allocated to the task JVMs
─ Same version of Hadoop
─ Same version of Java
─ Same versions of third-party libraries

Consistency is key – set up your environment to match the real cluster in every way possible. This means,
for example, not just that both are running “Oracle JDK 6” but even down to details like “Both are running
64-bit Oracle JDK 1.6.0_27-b07”
Using virtual machines is a good way to maintain a close simulation of OS, Hadoop settings, Java versions,
etc. as on your cluster. These VMs can easily be copied and shared with other members of your team.
Although running production clusters in virtual machines is not recommended, it’s fine for pseudo-
distributed environments.
As an aside, Hadoop is extremely taxing on JVMs (way more than most Java programs are), so some
versions of Java are known to work well with Hadoop and some are known to work poorly with Hadoop.
Information on which Java versions to pick based on the experience of others can be found on the Hadoop
Wiki (http://wiki.apache.org/hadoop/HadoopJavaVersions).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-7
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-8
Testing Locally (1)
▪ Hadoop can run MapReduce in a single, local process
─ Does not require any Hadoop daemons to be running
─ Uses the local filesystem instead of HDFS
─ Known as LocalJobRunner mode
▪ This is a very useful way of quickly testing incremental changes to code

Recap: There are three modes of operation: local job runner mode (1 java process on one machine),
pseudo-distributed (many java processes on one machine), and fully-distributed (many java processes
across many machines).
If you want, you can show students how to determine a client’s operation mode. First, show the hadoop
classpath command, and explain that this classpath is set when the hadoop command runs. The
classpath is built dynamically by the /usr/lib/hadoop/libexec/hadoop-config.sh script,
which is invoked when the hadoop command runs.
Notice that the first component of the classpath is the Hadoop configuration directory. The default
mapreduce.jobtracker.address and fs.defaultFS (or mapred.job.tracker and
fs.default.name) configuration fields are stored there, in the mapred-site.xml and core-
site.xml files, respectively. The values of these two fields determine the operation mode. If they
reference the local machine and port numbers (defaults are 8020 and 8021), Hadoop is configured for
pseudo-distributed mode. If they reference a remote machine, Hadoop is configured to run on a cluster.
If they are not specified, or if they reference the local host and file system, Hadoop is configured for
LocalJobRunner mode.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-9
Testing Locally (2)
▪ To run in LocalJobRunner mode, add the following lines to the driver code:

Configuration conf = new Configuration();


conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///");

─ Or set these options on the command line if your driver uses ToolRunner
-fs is equivalent to -D fs.default.name
-jt is equivalent to -D mapred.job.tracker
─ e.g.

$ hadoop jar myjar.jar MyDriver -fs=file:/// -jt=local \


indir outdir

By setting the configuration values as shown in the sample code on the slide, you can override the default
settings for these configuration values. Because students’ VMs are configured for pseudo-distributed mode,
these overrides are necessary if students want to run the hadoop jar command in LocalJobRunner
mode. (More on how this works with the Eclipse deployment on students' systems on the next slide.)
Note that LocalJobRunner mode is the default mode in which Hadoop clients run Hadoop programs. If the
mapreduce.jobtracker.address and fs.defaultFS configuration values were not set on the
VMs, the hadoop jar command would run programs in LocalJobRunner mode.
Note that you can use the -jt and -fs flags on the command line to set the file system and job tracker
properties (i.e. instead of using -D as described here).
DEPRECATED CONFIGURATION OPTIONS: CDH4 uses MR1…therefore it uses old/deprecated configuration
names, contrary to what prior versions of this class said. This applies only to MapReduce configuration
settings (e.g. mapred.reduce.tasks works, mapreduce.job.reduces does not). HDFS configuration settings
work either way (e.g. dfs.block.size and dfs.blocksize both work)
For a list of the properties deprecated in CDH 4, refer to
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
or to Hadoop Operations by Eric Sammer, Appendix A.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-10
Testing Locally (3)
▪ Some limitations of LocalJobRunner mode:
─ Distributed Cache does not work
─ The job can only specify a single Reducer
─ Some ‘beginner’ mistakes may not be caught
─ For example, attempting to share data between Mappers will work,
because the code is running in a single JVM

A beginner mistake that wouldn’t be caught in local job runner mode would include trying to set a static
value, because as discussed earlier it would be set and accessed in the same JVM (unlike in a distributed
mode)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-11
LocalJobRunner Mode in Eclipse (1)
▪ Eclipse on the course VM runs Hadoop code in LocalJobRunner mode from
within the IDE
─ This is Hadoop’s default behavior when no configuration is provided
▪ This allows rapid development iterations
─ ‘Agile programming’

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-12
LocalJobRunner Mode in Eclipse (2)
▪ Specify a Run Configuration

You should demo how to run a job in LocalJobRunner mode in Eclipse here.
Show normal and debug mode (breakpoints, values of variables, etc.).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-13
LocalJobRunner Mode in Eclipse (3)
▪ Select Java Application, then select the New button

▪ Verify that the Project and Main Class fields are pre-filled correctly

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-14
LocalJobRunner Mode in Eclipse (4)
▪ Specify values in the Arguments tab
─ Local input and output files
─ Any configuration options needed when your job runs

▪ Define breakpoints if desired


▪ Execute the application in run mode or debug mode

Note that these input and output folders are local, not HDFS folders as when running in regular mode.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-15
LocalJobRunner Mode in Eclipse (5)
▪ Review output in the Eclipse console window

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-16
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-17
Hands-On Exercise: Testing with LocalJobRunner
▪ In this Hands-On Exercise you will run a job using LocalJobRunner both on the
command line and in Eclipse
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-18
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-19
Before Logging: stdout and stderr
▪ Tried-and-true debugging technique: write to stdout or stderr
▪ If running in LocalJobRunner mode, you will see the results of
System.err.println()
▪ If running on a cluster, that output will not appear on your console
─ Output is visible via Hadoop’s Web UI

In LocalJobRunner mode, you will see not only System.err.println (standard error), but also
System.out.println (standard output) printed to your console.
Discussion of Hadoop's Web UIs is coming up next…

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-20
Aside: The Hadoop Web UI
▪ All Hadoop daemons contain a Web server
─ Exposes information on a well-known port
▪ Most important for developers is the JobTracker Web UI
─ http://<job_tracker_address>:50030/
─ http://localhost:50030/ if running in pseudo-distributed mode
▪ Also useful: the NameNode Web UI
─ http://<name_node_address>:50070/

NOTE: Be advised that the VM is basically a guest operating system running on top of your normal
operating system. Therefore, you cannot type “http://localhost:50030/” in the browser (e.g.
Internet Explorer) on your normal operating system (e.g. Windows XP) and expect to reach the JobTracker
in your VM. You need to launch the browser inside the VM and type the URL in its address bar.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-21
Aside: The Hadoop Web UI (cont’d)
▪ Your instructor will now demonstrate the JobTracker UI

Following the “All” link under logs can help you to diagnose why jobs fail (and what data they were
processing when that failure occurred).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-22
Logging: Better Than Printing
▪ println statements rapidly become awkward
─ Turning them on and off in your code is tedious, and leads to
errors
▪ Logging provides much finer-grained control over:
─ What gets logged
─ When something gets logged
─ How something is logged

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-23
Logging With log4j
▪ Hadoop uses log4j to generate all its log files
▪ Your Mappers and Reducers can also use log4j
─ All the initialization is handled for you by Hadoop
▪ Add the log4j-<version>.jar file from your CDH distribution to your
classpath when you reference the log4j classes

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

class FooMapper extends Mapper {

private static final Logger LOGGER =
Logger.getLogger(FooMapper.class.getName());
...
}

Hadoop mappers and reducers use the Log4J framework for logging.
You can use this block of code in your own Mapper and Reducer classes (or Partitioner, Combiner, etc.). The
only thing you will need to change is the name of the logger (i.e. from FooMapper to whatever your class is
called). The rest of this code can be seen as boilerplate.
Note that the parameter passed to getLogger is a string, which is the name you assign to that logger. A
common pattern is to create a new logger for each class by passing the class name, as shown in the example
above. (For convenience, there is also a version of getLogger that takes a Class object directly, in which case
the name of that class is used.) However, it is not required to have the name of the logger be a class name. In
some cases, you want a whole set of related classes to use the same Logger settings. For example, if you have
a set of classes related to customer orders, you might name the logger "myapp.orders" and have all related
classes use getLogger("myapp.orders").
A Hadoop driver, like any other Java application, can also use the Log4J logging framework. If you insert
Log4J logging statements into driver code, the logging framework looks for Log4J properties in the following
order:

1. Set dynamically in the driver code


2. Defined in a file specified by the -Dlog4j.configuration argument on the command line
3. In a log4j.properties file in the classpath
Unlike Mappers and Reducers, drivers do not use the Log4J configuration in the /etc/hadoop/conf/
log4j.properties file (unless that file is specified using the -Dlog4j.configuration argument
or placed on the classpath).
Note that there is a log4j.properties file in the hadoop-core.jar file, and that this file is in
$HADOOP_CLASSPATH. So the configuration in this file ends up being the default Log4J configuration for
drivers unless explicitly overridden. With this configuration, look for driver logger output on standard
output.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-24
Logging With log4j (cont’d)
▪ Simply send strings to loggers tagged with severity levels:

LOGGER.trace("message");
LOGGER.debug("message");
LOGGER.info("message");
LOGGER.warn("message");
LOGGER.error("message");

▪ Beware expensive operations like concatenation


─ To avoid performance penalty, make it conditional like this:

if (LOGGER.isDebugEnabled()) {
LOGGER.debug("Account info:" + acct.getReport());
}

The least significant level shown here is “trace” and the most significant is “error”
The essential point of logging is that you log a message at some level and this system is configured to
handle messages based on some threshold. For example, if you call log.info(“my message”) and the system
is configured to write messages at ‘debug’ then your message will be written. Conversely, if the system
was configured to write messages at ‘warn’ then the message would be discarded. Therefore, logging
decouples the process of writing messages from the process of displaying messages. A production system
will likely be configured to log at the ‘info’ or ‘warn’ levels most of the time, but you can change this to
‘debug’ (for example) when actively trying to track down a problem as this will make more messages visible
and therefore provide additional insight.
String concatenation has historically been an expensive operation in Java because Strings are immutable
and thus concatenation can imply object creation (programmers who don’t know Java may find this
surprising). Although modern compilers optimize for this in most cases, it can still be a concern. The advice
given here is for the general case of expensive operations; string concatenation is one case, but this slide
actually shows two expensive operations. The other is the acct.getReport() method call, based on the
reasonable assumption that this method could take a while to complete.
When calling log methods that perform expensive operations, you should wrap them in a conditional block
that checks whether the level you’re logging at is currently enabled. Failure to do so is a common source
of performance problems, because the log message is evaluated (and therefore the expensive operations
take place) regardless of whether the log message is ultimately discarded based on the configured log level.
Putting them in an “if” statement that checks the current log level, as shown here, avoids this potential
problem. This is even more important when your log statement is called in a loop.
NOTE: For background on String concatenation in Java (http://www.javapractices.com/topic/
TopicAction.do?Id=4)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-25
log4j Configuration
▪ Node-wide configuration for log4j is stored in /etc/hadoop/conf/
log4j.properties
▪ Override settings for your application in your own log4j.properties
─ Can change global log settings with hadoop.root.log property
─ Can override log level on a per-class basis, e.g.

log4j.logger.org.apache.hadoop.mapred.JobTracker=WARN

log4j.logger.com.mycompany.myproject.FooMapper=DEBUG

▪ Or set the level programmatically:

LOGGER.setLevel(Level.WARN);

This slide illustrates that you can configure logging for your own classes or for Hadoop's classes
independently. In fact, you can also change logging globally, per-package or per-class (see Log4J
documentation for details on configuration). All of this is done by editing a configuration file
(log4j.properties).
In the upper blue box, the first line demonstrates that you can set the log level for one of Hadoop’s classes
(the JobTracker in this case) while the line below it shows an example using a class you’ve written (in this
case, a Mapper class).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-26
Setting Logging Levels for a Job
▪ You can tell Hadoop to set logging levels for a job using configuration
properties
─ mapred.map.child.log.level
─ mapred.reduce.child.log.level
▪ Examples
─ Set the logging level to DEBUG for the Mapper

$ hadoop jar myjob.jar MyDriver \


–Dmapred.map.child.log.level=DEBUG indir outdir

─ Set the logging level to WARN for the Reducer

$ hadoop jar myjob.jar MyDriver \


–Dmapred.reduce.child.log.level=WARN indir outdir

Setting command line properties like this works if you are using ToolRunner.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-27
Where Are Log Files Stored?
▪ Log files are stored on the machine where the task attempt ran
─ Location is configurable
─ By default:
/var/log/hadoop-0.20-mapreduce/userlogs/${task.id}/syslog
▪ You will often not have ssh access to a node to view its logs
─ Much easier to use the JobTracker Web UI
─ Automatically retrieves and displays the log files for you

NOTE: the name “syslog” in the path shown above has nothing to do with UNIX syslog. It’s just an
unfortunate choice of name.
ssh is “secure shell” (a program for logging into a shell on a remote machine; kind of a modern equivalent
to the old UNIX telnet program). Most system administrators will not provide ssh access to all hadoop users
of the cluster.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-28
Restricting Log Output
▪ If you suspect the input data of being faulty, you may be tempted
to log the (key, value) pairs your Mapper receives
─ Reasonable for small amounts of input data
─ Caution! If your job runs across 500GB of input data, you could
be writing up to 500GB of log files!
─ Remember to think at scale…
▪ Instead, wrap vulnerable sections of code in try {...} blocks
─ Write logs in the catch {...} block
─ This way only critical data is logged

Actually, if you are processing 500GB of data and logging it all using something like:
logger.info("Here is my key: " + key + " and here is my value " + value);
you may log more than 500GB (because you're writing "Here is my key" plus "and here is my value" in
addition to the String representation of the key and value).
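A minimal sketch of the try/catch pattern described on the slide. The tab-separated field layout and the class names here are hypothetical; the point is that only records that fail to parse get logged.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class SaleAmountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Logger LOGGER =
      Logger.getLogger(SaleAmountMapper.class.getName());

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      // Assume tab-separated records where the second field is an amount.
      String[] fields = value.toString().split("\t");
      int amount = Integer.parseInt(fields[1]);
      context.write(new Text(fields[0]), new IntWritable(amount));
    } catch (Exception e) {
      // Only malformed records are logged, keeping log volume small.
      LOGGER.warn("Skipping malformed record: " + value.toString(), e);
    }
  }
}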

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-29
Aside: Throwing Exceptions
▪ You could throw exceptions if a particular condition is met
─ For example, if illegal data is found

throw new RuntimeException("Your message here");

▪ Usually not a good idea


─ Exception causes the task to fail
─ If a task fails four times, the entire job will fail

Note that RuntimeException is a type of “unchecked exception” which means it need not be declared in
advance.
Common question: If I throw an Exception in my mapper, wouldn’t re-running that task three more times
always fail, and thus make the whole job fail?
Answer: Maybe. The data is replicated three times by default. If the data you copied into HDFS was bad,
then all replicas of that data will be bad and the whole job will fail as you describe (which is OK, because
then you can locate the bad data and deal with it). But if the data you loaded into HDFS was OK, maybe the
problem is that one of the replicas got corrupted (e.g. due to a failing disk)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-30
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-31
Optional Hands-On Exercise: Logging
▪ In this Hands-On Exercise you will change logging levels for a job and add
debug log output to a Mapper
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-32
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-33
What Are Counters? (1)
▪ Counters provide a way for Mappers or Reducers to pass aggregate values
back to the driver after the job has completed
─ Their values are also visible from the JobTracker’s Web UI
─ And are reported on the console when the job ends
▪ Very basic: just have a name and a value
─ Value can be incremented within the code
▪ Counters are collected into Groups
─ Within the group, each Counter has a name
▪ Example: A group of Counters called RecordType
─ Names: TypeA, TypeB, TypeC
─ Appropriate Counter can be incremented as each record is read in the
Mapper

Counters are helpful when you are keeping statistics on the data you are processing (for example, counting
the number of bad records or the different types of records you are processing).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-34
What Are Counters? (2)
▪ Counters can be set and incremented via the method

context.getCounter(group, name).increment(amount);

▪ Example:

context.getCounter("RecordType","A").increment(1);

Recall that the Context object is passed in to the map method for your Mapper and to the reduce method
of your reducer.
There is only a method to increment the counter value, not a corresponding method to decrement it. That
said, it appears that you can increment it by a negative value but it’s not clear whether this is intentional so
it’s best not to rely on it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-35
Retrieving Counters in the Driver Code
▪ To retrieve Counters in the Driver code after the job is complete, use code like
this in the driver:

long typeARecords =
job.getCounters().findCounter("RecordType","A").getValue();

long typeBRecords =
job.getCounters().findCounter("RecordType","B").getValue();

Although String values are shown here, it is also possible (and perhaps preferable) to use Java enum values
for the group and counter names.
In this example, the group name is “RecordType” and the counter names are “A” (for the first statement)
and “B” (for the second statement).
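As a hedged sketch of the enum approach, the Mapper below increments counters using a hypothetical RecordType enum (the startsWith checks are just placeholders for real record classification).

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordTypeMapper extends Mapper<LongWritable, Text, Text, Text> {

  // The enum's class name becomes the counter group; each constant is a counter name.
  public enum RecordType { TYPE_A, TYPE_B, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.startsWith("A")) {
      context.getCounter(RecordType.TYPE_A).increment(1);
    } else if (line.startsWith("B")) {
      context.getCounter(RecordType.TYPE_B).increment(1);
    } else {
      context.getCounter(RecordType.MALFORMED).increment(1);
    }
    // ... normal map logic would go here
  }
}

In the driver, after waitForCompletion returns, the same enum can be used for retrieval, e.g. job.getCounters().findCounter(RecordTypeMapper.RecordType.MALFORMED).getValue().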

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-36
Counters: Caution
▪ Do not rely on a counter’s value from the Web UI while a job is running
─ Due to possible speculative execution, a counter’s value could appear larger
than the actual final value
─ Modifications to counters from subsequently killed/failed tasks will be
removed from the final count

During speculative execution, the same task is running twice, so counter values may be artificially inflated.
Hadoop sorts this all out in the end, but you cannot assume the counter values shown in the Web UI are
accurate while the job is still running.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-37
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-38
Reuse of Objects is Good Practice (1)
▪ It is generally good practice to reuse objects
─ Instead of creating many new objects
▪ Example: Our original WordCount Mapper code

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1)); 1
}
}
}
}

1 Each time the map() method is called, we create a new Text object and a
new IntWritable object.

Standard Java best practice is not to create many new objects if you can avoid it – it adds to heap usage,
and can cause performance penalties. In the example here, we’re creating a new Text object and a new
IntWritable object each time around the loop.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-39
Reuse of Objects is Good Practice (2)
▪ Instead, this is better practice:

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
private final static IntWritable one = new IntWritable(1);
private Text wordObject = new Text(); 1

@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {
wordObject.set(word);
context.write(wordObject, one);
}
}
}
}

1 Create objects for the key and value outside of your map() method

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-40
Reuse of Objects is Good Practice (3)
▪ Instead, this is better practice:

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
private final static IntWritable one = new IntWritable(1);
private Text wordObject = new Text();

@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {
wordObject.set(word);
context.write(wordObject, one); 1
}
}
}
}

1 Within the map() method, populate the objects and write them out. Hadoop
will take care of serializing the data so it is perfectly safe to re-use the
objects.

What we mean by the second sentence in the box is that the data will be written to a buffer in memory,
and then to disk – so re-using the objects won’t cause problems.
It’s worth pointing out to students that although this is better practice, it turns out that in production it
really won’t speed things up very much – people have done tests which show that although it helps a little,
it’s not enough to really worry about.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-41
Object Reuse: Caution!
▪ Hadoop re-uses objects all the time
▪ For example, each time the Reducer is passed a new value, the same object is
reused
▪ This can cause subtle bugs in your code
─ For example, if you build a list of value objects in the Reducer, each element
of the list will point to the same underlying object
─ Unless you do a deep copy

This is a subtle but common bug. People create an array or list of values by adding the value object to the
end of the list each time through the Iterable. But because the same object is being reused every time,
every element of the list ends up pointing to exactly the same thing!
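A hedged sketch of both the bug and the fix. The idea of collecting values into a list (and the class names) is purely for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectingReducer extends Reducer<Text, Text, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<Text> buggy = new ArrayList<Text>();
    List<Text> safe = new ArrayList<Text>();
    for (Text value : values) {
      buggy.add(value);           // BUG: every element refers to the SAME reused object
      safe.add(new Text(value));  // OK: deep copy, safe to keep after the loop
    }
    // 'buggy' now holds N references to whatever the last value was;
    // 'safe' holds N distinct copies.
    context.write(key, new IntWritable(safe.size()));
  }
}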

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-42
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-43
Map-Only MapReduce Jobs
▪ There are many types of job where only a Mapper is needed
▪ Examples:
─ Image processing
─ File format conversion
─ Input data sampling
─ ETL

ETL = Extract, Transform and Load (the general process of taking information from one system and
importing it into another system).
An interesting example of how you might do image processing and/or file format conversion in Hadoop
is (http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-44
Creating Map-Only Jobs
▪ To create a Map-only job, set the number of Reducers to 0 in your Driver code

job.setNumReduceTasks(0);

▪ Call the Job.setOutputKeyClass and Job.setOutputValueClass


methods to specify the output types
─ Not the Job.setMapOutputKeyClass and
Job.setMapOutputValueClass methods
▪ Anything written using the Context.write method in the Mapper will be
written to HDFS
─ Rather than written as intermediate data
─ One file per Mapper will be written
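A minimal driver sketch tying these steps together (class names are hypothetical). No Mapper class is set here, so the default pass-through Mapper is used; in a real job you would call job.setMapperClass with your own Mapper and adjust the output classes to match what it emits.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(MapOnlyDriver.class);
    job.setJobName("Map-Only Example");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // No Reducers: the Mapper's output is written directly to HDFS,
    // one output file per Mapper.
    job.setNumReduceTasks(0);

    // Use setOutputKeyClass/setOutputValueClass (not the setMapOutput* methods).
    // These match the default Mapper's output with TextInputFormat.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}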

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-45
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-46
Hands-On Exercise: Using Counters and a Map-Only Job
▪ In this Hands-On Exercise you will write a Map-Only MapReduce job which
uses Counters
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-47
Chapter Topics
Practical Development Tips and
Techniques

▪ Strategies for Debugging MapReduce Code


▪ Testing MapReduce Code Locally Using LocalJobRunner
▪ Hands-On Exercise: Testing with LocalJobRunner
▪ Writing and Viewing Log Files
▪ Optional Hands-On Exercise: Logging
▪ Retrieving Job Information with Counters
▪ Reusing Objects
▪ Creating Map-only MapReduce jobs
▪ Hands-On Exercise: Using Counters and a Map-Only Job
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-48
Key Points
▪ LocalJobRunner lets you test jobs on your local machine
▪ Hadoop uses the Log4J framework for logging
▪ Reusing objects is a best practice
▪ Counters provide a way of passing numeric data back to the driver
▪ Create Map-only MapReduce jobs by setting the number of Reducers to zero

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-49
Partitioners and Reducers
Chapter 11

Chapter Goal
This chapter explains how Partitioners and Reducers work together, how to determine the optimal number of Reducers for a job, and how to write custom Partitioners.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-2
Partitioners and Reducers
In this chapter, you will learn
▪ How to write custom Partitioners
▪ How to determine how many Reducers are needed

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-3
Chapter Topics
Partitioners and Reducers

▪ How Partitioners and Reducers Work Together


▪ Determining the Optimal Number of Reducers for a Job
▪ Writing Custom Partitioners
▪ Hands-On Exercise: Writing a Partitioner
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-4
Review: The Big Picture

This is the same diagram we’ve been using, here to remind students of where partitioners are in the big
picture.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-5
What Does the Partitioner Do?
▪ The Partitioner determines which Reducer each intermediate key and its
associated values goes to

getPartition:
(inter_key, inter_value, num_reducers) → partition

The default number of Reducers is 1, so in this case no Partitioner is used (all keys go to this one Reducer).
If there are multiple Reducers, the Partitioner determines to which Reducer a given key should be sent.
Exactly how this is done is up to the Partitioner. The default Partitioner (HashPartitioner) tries to evenly
distribute the keys across all the available Reducers; thus, if there are 5 Reducers, each should get about
20% of the keys. NOTE: This assumes the objects used as keys has a good hashCode implementation (which
most objects do).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-6
Example: WordCount with One Reducer

To understand exactly how a partitioner works, let’s take a closer look at our word count example.
In this example, we assume we have one reducer, which is the default. All the jobs we’ve run so far in class
had a single reducer. We will talk shortly about how to determine how many reducers there should be, and
set that number appropriately.
After a Map task runs on each block of data, the output of that task is sorted and stored on disk. (Actually,
it's stored to disk whenever the memory buffer fills up past a configurable threshold percentage, but
that's not necessary for this discussion.)
There's only a single reducer in this example. When all the Mapper tasks are complete, Hadoop will
merge the sorted mapper output and pass it to the Reducer.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-7
Example: WordCount with Two Reducers

In the real world, you would rarely run with a single reducer – the reducer becomes a bottleneck in your
system.
Let’s imagine here we have two reducers (an unrealistically small number for our unrealistically small
example data.)
This means Hadoop needs to divide up the data between the reducers. It does this by calling a partitioner
to partition the data.
If you want to get into more detail here you can mention that the mapper doesn’t write directly to disk;
instead it writes to a memory buffer that periodically spills to temporary disk files when they get too full.
When the buffer spills to disk, the data is partitioned and sorted as it is written. When the map job is
complete, the temporary files are then merged into the results shown in the diagram: sorted partitions. The
reducers then copy the partitions they are responsible for from the various slave nodes and merge them
together by key. Each reducer then groups together all the values associated with a key and passes each key, along with its set of values, to the reduce method.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-8
The Default Partitioner
▪ The default Partitioner is the HashPartitioner
─ Uses the Java hashCode method
─ Guarantees all pairs with the same key go to the same Reducer

public class HashPartitioner<K, V> extends Partitioner<K, V> {

public int getPartition(K key, V value, int numReduceTasks) {


return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}

The default Partitioner (HashPartitioner) tries to evenly distribute the keys across all the available Reducers;
thus, if there are 5 Reducers, each should get about 20% of the keys. NOTE: This assumes the objects used
as keys has a good hashCode implementation (which most objects do).
(hashCode is a method on all Java Objects.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-9
Chapter Topics
Partitioners and Reducers

▪ How Partitioners and Reducers Work Together


▪ Determining the Optimal Number of Reducers for a Job
▪ Writing Custom Partitioners
▪ Hands-On Exercise: Writing a Partitioner
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-10
How Many Reducers Do You Need?
▪ An important consideration when creating your job is to determine the
number of Reducers specified
▪ Default is a single Reducer
▪ With a single Reducer, one task receives all keys in sorted order
─ This is sometimes advantageous if the output must be in completely sorted
order
─ Can cause significant problems if there is a large amount of intermediate
data
─ Node on which the Reducer is running may not have enough disk space to
hold all intermediate data
─ The Reducer will take a long time to run

An example of maintaining sorted order globally across all reducers was given earlier in the course when
Partitioners were introduced.
NOTE: worker nodes are configured to reserve a portion (typically 20% - 30%) of their available disk space
for storing intermediate data. If too many Mappers are feeding into too few reducers, you can produce
more data than the reducer(s) could store. That’s a problem.
At any rate, having all your mappers feeding into a single reducer (or just a few reducers) isn’t spreading the
work efficiently across the cluster.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-11
Jobs Which Require a Single Reducer
▪ If a job needs to output a file where all keys are listed in sorted order, a single
Reducer must be used
▪ Alternatively, the TotalOrderPartitioner can be used
─ Uses an externally generated file which contains information about
intermediate key distribution
─ Partitions data such that all keys which go to the first Reducer are smaller
than any which go to the second, etc
─ In this way, multiple Reducers can be used
─ Concatenating the Reducers’ output files results in a totally ordered list

Use of the TotalOrderPartitioner is described in detail on pages 274-277 of TDG 3e (TDG 2e, 237-241). It
is essentially based on sampling your keyspace so you can divide it up efficiently among several reducers,
based on the global sort order of those keys.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-12
Jobs Which Require a Fixed Number of Reducers
▪ Some jobs will require a specific number of Reducers
▪ Example: a job must output one file per day of the week
─ Key will be the weekday
─ Seven Reducers will be specified
─ A Partitioner will be written which sends one key to each Reducer

But beware that this can be a naïve approach. If processing sales data this way, business-to-business
operations (like plumbing supply warehouses) would likely have little or no data for the weekend since they
will likely be closed. Conversely, a retail store in a shopping mall will likely have far more data for a Saturday
than a Tuesday.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-13
Jobs With a Variable Number of Reducers (1)
▪ Many jobs can be run with a variable number of Reducers
▪ Developer must decide how many to specify
─ Each Reducer should get a reasonable amount of intermediate data, but not
too much
─ Chicken-and-egg problem
▪ Typical way to determine how many Reducers to specify:
─ Test the job with a relatively small test data set
─ Extrapolate to calculate the amount of intermediate data expected from the
‘real’ input data
─ Use that to calculate the number of Reducers which should be specified

The upper bound on the number of reducers is based on your cluster (machines are configured to have
a certain number of “reduce slots” based on the CPU, RAM and other performance characteristics of the
machine). The general advice is to choose something a bit less than the max number of reduce slots to
allow for speculative execution.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-14
Jobs With a Variable Number of Reducers (2)
▪ Note: you should take into account the number of Reduce slots likely to be
available on the cluster
─ If your job requires one more Reduce slot than there are available, a second
‘wave’ of Reducers will run
─ Consisting just of that single Reducer
─ Potentially doubling the amount of time spent on the Reduce phase
─ In this case, increasing the number of Reducers further may cut down the
time spent in the Reduce phase
─ Two or more waves will run, but the Reducers in each wave will have to
process less data

One factor in determining the reducer count is the reduce capacity the developer has access to (or the
number of "reduce slots" in either the cluster or the user’s pool). One technique is to make the reducer
count a multiple of this capacity. If the developer has access to N slots, but they pick N+1 reducers, the
reduce phase will go into a second "wave" which will cause that one extra reducer to potentially double the
execution time of the reduce phase. However, if the developer chooses 2N or 3N reducers, each wave takes
less time, but there are more "waves", so you don’t see a big degradation in job performance if you need a
second wave (or more waves) due to an extra reducer, a failed task, etc.
Suggestion: draw a picture on the whiteboard that shows reducers running in waves, showing cluster slot
count, reducer execution times, etc. to tie together the explanation of performance issues as they have
been explained in the last few slides:

1. 1 reducer will run very slow on an entire data set


2. Setting the number of reducers to the available slot count can maximize parallelism in one reducer wave.
However, if you have a failure then you’ll run the reduce phase of the job into a second wave, and that
will double the execution time of the reduce phase of the job.
3. Setting the number of reducers to a high number will mean many waves of shorter running reducers.
This scales nicely because you don’t have to be aware of the cluster size and you don’t have the cost of a
second wave, but it might be more inefficient for some jobs.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-15
Chapter Topics
Partitioners and Reducers

▪ How Partitioners and Reducers Work Together


▪ Determining the Optimal Number of Reducers for a Job
▪ Writing Custom Partitioners
▪ Hands-On Exercise: Writing a Partitioner
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-16
Custom Partitioners (1)
▪ Sometimes you will need to write your own Partitioner
▪ Example: your key is a custom WritableComparable which contains a pair
of values (a, b)
─ You may decide that all keys with the same value for a need to go to the
same Reducer
─ The default Partitioner is not sufficient in this case

We cover exactly this case later in the course when we discuss secondary sort. The essential point here is
that sometimes the Map and Reduce tasks both take a single object as a key, but for some algorithms you
might wish to use two objects as a key. Consider the case in which you need to calculate product sales by
region and year. You want to use both the region and the year as the key (the sale amount would be the
value), but the map function requires a single key and a single value. The typical workaround is to create a
Pair object (a composite key), which accepts both the region and year in its constructor and then use that
Pair object as the key:
Pair example1 = new Pair(new Text("Europe"), new IntWritable(2011));
Pair example2 = new Pair(new Text("North America"), new IntWritable(2011));
Pair example3 = new Pair(new Text("Asia"), new IntWritable(2011));
Pair example4 = new Pair(new Text("Europe"), new IntWritable(2012));
Unfortunately, although the first and last example have the same region, they won't necessarily go to
the same Reducer under the default HashPartitioner. This is because these two objects don't necessarily
generate the same hash code (they have different years, and this is likely a significant field). One solution
to this problem is to create a custom Partitioner which determines which Reducer to use by examining only
the region field.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-17
Custom Partitioners (2)
▪ Custom Partitioners are needed when performing a secondary sort (see later)
▪ Custom Partitioners are also useful to avoid potential performance issues
─ To avoid one Reducer having to deal with many very large lists of values
─ Example: in our word count job, we wouldn’t want a single Reducer dealing
with all the three- and four-letter words, while another only had to handle
10- and 11-letter words

One common goal of creating a custom Partitioner is to spread the load evenly, though this requires some
knowledge about the data you are processing and isn’t always attainable.
For example, if you want to generate retail sales reports based on month, it might be natural to have
twelve Reducers and create a simple Partitioner that returns a value between 0 and 11 based on the month
in which a given sale occurred. The problem is that retail sales aren’t usually distributed evenly -- stores in
American shopping malls, for example, do much more business in December than in February. Likewise, ice
cream parlors generally do more business in summer months than winter.
See TDG 3e page 254 (TDG 2e, 218) for more discussion on considerations when designing a Partitioner.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-18
Creating a Custom Partitioner
1. Create a class that extends Partitioner
2. Override the getPartition method
─ Return an int between 0 and one less than the number of Reducers
─ e.g., if there are 10 Reducers, return an int between 0 and 9

import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner<K,V> extends Partitioner<K,V> {

@Override
public int getPartition(K key, V value, int numReduceTasks) {
// determine the reducer number, between 0 and numReduceTasks-1
int reducer = 0; // ... your partitioning logic here ...
return reducer;
}
}

Common question: What happens if you exceed the allowed range for the return value in your getPartition
method (e.g. < 0 or >= number of Reducers)?
Answer: In modern versions of Hadoop, this will cause an IOException. Interestingly, it wasn’t always the
case. Prior to 0.18.0, this implementation error in your Partitioner would cause some of the intermediate
data to simply not be processed (HADOOP-2919). As always, you should write unit tests to make sure your
code works as expected!
You might want to point out that there is a limited number of Reducer slots in a cluster. The maximum
number of Reducer slots is configured by the cluster administrator. The number of slots can affect
developer decisions when partitioning data into pre-set numbers. What if someone wanted sales by week?
"Sure, no problem, just make 52 partitions! Oh, wait, you mean my job is blocking 3 other jobs now? Oops!”
The impact of there being a maximum number of Reducer slots in a cluster is also discussed in the next
chapter, in the section about determining the optimal number of Reducers for a job.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-19
Using a Custom Partitioner
▪ Specify the custom Partitioner in your driver code

job.setPartitionerClass(MyPartitioner.class);

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-20
Aside: Setting up Variables for your Partitioner (1)
▪ If you need to set up variables for use in your partitioner, it should implement
Configurable
▪ If a Hadoop object implements Configurable, its setConf() method will
be called once, when it is instantiated
▪ You can therefore set up variables in the setConf() method which your
getPartition() method will then be able to access

We are adding this discussion to help people out when they do the hands-on exercise, because in that
exercise they need to set up a HashMap for the Partitioner and it would be horrible to do that each time
getPartition() is called.
If your Partitioner class implements Configurable, then its setConf() method is called once, when it’s instantiated. So
within that we can set up variables. And because we’re implementing the interface, we also have to write a
getConf() method, of course.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-21
Aside: Setting up Variables for your Partitioner (2)

class MyPartitioner<K, V> extends Partitioner<K, V> implements Configurable {

private Configuration configuration;


// Define your own variables here

@Override
public void setConf(Configuration configuration) {
this.configuration = configuration;
// Set up your variables here
}
@Override
public Configuration getConf() {
return configuration;
}
public int getPartition(K key, V value, int numReduceTasks) {
// Use the variables set up in setConf() here
return 0; // ... your partitioning logic here ...
}
}
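A minimal sketch of how such a lookup map might be populated once in setConf(); the property name "partition.countries", the Text key type, and the fallback behavior are assumptions for illustration, not the exercise solution:
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CountryPartitioner extends Partitioner<Text, Text> implements Configurable {

    private Configuration configuration;
    private Map<String, Integer> countryToPartition = new HashMap<String, Integer>();

    @Override
    public void setConf(Configuration configuration) {
        this.configuration = configuration;
        // Called once at instantiation: build the lookup map here, not in getPartition()
        String[] countries = configuration.getStrings("partition.countries", new String[0]);
        for (int i = 0; i < countries.length; i++) {
            countryToPartition.put(countries[i], i);
        }
    }

    @Override
    public Configuration getConf() {
        return configuration;
    }

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        Integer partition = countryToPartition.get(key.toString());
        if (partition == null) {
            // Fall back to a hash-based partition for keys not in the map
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
        return partition % numReduceTasks;
    }
}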

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-22
Chapter Topics
Partitioners and Reducers

▪ How Partitioners and Reducers Work Together


▪ Determining the Optimal Number of Reducers for a Job
▪ Writing Custom Partitioners
▪ Hands-On Exercise: Writing a Partitioner
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-23
Hands-On Exercise: Writing a Partitioner
▪ In this Hands-On Exercise, you will write code which uses a Partitioner and
multiple Reducers
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-24
Chapter Topics
Partitioners and Reducers

▪ How Partitioners and Reducers Work Together


▪ Determining the Optimal Number of Reducers for a Job
▪ Writing Custom Partitioners
▪ Hands-On Exercise: Writing a Partitioner
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-25
Key Points
▪ Developers need to consider how many Reducers are required for a job
▪ Partitioners divide up intermediate data to pass to Reducers
▪ Write custom Partitioners for better load balancing
─ getPartition method returns an integer indicating which Reducer to
send the data to

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-26
Data Input and Output
Chapter 12

Chapter Goal
Teach students how to create custom Writable and WritableComparable types, save binary data using SequenceFiles and Avro data files, use file compression appropriately, and implement custom InputFormats and OutputFormats.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-2
Data Input and Output
In this chapter, you will learn
▪ How to create custom Writable and WritableComparable implementations
▪ How to save binary data using SequenceFile and Avro data files
▪ How to implement custom InputFormats and OutputFormats
▪ What issues to consider when using file compression

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-3
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-4
Data Types in Hadoop

Writable Defines a de/serialization protocol.


Every data type in Hadoop is a
Writable

WritableComparable Defines a sort order. All keys must be


WritableComparable

IntWritable Concrete classes for different data


LongWritable types
Text

The value used in a mapper or reducer must be Writable, because this data must be saved to disk and
may be sent between machines. Hadoop defines its own serialization mechanism and Writables are
fundamental to how it works. But a key not only must be Writable, it must also be Comparable because
keys are passed to a reducer in sorted order. The Comparable interface in Java defines a general purpose
mechanism for sorting objects, so the WritableComparable interface defined in Hadoop indicates that an
object which implements it can be both serialized/deserialized and sorted.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-5
‘Box’ Classes in Hadoop
▪ Hadoop’s built-in data types are ‘box’ classes
─ They contain a single piece of data
─ Text: String
─ IntWritable: int
─ LongWritable: long
─ FloatWritable: float
─ etc.
▪ Writable defines the wire transfer format
─ How the data is serialized and deserialized

“Box” class in this context means “wrapper” class: each common type in Java such as int, long, float,
boolean, String, etc. has a corresponding Writable (actually, WritableComparable) implementation in
Hadoop designed to hold a variable of that type.
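A small illustration of wrapping and unwrapping (assumes imports from org.apache.hadoop.io; the values are arbitrary):
IntWritable count = new IntWritable(42);   // wraps a Java int
int n = count.get();                       // unwraps back to an int

Text word = new Text("Hadoop");            // wraps a String (stored as UTF-8)
String s = word.toString();                // unwraps back to a String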

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-6
Creating a Complex Writable
▪ Example: say we want a tuple (a, b) as the value emitted from a Mapper
─ We could artificially construct it by, for example, saying

Text t = new Text(a + "," + b);


...
String[] arr = t.toString().split(",");

▪ Inelegant
▪ Problematic
─ If a or b contained commas, for example
▪ Not always practical
─ Doesn’t easily work for binary objects
▪ Solution: create your own Writable object

It is relatively common to need to pass two objects simultaneously as the key (or value), but the API doesn’t
allow for this. We can work around this by trying to stuff two items in a single object and then using that as
the key (or value). For example, if we want to use a product name and a region as the key, we could pack
these into a single string in which the values were separated by some delimiter, like this:
String product = “Chocolate chip cookies”;
String region = “North America”;
String pair = product + “,” + region; // pair = “Chocolate chip cookies, North America”
You could later split it based on the delimiter to retrieve the two values. This is kind of a hack (we may
not have a good String-based representation for certain data) and would fail if the product name already
contained that delimiter, as illustrated here:
String product = “Cookies, chocolate chip”;
String region = “North America”;
String pair = product + “,” + region; // pair = “Cookies, chocolate chip, North America”
Because splitting this on the delimiter would now give us a product name “Cookies” and a region name
“chocolate chip, North America” (assuming we were only looking for the first comma). We can better
achieve our goal by creating a Java class (a “pair” or “tuple”) that is designed to hold two objects and then
use the pair object as the key (or value).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-7
The Writable Interface

public interface Writable {


void readFields(DataInput in);
void write(DataOutput out);
}

▪ The readFields and write methods will define how your custom object
will be serialized and deserialized by Hadoop
▪ The DataInput and DataOutput classes support
─ boolean
─ byte, char (Unicode: 2 bytes)
─ double, float, int, long,
─ String (Unicode or UTF-8)
─ Line until line terminator
─ unsigned byte, short
─ byte array

This should seem quite familiar to anyone who has ever worked with Java serialization.
Unicode is a standard for representing character data which replaces the older ASCII system. Unicode
can be used to represent characters outside the Latin alphabet, such as Chinese or Arabic. Java strings
always use Unicode. UTF-8 is simply an efficient way of representing Unicode data (it stores as few bytes
as possible to represent each character, which is particularly effective for English text, since it uses
mostly characters from ASCII anyway). None of this should be new to experienced Java programmers, but
programmers from other languages might not be familiar with it.
As will be discussed momentarily, byte arrays can be used to store binary data (such as a custom object or a
photograph).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-8
A Sample Custom Writable: DateWritable

class DateWritable implements Writable {


int month, day, year;

// Constructors omitted for brevity

public void readFields(DataInput in) throws IOException {


this.month = in.readInt();
this.day = in.readInt();
this.year = in.readInt();
}

public void write(DataOutput out) throws IOException {


out.writeInt(this.month);
out.writeInt(this.day);
out.writeInt(this.year);
}
}

It is essential that fields are read in the same order they are written – failure to do so won’t necessarily
cause an exception or compiler error (it wouldn’t in this case because all three fields are of the same type),
but these mistakes can be very hard to track down later.
For the example on the slide, DateWritable objects are writable date objects.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-9
What About Binary Objects?
▪ Solution: use byte arrays
▪ Write idiom:
─ Serialize object to byte array
─ Write byte count
─ Write byte array
▪ Read idiom:
─ Read byte count
─ Create byte array of proper size
─ Read byte array
─ Deserialize object

To serialize an object to a byte array, you can use the java.io.ByteArrayOutputStream class, like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(baos);
oos.writeObject(myBinaryObject);
byte[] serializedAsByteArray = baos.toByteArray();
// now you can write it out as described above (the number of bytes is obtained via the array’s length
property)
To read it back in later, use the java.io.ByteArrayInputStream class and do essentially the reverse. All of this
is basic Java I/O and not specific to Hadoop.
The reason you are advised to write out the array length in the second step of the write process is so you’ll
know how big of an array to create when you read it back in later.
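Putting the idiom together, a minimal sketch of a Writable that carries an arbitrary Serializable payload; the class and field names are made up, and the object serialization itself is plain Java I/O, not a Hadoop API:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.hadoop.io.Writable;

public class BinaryObjectWritable implements Writable {

    private Serializable payload;  // the binary object we want to carry

    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(payload);      // serialize the object to a byte array
        oos.close();
        byte[] bytes = baos.toByteArray();
        out.writeInt(bytes.length);    // write the byte count first...
        out.write(bytes);              // ...then the byte array itself
    }

    public void readFields(DataInput in) throws IOException {
        int length = in.readInt();         // read the byte count
        byte[] bytes = new byte[length];   // create a byte array of the proper size
        in.readFully(bytes);               // read the byte array
        try {
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
            payload = (Serializable) ois.readObject();  // deserialize the object
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}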

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-10
WritableComparable
▪ WritableComparable is a sub-interface of Writable
─ Must implement compareTo, hashCode, equals methods
▪ All keys in MapReduce must be WritableComparable

The topic of how to properly implement equals and hashCode methods is deceptively complex, but
generally outside the scope of this class. The essential points to emphasize in class are that all fields in an
object which the developer considers important should be evaluated in the equals and hashCode methods
and that most IDEs (and specifically, Eclipse) can generate these method implementations correctly for you
if you simply specify which fields are important.
A thorough discussion of equals and hashCode can be found in Joshua Bloch’s excellent Effective Java book
(http://www.amazon.com/gp/product/0321356683/). This book is essential reading for Java
programmers of any experience level.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-11
Making DateWritable a WritableComparable (1)

class DateWritable implements WritableComparable<DateWritable> {


int month, day, year;

// Constructors omitted for brevity

public void readFields (DataInput in) . . .

public void write (DataOutput out) . . .

public boolean equals(Object o) {


if (o instanceof DateWritable) {
DateWritable other = (DateWritable) o;
return this.year == other.year && this.month == other.month
&& this.day == other.day;
}
return false;
}

In this slide and the next slide, we extend our previous example to make the date WritableComparable
(instead of just Writable, as before). This allows us to use a date object as a key or value, rather than simply
a value.
Explain to students that the readFields and write methods would be identical to the methods shown
in the Writable example several slides back.
The equals method should evaluate whichever fields are considered important; in this case, all three
fields (year, month, and day) are considered important and are being evaluated.
NOTE: an example of an “unimportant” field might be a field which is used to cache a result from a complex
calculation or a timestamp that tracks when such a value was last calculated.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-12
Making DateWritable a WritableComparable (2)

public int compareTo(DateWritable other) {


// Return -1 if this date is earlier
// Return 0 if dates are equal
// Return 1 if this date is later

if (this.year != other.year) {
return (this.year < other.year ? -1 : 1);
} else if (this.month != other.month) {
return (this.month < other.month ? -1 : 1);
} else if (this.day != other.day) {
return (this.day < other.day ? -1 : 1);
}
return 0;
}

public int hashCode() {


int seed = 163; // Arbitrary seed value
return this.year * seed + this.month * seed + this.day * seed;
}
}

The compareTo method is implemented so that the year is compared first, then the month, then the
day. This is standard Java programming and not specific to Hadoop. You might need to go over the ternary
operator if students are not familiar with it.
An important optimization to make in compareTo is to return a value as quickly as possible. You should
compare fields that are likely to be different first and you should make less expensive comparisons, such as
primitive fields like int or boolean, before making more expensive comparisons, like objects or arrays.
There’s no need to compare other fields once you’ve found the first differing value – just return -1 or 1 as
appropriate.
Like the equals method, the hashCode method should make use of fields that are considered
important. In this example, calculations against the year, month, and date all figure in to the hash code. The
number 163 in the example is arbitrary; there is nothing magical about it.
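If students ask, a common alternative idiom (in the Effective Java style) accumulates the fields into a running result so that, for example, (month=1, day=2) and (month=2, day=1) hash differently. A sketch, not the slide’s version:
@Override
public int hashCode() {
    int result = 17;                // arbitrary non-zero starting value
    result = 163 * result + year;   // fold in each significant field
    result = 163 * result + month;
    result = 163 * result + day;
    return result;
}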

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-13
Using Custom Types in MapReduce Jobs
▪ Use methods in Job to specify your custom key/value types
▪ For output of Mappers:

job.setMapOutputKeyClass()
job.setMapOutputValueClass()

▪ For output of Reducers:

job.setOutputKeyClass()
job.setOutputValueClass()

▪ Input types are defined by InputFormat


─ Covered later
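For example, a driver for a job using the DateWritable key from the earlier slides and an IntWritable value might set:
job.setMapOutputKeyClass(DateWritable.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(DateWritable.class);
job.setOutputValueClass(IntWritable.class);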

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-14
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-15
Hands-On Exercise: Implementing a Custom
WritableComparable
▪ In this exercise you will implement a custom WritableComparable type that
holds two Strings
─ You will test the type in a simple job that counts occurrences of first name/
last name pairs
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-16
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-17
What Are SequenceFiles?
▪ SequenceFiles are files containing binary-encoded key-value pairs
─ Work naturally with Hadoop data types
─ SequenceFiles include metadata which identifies the data types of the key
and value
▪ Actually, three file types in one
─ Uncompressed
─ Record-compressed
─ Block-compressed
▪ Often used in MapReduce
─ Especially when the output of one job will be used as the input for another
─ SequenceFileInputFormat
─ SequenceFileOutputFormat

SequenceFiles are described in TDG 3e from pages 130-137 (TDG 2e, 116-123).
This file format is a good choice when the keys and/or values in your MapReduce jobs cannot be
represented in text format (for example, object graphs, images or other binary data).
Although it is possible to read and write them using a Java API (how to read them is illustrated on the
next screen), the easier way is to simply configure the job (i.e. in your driver class) to use
SequenceFileOutputFormat to write the files and then use SequenceFileInputFormat as the input format for a
subsequent job to read them back in.
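A minimal sketch of that chaining, assuming two driver Job objects (the jobA/jobB names are made up):
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// First job writes its output as a SequenceFile
jobA.setOutputFormatClass(SequenceFileOutputFormat.class);

// A subsequent job reads that output back in directly
jobB.setInputFormatClass(SequenceFileInputFormat.class);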

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-18
Directly Accessing SequenceFiles
▪ It is possible to directly access SequenceFiles from your code:

Configuration config = new Configuration();


SequenceFile.Reader reader =
new SequenceFile.Reader(FileSystem.get(config), path, config);

Text key = (Text) reader.getKeyClass().newInstance();


IntWritable value = (IntWritable) reader.getValueClass().newInstance();

while (reader.next(key, value)) {


// do something here
}
reader.close();

This example shows how to read a sequence file. Writing one is similar (see TDG 3e pages 131-132 (TDG 2e,
117-118) for an example).
The getKeyClass / getValueClass lines are needed to create key and value objects of the correct type.
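A minimal write-side sketch for comparison (the path and key/value data are made up; imports are omitted as in the slide above):
Configuration config = new Configuration();
Path path = new Path("mydata.seq");  // hypothetical output path

SequenceFile.Writer writer = SequenceFile.createWriter(
    FileSystem.get(config), config, path, Text.class, IntWritable.class);

writer.append(new Text("aardvark"), new IntWritable(1));  // append key/value pairs
writer.close();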

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-19
Problems With SequenceFiles
▪ SequenceFiles are useful but have some potential problems
▪ They are typically accessible only via the Java API
─ Some work has been done to allow access from other languages
▪ If the definition of the key or value object changes, the file becomes
unreadable

Experienced Java programmers won’t be surprised by this slide since these are problems with Java
serialization too.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-20
An Alternative to SequenceFiles: Avro
▪ Apache Avro is a serialization format which is becoming a popular alternative
to SequenceFiles
─ Project was created by Doug Cutting, the creator of Hadoop
▪ Self-describing file format
─ The schema for the data is included in the file itself
▪ Compact file format
▪ Portable across multiple languages
─ Support for C, C++, Java, Python, Ruby and others
▪ Compatible with Hadoop
─ Via the AvroMapper and AvroReducer classes

Common question: Why is there an AvroMapper and AvroReducer? Why isn’t there just AvroInputFormat
and AvroOutputFormat?
Answer: because Avro deals with object graphs rather than key/value pairs and therefore doesn’t
fit into Hadoop’s map and reduce methods. The AvroMapper and AvroReducer classes come
from Avro, rather than Hadoop itself. An example of how to use Avro with Hadoop is here (http://
www.datasalt.com/2011/07/hadoop-avro/).
Avro is described in TDG 3e on pages 110-130 (TDG 2e, 103-116).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-21
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-22
Hadoop and Compressed Files
▪ Hadoop understands a variety of file compression formats
─ Including GZip
▪ If a compressed file is included as one of the files to be processed, Hadoop
will automatically decompress it and pass the decompressed contents to the
Mapper
─ There is no need for the developer to worry about decompressing the file
▪ However, GZip is not a ‘splittable file format’
─ A GZipped file can only be decompressed by starting at the beginning of the
file and continuing on to the end
─ You cannot start decompressing the file part of the way through it

Compression is covered on pages 83-92 of TDG 3e (77-86 of TDG 2e).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-23
Non-Splittable File Formats and Hadoop
▪ If the MapReduce framework receives a non-splittable file (such as a GZipped
file) it passes the entire file to a single Mapper
▪ This can result in one Mapper running for far longer than the others
─ It is dealing with an entire file, while the others are dealing with smaller
portions of files
─ Speculative execution could occur
─ Although this will provide no benefit
▪ Typically it is not a good idea to use GZip to compress MapReduce input files

But because a non-splittable file is passed in its entirety to a single mapper, this can create a bottleneck.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-24
Splittable Compression Formats: LZO
▪ One splittable compression format is LZO
▪ Because of licensing restrictions, LZO cannot be shipped with Hadoop
─ But it is easy to add
─ See https://github.com/cloudera/hadoop-lzo
▪ To make an LZO file splittable, you must first index the file
▪ The index file contains information about how to break the LZO file into splits
that can be decompressed
▪ Access the splittable LZO file as follows:
─ In Java MapReduce programs, use the LzoTextInputFormat class
─ In Streaming jobs, specify -inputformat com.hadoop.
mapred.DeprecatedLzoTextInputFormat on the command line

“Licensing restrictions” here means that LZO is made available by its developer (a person in Germany
named Markus Oberhumer) under the GNU General Public License (GPL). It is not a restriction that
Cloudera has put into place (i.e. it is definitely not something proprietary). Rather, the GPL is an open
source license, but it is incompatible with the Apache license under which Hadoop is distributed.
Both licenses have their advantages and disadvantages, but since they can be a divisive topic for
programmers, you should avoid discussing this in any detail. The important point is that the LZO license
doesn’t allow it to be shipped with Hadoop but it is open source and easily acquired on your own, if desired.
An LZO file must be preprocessed with an indexer to make it a splittable file. Here’s an example of a
command that indexes a file named big_file.lzo:
hadoop jar /path/to/your/hadoop-lzo.jar
com.hadoop.compression.lzo.LzoIndexer big_file.lzo

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-25
Splittable Compression for SequenceFiles and Avro Files
Using the Snappy Codec
▪ Snappy is a relatively new compression codec
─ Developed at Google
─ Very fast
▪ Snappy does not compress a SequenceFile and produce, e.g., a file with a
.snappy extension
─ Instead, it is a codec that can be used to compress data within a file
─ That data can be decompressed automatically by Hadoop (or other
programs) when the file is read
─ Works well with SequenceFiles, Avro files
▪ Snappy is now preferred over LZO

All compression algorithms are a tradeoff between space (how much smaller can you make something) and
time (how long does making it smaller take).
Snappy doesn’t compress data as thoroughly as other algorithms (such as BZip2) but it does compress
quickly. As such, it’s a good tradeoff between space and time.
For more information on Snappy, see (http://www.cloudera.com/blog/2011/09/snappy-
and-hadoop/).
Sequence files compressed with Snappy will not have a .snappy extension because Snappy is used to
compress individual blocks of data within the file rather than compressing the entire file itself [see:
http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/]. The codec used
for compression is stored within the sequence file format itself [see: http://wiki.apache.org/
hadoop/SequenceFile].
While it’s possible to have Hadoop compress text files produced as job output with Snappy, this is not
advised because the resulting files are not splittable (so they cannot be processed efficiently in subsequent jobs),
nor can they easily be decompressed from the command line. However, this is how you could invoke a job to produce
text output compressed with Snappy:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-*jar wordcount -
Dmapred.output.compress=true -
Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec /inputdir /outputdir

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-26
Compressing Output SequenceFiles With Snappy
▪ Specify output compression in the Job object
▪ Specify block or record compression
─ Block compression is recommended for the Snappy codec
▪ Set the compression codec to the Snappy codec in the Job object
▪ For example:

import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
. . .
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job,true);
FileOutputFormat.setOutputCompressorClass(job,SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
CompressionType.BLOCK);

This slide supports the “Using SequenceFiles and File Compression” lab. Students will need to add the
above statements to their driver to compress the SequenceFile.
The recommendation to use block compression for Snappy comes from the following URL: https://
ccp.cloudera.com/display/CDHDOC/Snappy+Installation. Note that Snappy is
preinstalled in CDH, so students will not need to perform the installation steps documented at this URL.
If the driver uses ToolRunner you can also set these values as command line parameters:
New API (not yet supported by CDH?)
mapreduce.output.fileoutputformat.compress
mapreduce.output.fileoutputformat.compress.codec
mapreduce.output.fileoutputformat.compress.type
mapreduce.output.fileoutputformat.outputdir
Old API
mapred.output.compress
mapred.output.compression.codec
mapred.output.compression.type
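For example, a ToolRunner-based driver could be invoked like this to produce block-compressed Snappy output using the old-API property names above (the jar name, driver class, and paths are made up):
hadoop jar myjob.jar MyDriver \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.output.compression.type=BLOCK \
  indir outdir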

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-27
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-28
Hands-On Exercise: Using Sequence Files and File
Compression
▪ In this Hands-On Exercise, you will explore reading and writing uncompressed
and compressed SequenceFiles
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-29
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-30
Review: The MapReduce Flow

In this chapter we are going to take a closer look at Input Formats and Output Formats.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-31
A Closer Look at Input Formats
▪ An Input Format has two main jobs
─ Split the input data into Input Splits
─ Create a RecordReader object for
each split

The previous diagram over-simplifies how input format works. Let’s take a closer look here.
So far we’ve been focusing on using data in HDFS files, but that’s not the only possible data source, as we
will learn later (e.g. databases). And we’ve been using the default InputFormat, which is TextInputFormat,
which splits text files into InputSplits that correspond to the HDFS blocks that make up the file. This is
common, and is very efficient because it allows the job tracker to run the Map tasks on a node that is
holding the corresponding block. But it isn’t the only approach. There can be splits that span HDFS blocks;
HDFS blocks that span splits; and input sources that aren’t HDFS files at all.
The job of an InputFormat class is to determine how (and whether) to split the input data, and to generate
a new RecordReader object to read from each split. (The InputFormat is therefore a RecordReader “factory,”
which is a common paradigm in Java programming.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-32
Record Readers
 
▪ Each InputSplit has a
RecordReader object
▪ The RecordReader parses the
data from the InputSplit into
“records”
─ Key/value pairs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-33
Most Common InputFormats
▪ Most common InputFormats:
─ TextInputFormat
─ KeyValueTextInputFormat
─ SequenceFileInputFormat
▪ Others are available
─ NLineInputFormat
─ Every n lines of an input file is treated as a separate InputSplit
─ Configure in the driver code by setting:
mapred.line.inputformat.linespermap
─ MultiFileInputFormat
─ Abstract class that manages the use of multiple files in a single task
─ You must supply a getRecordReader() implementation

A number of these input formats are described in greater detail in TDG 3e on pages 245-251 (TDG 2e,
209-215). A table showing their inheritance hierarchy can be found on page 237 of TDG 3e (TDG 2e, 201).
The first three on these slides have been discussed previously in the course, so you only need to give any
significant explanation on the last two.
CDH 4: mapreduce.input.lineinputformat.linespermap

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-34
How FileInputFormat Works
▪ All file-based InputFormats inherit from FileInputFormat
▪ FileInputFormat computes InputSplits based on the size of each file, in
bytes
─ HDFS block size is used as upper bound for InputSplit size
─ Lower bound can be specified in your driver code
─ This means that an InputSplit typically correlates to an HDFS block
─ So the number of Mappers will equal the number of HDFS blocks of input
data to be processed

This material is covered on pages 234-244 of TDG 3e (TDG 2e, 198-200).
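For example, the lower bound can be raised in the driver; a minimal sketch using the new-API FileInputFormat (the 64 MB value is arbitrary):
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Ensure no InputSplit is smaller than 64 MB
FileInputFormat.setMinInputSplitSize(job, 64 * 1024 * 1024L);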

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-35
Writing Custom InputFormats
▪ Extend FileInputFormat as a starting point
▪ Write your own custom RecordReader
▪ Override the createRecordReader method in FileInputFormat
▪ Override isSplitable if you don’t want input files to be split
─ Method is passed each file name in turn
─ Return false for non-splittable files

By subclassing FileInputFormat, you’ll save yourself a lot of work. It will take care of details like verifying the
input path for you.
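For instance, a sketch of an isSplitable override inside such a FileInputFormat subclass (the .gz check is just an illustration):
@Override
protected boolean isSplitable(JobContext context, Path filename) {
    // Return false for non-splittable files (e.g. gzipped input),
    // so each such file is processed whole by a single Mapper
    return !filename.getName().endsWith(".gz");
}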

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-36
What RecordReaders Do
▪ InputSplits are handed to the RecordReaders
─ InputSplit is specified by the path, starting position offset, length
▪ RecordReaders must:
─ Ensure each (key, value) pair is processed
─ Ensure no (key, value) pair is processed more than once
─ Handle (key, value) pairs which are split across InputSplits

See also this FAQ from the Hadoop Wiki (http://wiki.apache.org/hadoop/FAQ#How_do_Map.2BAC8-Reduce_InputSplit.27s_handle_record_boundaries_correctly.3F)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-37
Custom Input Format Example: Fixed Width Columns

Let’s look at a simple example. So far, we’ve been dealing exclusively with text input which is line oriented
– a “record” is a single line which is terminated by a newline character. This is the sort of format that
TextInputFormat (and, more specifically, its record reader LineRecordReader) expects. This is an
example of a very different format. The file contents are still text, but there are no delimiters or line
terminators. Instead, each record is fixed width – exactly 50 bytes. Within the record, each field is also a
fixed width: the first field representing the record ID is 7 bytes, the second is a last name (25 bytes), then
first name (10 bytes), and birth date (8 bytes).
We need a custom InputFormat for this, because the standard Hadoop InputFormats don’t handle
undelimited text input like this.
NOTE: The code to implement this example is in the inputformat project/example package on the student
VMs.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-38
Example: ColumnTextInputFormat (1)
//…imports omitted for brevity…
public class ColumnTextInputFormat extends FileInputFormat<Text,Text> 1
{
@Override
public RecordReader<Text,Text> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException {

RecordReader<Text,Text> recordreader =
(RecordReader<Text, Text>) new ColumnTextRecordReader();
recordreader.initialize(split, context);
return recordreader;
}
@Override
protected long computeSplitSize(long blockSize,
long minSize, long maxSize) { … }
}

1 File-based formats should extend FileInputFormat. The abstract base


class provides default setting and splitting of input files. Generic type
parameters indicate the key and value types that will be passed to the
Mapper.

The new InputFormat can extend FileInputFormat because it’s a file based input.
Remember that an InputFormat has two main jobs:
1 – create splits from the input file
2 -- generate a record reader to read from that split
In this example, we are using the default approach to file splitting, inherited from FileInputFormat. We will
talk about that more soon. For now we are simply overriding the createRecordReader method to return our
custom record reader (next slide).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-39
Example: ColumnTextInputFormat (2)
//…imports omitted for brevity…
public class ColumnTextInputFormat extends FileInputFormat<Text,Text> {
@Override 1
public RecordReader<Text,Text> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException {

RecordReader<Text,Text> recordreader =
(RecordReader<Text, Text>) new ColumnTextRecordReader();
recordreader.initialize(split, context);
return recordreader;
}
@Override
protected long computeSplitSize(long blockSize,
long minSize, long maxSize) { … }
}

1 Implement the createRecordReader method to create a RecordReader


object for each Input Split. (We will define ColumnTextRecordReader later.)

MapReduce InputFormats follow a “factory” idiom common in Java programming (http://en.wikipedia.org/wiki/Factory_method_pattern). The createRecordReader method is a
factory method to generate RecordReader objects. A RecordReader will be created for each input split,
which is why the split is passed to the RecordReader’s initializer method.
Every class that extends the abstract base class FileInputFormat must implement createRecordReader. In
most cases, the main goal of a custom Input Format is to be a factory for custom Record Readers, so this is
the most important method.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-40
Example: ColumnTextInputFormat (3)
public class ColumnTextInputFormat extends FileInputFormat<Text,Text>
{ private int recordwidth = 50;

@Override
public RecordReader<Text, Text> …

@Override
protected long computeSplitSize(long blockSize,
long minSize, long maxSize) 1 {

long defaultSize = super.computeSplitSize(blockSize,


minSize, maxSize);

if (defaultSize < recordwidth)


return recordwidth;
long splitSize = ((long)(Math.floor((double)defaultSize /
(double)recordwidth))) * recordwidth;
return splitSize;
}
}

1 Override computeSplitSize to make sure the file is split on a record


boundary.

One thing we must be conscious of in reading input records is how to deal with the fact that input splits
may cross record boundaries. Usually, the answer is to write the custom record reader to make sure that it
reads past the end of the split to read the rest of the record, which will be discussed shortly when we get to
record readers. Occasionally, as in this example, we can solve the problem a different way: by making sure
to split our file such that the break will happen between records. That isn’t usually possible, but in this case,
we know exactly how long a record is, so we can prevent record splits by splitting the file at an offset that’s
divisible by the record length (50 bytes in this example.)
By default the FileInputFormat class splits the input to be the same size as HDFS blocks. This usually makes
a lot of sense, because it makes it easy for Hadoop to optimize for data locality by running a map task on
the data node where the input splits block is. But that’s not the only way to do it, and sometimes there are
good reasons to split the file differently.
Here we try to make the split size be as close as possible to the block size to continue to have the task
scheduling/data locality advantage. There will be a small impact for the fact that some of the data for the
split will need to be read off a different block/data node.
IMPORTANT POINT: Input splits and blocks are different! A split doesn’t care where its data physically lives.
If the file is on HDFS, the record reader will simply request the file to read from it, without concern for
where it lives.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-41
Example: ColumnTextRecordReader
//…imports omitted for brevity…

public class ColumnTextRecordReader extends RecordReader<Text,Text> { 1

private Text key = null;


private Text value = null;

@Override
public Text getCurrentKey() {
return key;
}

@Override
public Text getCurrentValue() {
return value;
}
continued on next slide
1 Custom Record Readers usually extend the RecordReader abstract base
class, or a library implementation. Generic type parameters indicate key and
value types passed to the Mapper, which must match the InputFormat types.

Our record reader extends the base class RecordReader which doesn’t do much for us, so most of its
behavior we will implement ourselves.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-42
Example: ColumnTextRecordReader Getters
//…imports omitted for brevity…

public class ColumnTextRecordReader extends RecordReader<Text,Text> {


private Text key = null; 1
private Text value = null;

@Override
public Text getCurrentKey() {
return key;
}

@Override
public Text getCurrentValue() {
return value;
}
continued on next slide
1 Mappers will call getters to get the current Key and Value. The getters do
nothing but return the private variables which are set by nextKeyValue().

Our record reader extends the base class RecordReader which doesn’t do much for us, so most of its
behavior we will implement ourselves.
First, the easy stuff: a RecordReader’s main job is to produce Key, Value pairs from the input stream. Here we define
variables to hold those, and getters to retrieve them.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-43
Example: ColumnTextRecordReader Initializer (1)
@Override
public void initialize(InputSplit genericSplit, 1
TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
this.start = split.getStart(); // start reading here
this.end = start + split.getLength(); // end reading here
this.pos = start; // set current position
Configuration job = context.getConfiguration();
FileSystem fs = split.getPath().getFileSystem(job);
this.fileIn = fs.open(split.getPath());
}

@Override
public void close() throws IOException {
fileIn.close();
}
continued on next slide
1 The RecordReader has access to the whole file, but needs to read just the
part associated with its split. The InputSplit tells it which part of the file it
owns.

Next we override the initialize method. This is called before we start reading any records.
Here we look at the split we’ve been passed – this tells us what our file is (getPath()) and where we should
start reading from. Remember that a given record reader is only responsible for reading records from one
split, which is one portion of a file. We know which portion is ours by the start position and length specified
in the FileSplit.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-44
Example: ColumnTextRecordReader Initializer (2)
@Override
public void initialize(InputSplit genericSplit,
TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
this.start = split.getStart(); // start reading here
this.end = start + split.getLength(); // end reading here
this.pos = start; 1 // set current position
Configuration job = context.getConfiguration();
FileSystem fs = split.getPath().getFileSystem(job);
this.fileIn = fs.open(split.getPath());
}

@Override
public void close() throws IOException {
fileIn.close();
}
continued on next slide
1 We use the pos variable to keep track of our progress through the split.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-45
Example: ColumnTextRecordReader Initializer (3)
@Override
public void initialize(InputSplit genericSplit,
TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
this.start = split.getStart(); // start reading here
this.end = start + split.getLength(); // end reading here
this.pos = start; // set current position
Configuration job = context.getConfiguration(); 1
FileSystem fs = split.getPath().getFileSystem(job);
this.fileIn = fs.open(split.getPath());
}

@Override
public void close() throws IOException {
fileIn.close();
}
continued on next slide
1 Finally, we open an input stream from the file specified in the split.

Note that we need to query the job configuration to know what file system to use. This is usually but not
necessarily in the HDFS file system. For instance, if the job is running in LocalJobRunner mode, the file
system will be the local Unix filesystem rather than HDFS.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-46
Example: ColumnTextRecordReader Stream Closer
@Override
public void initialize(InputSplit genericSplit,
TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
this.start = split.getStart(); // start reading here
this.end = start + split.getLength(); // end reading here
this.pos = start; // set current position
Configuration job = context.getConfiguration();
FileSystem fs = split.getPath().getFileSystem(job);
this.fileIn = fs.open(split.getPath());
}

@Override
public void close() throws IOException {
fileIn.close(); 1
}
continued on next slide
1 The close method closes the file input stream we opened in initialize.
This will be called when the split has been fully read.

Note that we need to query the job configuration to know what file system to use. This is usually but not
necessarily in the HDFS file system. For instance, if the job is running in LocalJobRunner mode, the file
system will be the local Unix filesystem rather than HDFS.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-47
Example: ColumnTextRecordReader.nextKeyValue (1)
@Override
public boolean nextKeyValue() throws IOException { 1
if (pos >= end) return false; // don’t read past the split

int keywidth=7;
int lastnamewidth=25;
int firstnamewidth=10;
int datewidth=8;

byte[] keybytes = new byte[keywidth];


byte[] datebytes = new byte[datewidth];
byte[] lastnamebytes = new byte[lastnamewidth];
byte[] firstnamebytes = new byte[firstnamewidth];
continued on next slide
1 nextKeyValue reads the next key/value pair starting at the current position
within the file (if possible). It returns true if a pair was read, false if no more
pairs were found.

The nextKeyValue() method will get called by the mapper repeatedly to process each record in the input
split one by one. It returns true if it was able to read another key/value pair, or false if it reached the end of
the input split.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-48
Example: ColumnTextRecordReader.nextKeyValue (2)
@Override
public boolean nextKeyValue() throws IOException {
if (pos >= end) return false; 1 // don’t read past the split

int keywidth=7;
int lastnamewidth=25;
int firstnamewidth=10;
int datewidth=8;

byte[] keybytes = new byte[keywidth];


byte[] datebytes = new byte[datewidth];
byte[] lastnamebytes = new byte[lastnamewidth];
byte[] firstnamebytes = new byte[firstnamewidth];
continued on next slide
1 If the current position is at or past the end of the split, there are no more
records to read, so return false.

We start by checking if we are at the end of the split. We *could* keep reading, but if we do, we will be
processing a record in another split, and it will get processed twice. This is a no-no. (We will talk soon about
records that span splits, because then we would need to read the next split’s data. Bypass that for now.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-49
Example: ColumnTextRecordReader.nextKeyValue (3)
@Override
public boolean nextKeyValue() throws IOException {
if (pos >= end) return false; // don’t read past the split

int keywidth=7; 1
int lastnamewidth=25;
int firstnamewidth=10;
int datewidth=8;

byte[] keybytes = new byte[keywidth];


byte[] datebytes = new byte[datewidth];
byte[] lastnamebytes = new byte[lastnamewidth];
byte[] firstnamebytes = new byte[firstnamewidth];
continued on next slide
1 Set up byte buffers for each field in the record.

Then we create empty byte arrays to hold the four fields we are going to read out of the file. (The values
for the widths can be set a number of ways. In our example code, we hard code them for simplicity, but to
make the record reader more flexible in the real world you’d want to configure them using configuration
parameters as discussed earlier in the class.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-50
Example: ColumnTextRecordReader.nextKeyValue (4)
//…continued from previous slide…
fileIn.readFully(pos,keybytes); 1
pos = pos + keywidth;
fileIn.readFully(pos,lastnamebytes);
pos = pos + lastnamewidth;
fileIn.readFully(pos,firstnamebytes);
pos = pos + firstnamewidth;
fileIn.readFully(pos,datebytes);
pos = pos + datewidth;

key = new Text(keybytes);


String valuestring = new String(lastnamebytes).trim() + "," +
new String(firstnamebytes).trim() + "\t" +
new String(datebytes).trim();
value = new Text(valuestring);
return true;
}

1 Read exactly enough bytes from the input stream to fill the buffers. Advance
the current position pointer.

Then we read each field. The readFully method is like read, except that it throws an exception if it can’t read
the specified number of bytes. If that happens, something is wrong with the format of the file.
At the end of this sequence, the pos pointer will be positioned to begin reading the next record next time
the method is called.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-51
Example: ColumnTextRecordReader.nextKeyValue (5)
//…continued from previous slide…
fileIn.readFully(pos,keybytes);
pos = pos + keywidth;
fileIn.readFully(pos,lastnamebytes);
pos = pos + lastnamewidth;
fileIn.readFully(pos,firstnamebytes);
pos = pos + firstnamewidth;
fileIn.readFully(pos,datebytes);
pos = pos + datewidth;

key = new Text(keybytes); 1


String valuestring = new String(lastnamebytes).trim() + "," +
new String(firstnamebytes).trim() + "\t" +
new String(datebytes).trim();
value = new Text(valuestring);
return true;
}

1 Construct new Key and Value objects to hold the data just read into the byte
buffers. Return true to indicate a key/value pair was just read and is ready
to be retrieved by the getters.

Finally, now that we’ve read the fixed width fields, we convert them into values.
The key is easy, we just want to set it to a Text object containing the full 7-byte ID we read from the file.
The value is harder, because part of our goal is to output it in a specific format
(Lastname,Firstname<tab>Date).
Note that we don’t have to do this. We could have just set the value to the entire record minus the key (or
even output the offset as the key and the full record as the value, like LineRecordReader does), and then
let the Mapper parse out the data in the fields. Doing it this way lets our Mapper be agnostic about exact
column formats: if we need to process files with similar data but different column widths or a different
order of the fields, we could write a new RecordReader and leave the Mapper as-is.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-52
Example: ColumnTextRecordReader Progress Tracker

@Override
public float getProgress() throws IOException, 1
InterruptedException {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos - start) / (float)(end - start));
}
}
}

1 getProgress is used to inform the Job Tracker of how far along the current
Mapper is in its task. Return the approximate percentage complete.

The last method here is getProgress(). That allows the task tracker to keep track of how far along each mapper is
in its task. This returns the fraction of the split’s length we’ve processed so far.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-53
Example: Using a Custom Input Format

// Driver Class...

public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(MyDriver.class);
job.setJobName("Process Column File");

FileInputFormat.setInputPaths(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setInputFormatClass(ColumnTextInputFormat.class); 1

job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);

boolean success = job.waitForCompletion(true);


return (success ? 0 : 1);
}

1 The driver configures the job to use the custom InputFormat.

If this is not set, the job will use the default input format: TextInputFormat.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-54
Reading Records That Span Splits (1)
▪ In our example, it was easy to ensure that InputSplits respected record
boundaries
▪ Typically, a record may span InputSplits
─ Custom RecordReaders must handle this case
 

It is up to the RecordReader implementation to deal with this, so that such a record gets processed (and
gets processed only once).
Note that in our example, we deftly avoided this problem by computing the splits so that they aligned with
record boundaries, which was possible because of the fixed byte length of the columns. This is not usually
possible, in which case our RecordReader must handle the possibility of split records.
We don’t show the code for this because it is hairy, and because many custom RecordReader
implementations actually build on LineRecordReader, which handles this for you.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-55
Reading Records That Span Splits (2)
▪ Example: LineRecordReader
 

Here’s how LineRecordReader does it: Remember that it is given the InputSplit in terms of a filename, start
byte offset, and length of the Split.
• If you’re starting at the beginning of the file (start offset is 0), start reading records from the beginning
of the InputSplit
• Otherwise, scan forward to the first record boundary and start reading from there
• Read records until you get to the end of the last record of your InputSplit. If you need to go past the
end of your InputSplit to get to the end of the last record, that’s OK. You must read a complete record.
That takes care of _almost_ every case. The ‘edge case’ is: what happens if a record boundary occurs right
at the split boundary? LineRecordReader takes care of this by making sure that each split’s record reader
finishes reading the record that ends *after* the last byte in the split. Then each record reader (other than
the one for the first split) skips the first line, knowing that it was already read by another record reader,
regardless of whether it was a full or partial record.
Take the case where split 1 ends exactly at the end of line 100, and split 2 starts at the beginning of line
101. The record reader for split 1 reads line 101. The record reader for split 2 starts with line 102.
Note that with the default settings for FileInputFormat, each InputSplit corresponds to an HDFS block, so
a record that spans a split also spans file blocks. For a record reader to continue reading past the end
of a split means reading from an HDFS block that is likely on a different node, so this does result in some
cross-network data retrieval. This is generally a small enough amount in the context of a job not to be a
bottleneck, as discussed on the next slide.
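The following is a simplified sketch (not the actual Hadoop source) of how a line-oriented RecordReader can
implement this “skip the first line, read one record past the end” behavior. The in.readLine() helper and the
start, end, pos, key, and value fields are assumed to be set up elsewhere in the class; they are illustrative.

// Sketch: handling records that span splits, in the style of LineRecordReader
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
    throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  start = split.getStart();
  end = start + split.getLength();
  // ...open the file and seek to 'start'...
  if (start != 0) {
    // Not the first split: skip the (possibly partial) first line, because the
    // reader for the previous split will already have read it.
    start += in.readLine(new Text());
  }
  pos = start;
}

public boolean nextKeyValue() throws IOException {
  if (pos > end) {
    return false;          // we have already read one line past the split's end
  }
  int consumed = in.readLine(value);   // may read past 'end' to finish a record
  if (consumed == 0) {
    return false;                      // end of file
  }
  key.set(pos);
  pos += consumed;
  return true;
}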

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-56
Aside: Input Splits and Data Locality (1)
▪ By default, FileInputFormat creates Input Splits that correspond to HDFS
blocks
▪ The Job Tracker will attempt to run Mappers on data nodes where the
associated block is stored

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-57
Aside: Input Splits and Data Locality (2)
▪ But what about Input Formats that split the input file differently?
▪ Input Split includes a list of nodes which store the data
▪ Job Tracker will attempt to run Mappers on nodes containing the greatest
amount of the data
▪ Data from blocks on other nodes will be copied over the network

In our example, we create input splits that are multiples of 50 bytes. Blocks are usually 64 or 128
megabytes. A megabyte is not actually exactly 1M bytes…it is 1,048,576 bytes. Since this is not a multiple of
50, an input split will never exactly correspond to a block.
Another example is when a FileFormat is unsplittable. In this case, a single input split will access the data in
all the blocks that comprise the file.
This impact on the network should be considered when designing file input formats. This is why unsplittable
compression formats like gzip are not ideal.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-58
Custom OutputFormats
▪ OutputFormats work much like InputFormat classes
▪ Custom OutputFormats must provide a RecordWriter implementation
 

OutputFormats basically take key/value pairs and write them out in some format.
There are two commonly-used OutputFormats that ship with Hadoop: TextOutputFormat (which writes
plain text files) and SequenceFileOutputFormat (which writes sequence files, as described earlier). Hadoop
has another OutputFormat (which is not file-based) called NullOutputFormat. This writes no output at all,
so it’s handy in cases like the Map-only lab which increments counters but isn’t intended to produce output
(although this lab uses TextOutputFormat, so it produces one empty file per mapper since no content is
written).
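As a sketch of how a driver might use NullOutputFormat for a counters-only, map-only job (the class names
CountOnlyDriver and CountOnlyMapper are illustrative):

//…imports omitted for brevity…

public int run(String[] args) throws Exception {
  Job job = new Job(getConf());
  job.setJarByClass(CountOnlyDriver.class);
  job.setJobName("Counters Only");

  FileInputFormat.setInputPaths(job, new Path(args[0]));

  // NullOutputFormat writes nothing, so no (empty) output files are produced.
  job.setOutputFormatClass(NullOutputFormat.class);

  job.setMapperClass(CountOnlyMapper.class);
  job.setNumReduceTasks(0);   // map-only job

  boolean success = job.waitForCompletion(true);
  return (success ? 0 : 1);
}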

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-59
Custom OutputFormat Example: Column Output

In this example, the ColumnOutputFormat does exactly what the usual TextOutputFormat does, except that
it pads the key with spaces so that the value will start in a particular column. It can be configured with a
parameter called keyColumnWidth to adjust how many spaces to pad with…the default width is 8.
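A minimal sketch of how getRecordWriter could honor that parameter instead of hardcoding the width (the
property name keyColumnWidth follows the description above; the rest mirrors the code on the following
slides):

@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
    throws IOException, InterruptedException {
  Configuration conf = job.getConfiguration();
  int width = conf.getInt("keyColumnWidth", 8);   // default width is 8

  Path file = getDefaultWorkFile(job, "");
  FileSystem fs = file.getFileSystem(conf);
  FSDataOutputStream fileOut = fs.create(file, false);
  return new ColumnRecordWriter<K, V>(fileOut, width);
}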

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-60
Example: ColumnOutputFormat (1)

//…imports omitted for brevity…

public class ColumnOutputFormat<K,V> extends FileOutputFormat<K,V> { 1

@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
throws IOException, InterruptedException {

Configuration conf = job.getConfiguration();

Path file = getDefaultWorkFile(job, "");


FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, false);

return new ColumnRecordWriter<K, V>(fileOut, 8);


}

1 OutputFormat classes are similar to InputFormat. File-based output should


extend the abstract base class and specify key and value types.

NOTE: The code to implement this example is in the outputformat project/example package on the student
VMs.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-61
Example: ColumnOutputFormat (2)

//…imports omitted for brevity…

public class ColumnOutputFormat<K,V> extends FileOutputFormat<K,V> {

@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) 1

throws IOException, InterruptedException {

Configuration conf = job.getConfiguration();

Path file = getDefaultWorkFile(job, "");


FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, false);

return new ColumnRecordWriter<K, V>(fileOut, 8);


}

1 getRecordWriter is a RecordWriter factory method, just like


getRecordReader.

The use of “K” and “V” as generic types instead of actual types means that this class supports any type.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-62
Example: ColumnOutputFormat (3)

//…imports omitted for brevity…

public class ColumnOutputFormat<K,V> extends FileOutputFormat<K,V> {

@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
throws IOException, InterruptedException {

Configuration conf = job.getConfiguration();

Path file = getDefaultWorkFile(job, ""); 1


FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, false);

return new ColumnRecordWriter<K, V>(fileOut, 8);


}

1 Our custom RecordWriter takes an output file pointer and the width to which
it should pad the key.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-63
Example: ColumnRecordWriter (1)

public class ColumnRecordWriter<K,V> extends RecordWriter<K, V> { 1

private DataOutputStream out;


private int columnWidth;

public ColumnRecordWriter(DataOutputStream out, int columnWidth) {


this.out = out;
this.columnWidth = columnWidth;
}

@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
out.close(); }

@Override
public void write(K key, V value) throws IOException, InterruptedException {
String outstring = String.format("%-" + columnWidth + "s%s\n",key.toString(),value.toString());
out.writeBytes(outstring); }
}
imports omitted for brevity
1 Custom RecordWriters usually extend the abstract base class RecordWriter. Our
constructor takes a pointer to the file we should write to, and the width for
padding the key string.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-64
Example: ColumnRecordWriter (2)
public class ColumnRecordWriter<K,V> extends RecordWriter<K, V> {
private DataOutputStream out;
private int columnWidth;

public ColumnRecordWriter(DataOutputStream out, int columnWidth) {


this.out = out;
this.columnWidth = columnWidth; }

@Override
public void close(TaskAttemptContext context) throws IOException,
InterruptedException {
out.close(); 1
}

@Override
public void write(K key, V value) throws IOException, InterruptedException {
String outstring = String.format("%-" + columnWidth + "s%s\n",key.toString(),value.toString());
out.writeBytes(outstring); }
}
imports omitted for brevity
1 Close the file when we’re done with it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-65
Example: ColumnRecordWriter (3)
public class ColumnRecordWriter<K,V> extends RecordWriter<K, V> {
private DataOutputStream out;
private int columnWidth;

public ColumnRecordWriter(DataOutputStream out, int columnWidth) {


this.out = out;
this.columnWidth = columnWidth; }

@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
out.close(); }

@Override
public void write(K key, V value) throws IOException, 1
InterruptedException {
String outstring = String.format("%-" + columnWidth +
"s%s\n",key.toString(),value.toString());
out.writeBytes(outstring);
}
}
imports omitted for brevity
1 The write method does the actual work of outputting the
data. Construct an output string then write it to the file.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-66
Example: Configuring Job to Use a Custom Output Format

// Driver Class...

public int run(String[] args) throws Exception {

FileInputFormat.setInputPaths(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setOutputFormatClass(ColumnOutputFormat.class); 1

job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);

boolean success = job.waitForCompletion(true);


return (success ? 0 : 1);
}

1 The driver configures the job to use the custom OutputFormat.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-67
InputFormat and OutputFormat Examples
▪ Your VM includes Eclipse projects with fully implemented code for the fixed
column width examples of InputFormat and OutputFormat

The examples we just covered (fixed width input and output format) are implemented on the VM in the
exercise workspace.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-68
Chapter Topics
Data Input and Output

▪ Creating Custom Writable and WritableComparable Implementations


▪ Hands-On Exercise: Implementing a Custom WritableComparable
▪ Saving Binary Data Using SequenceFiles and Avro Data Files
▪ Issues to Consider When Using File Compression
▪ Hands-On Exercise: Using SequenceFiles and File Compression
▪ Implementing Custom InputFormats and OutputFormats
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-69
Key Points
▪ All keys in Hadoop are WritableComparable objects
─ Writable: write and readFields methods provide serialization
─ Comparable: compareTo method compares two WritableComparable
objects
▪ Key/Value pairs can be encoded in binary SequenceFile and Avro data files
─ Useful when one job’s output is another job’s input
▪ Hadoop supports reading from and writing to compressed files
─ Use “splittable” encoding for MapReduce input files (e.g., Snappy)
▪ InputFormats handle input to Mappers by constructing
─ InputSplits dividing up the input file(s) and
─ RecordReaders to parse data from InputSplits into Key/Value pairs
▪ OutputFormats handle output from Reducers by constructing
─ RecordWriters to write Key/Value pairs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-70
Common MapReduce Algorithms
Chapter 13

Chapter Goal
Present common MapReduce algorithms (sorting, searching, indexing, TF-IDF, word co-occurrence, and the secondary sort) that serve as building blocks for more complex jobs.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-2
Common MapReduce Algorithms
In this chapter, you will learn
▪ How to sort and search large data sets
▪ How to perform a secondary sort
▪ How to index data
▪ How to compute term frequency – inverse document frequency (TF-IDF)
▪ How to calculate word co-occurrence

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-3
Introduction
▪ MapReduce jobs tend to be relatively short in terms of lines of code
▪ It is typical to combine multiple small MapReduce jobs together in a single
workflow
─ Often using Oozie (see later)
▪ You are likely to find that many of your MapReduce jobs use very similar code
▪ In this chapter we present some very common MapReduce algorithms
─ These algorithms are frequently the basis for more complex MapReduce jobs

A good way to think about MapReduce is that it’s like the UNIX philosophy. You have a series of small,
relatively simple pieces, but you chain them together in order to complete some larger, more complex task.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-4
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-5
Sorting (1)
▪ MapReduce is very well suited to sorting large data sets
▪ Recall: keys are passed to the Reducer in sorted order
▪ Assuming the file to be sorted contains lines with a single value:
─ Mapper is merely the identity function for the value (k, v) →(v, _)
─ Reducer is the identity function (k, _) →(k, '')

[Diagram: an unsorted list of names (Andrews Julie, Jones Zeke, Turing Alan, Jones David, Addams Jane,
Jones Asa, Addams Gomez, Jones David) passes through the identity Mapper; the shuffle and sort delivers
the keys to the Reducer in sorted order, producing a fully sorted output list (Addams Gomez, Addams Jane,
Andrews Julie, Jones Asa, Jones David, Jones David, Jones Zeke, Turing Alan).]

Mapper: The value passed in as input is used as the key in the output.
We’re taking advantage of the fact that Hadoop takes care of sorting the keys. By using the value (e.g. a line
of text from a file) passed into the Map function as the key in the map function’s output, these keys (lines
of text) are sorted when they are passed to the reducer, so all lines of text from all files are sorted in the
final output.
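A minimal Java sketch of this sort job (class names are illustrative); duplicates are preserved by emitting the
key once per value:

//…imports omitted for brevity…

// Mapper: emit the input line as the key so the framework sorts it for us.
public class SortMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, NullWritable.get());
  }
}

// Reducer: emit each key (line of text), now in sorted order.
public class SortReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
  @Override
  public void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    for (NullWritable v : values) {
      context.write(key, NullWritable.get());
    }
  }
}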

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-6
Sorting (2)
▪ Trivial with a single Reducer
▪ Harder for multiple Reducers

[Diagram: the same unsorted name list is processed by two Reducers; each Reducer’s own output is sorted
(for example, one produces Addams Gomez, Andrews Julie, Jones David, … and the other produces Addams
Jane, Jones Asa, Jones Zeke, Turing Alan), but concatenating the two output files does not yield a globally
sorted list.]

▪ For multiple Reducers, need to choose a partitioning function such that if


k1 < k2, partition(k1) <= partition(k2)

This all works well when you are using a single reducer, since all keys passed to the reducer will be sorted.
However, if you have multiple reducers there is a subtle but important distinction to note: the keys passed
to any given reducer are sorted. However, the default partitioner divides up the keyspace based on the
hashcode of the key so there is no global order.
An example will clarify this. Assume that the mappers all generate keys in the set (Apple, Banana, Cherry,
Date). If you have two reducers, the first might receive keys Apple and Cherry while the second receives
Banana and Date. Although the keys passed to each reducer are in sorted order, the order across reducers
is not sorted. The solution is to build global order into your partitioner: a key that sorts before another key
must go to a reducer whose partition number (as returned by getPartition) is less than or equal to that of
the reducer for the other key.
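A minimal sketch of a partitioner that preserves global order for this example by routing keys to Reducers
based on the first letter of the last name (the split point M and the assumption of two Reducers are
illustrative; Hadoop also ships a TotalOrderPartitioner that picks split points by sampling the data):

//…imports omitted for brevity…

// Keys starting with A-M go to reducer 0, N-Z to reducer 1. Because the key
// ranges themselves are in order, concatenating the reducers' output files
// yields a globally sorted result.
public class AlphabetPartitioner extends Partitioner<Text, NullWritable> {
  @Override
  public int getPartition(Text key, NullWritable value, int numReducers) {
    String k = key.toString();
    if (numReducers < 2 || k.isEmpty()) {
      return 0;
    }
    char first = Character.toUpperCase(k.charAt(0));
    return (first <= 'M') ? 0 : 1;
  }
}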

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-7
Sorting as a Speed Test of Hadoop
▪ Sorting is frequently used as a speed test for a Hadoop cluster
─ Mapper and Reducer are trivial
─ Therefore sorting is effectively testing the Hadoop framework’s I/O
▪ Good way to measure the increase in performance if you enlarge your cluster
─ Run and time a sort job before and after you add more nodes
─ terasort is one of the sample jobs provided with Hadoop
─ Creates and sorts very large files

The TeraSort is a common source of “benchmarketing” and there are usually too many variables at play for
it to be an ideal way to compare the overall, real-world performance of two different clusters. However, it
can be useful as a way to test optimizations on your own cluster by running it before you make a change
(such as adding more nodes), then running it again afterwards and comparing the results.
In case you don’t have a huge set of data for ‘terasort’ to work on, Hadoop also comes with a ‘teragen’
utility that can generate this data for you.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-8
Searching
▪ Assume the input is a set of files containing lines of text
▪ Assume the Mapper has been passed the pattern for which to search as a
special parameter
─ We saw how to pass parameters to a Mapper in a previous chapter
▪ Algorithm:
─ Mapper compares the line against the pattern
─ If the pattern matches, Mapper outputs (line, _)
─ Or (filename+line, _), or …
─ If the pattern does not match, Mapper outputs nothing
─ Reducer is the Identity Reducer
─ Just outputs each intermediate key

“If the pattern matches, Mapper outputs (line, _)” – this simulates the standard UNIX grep program
“Or (filename+line, _)” – this would simulate grep’s -H and -n options.
Common question: So why not just use grep?
Answer: Can your grep search petabytes of data? In parallel, across hundreds of machines simultaneously?
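A minimal Java sketch of the search Mapper; the configuration property name grep.pattern (assumed to be
set by the driver) and the class name are illustrative:

//…imports omitted for brevity…

// Emit only the lines that contain the search pattern (like UNIX grep).
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private String pattern;

  @Override
  public void setup(Context context) {
    pattern = context.getConfiguration().get("grep.pattern");
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().contains(pattern)) {
      context.write(value, NullWritable.get());
    }
  }
}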

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-9
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-10
Indexing
▪ Assume the input is a set of files containing lines of text
▪ Key is the byte offset of the line, value is the line itself
▪ We can retrieve the name of the file using the Context object
─ More details on how to do this in the Exercise

As explained in the lab exercise, the file name can be accessed from the Context object like this:
FileSplit fs = (FileSplit) context.getInputSplit();
String fileName = fs.getPath().getName();
Note that the InputSplit returned by the MockReporter in MRUnit can be safely cast to a FileSplit, just as
shown above. However, the file name returned is a constant value (literally “somefile”).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-11
Inverted Index Algorithm
▪ Mapper:
─ For each word in the line, emit (word, filename)
▪ Reducer:
─ Identity function
─ Collect together all values for a given key (i.e., all filenames for a
particular word)
─ Emit (word, filename_list)

The InvertedIndex algorithm is described in detail in Lin & Dyer, pages 68-74.
An inverted index is what you’ll find in the back of a book: a list of words followed by references to where
those words appear. The diagram on the next slide makes the goal much more clear.
Question: Why is it called an “inverted index”?
Answer: There is also a “forward index” which is a list of documents followed by references to the
words these documents contain (in other words, the opposite of an inverted index). Forward indexes
are commonly used in search engine implementations (http://en.wikipedia.org/wiki/
Search_engine_indexing#The_forward_index).
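A minimal Java sketch of the inverted index Mapper and Reducer (tokenization and class names are
illustrative; the exercise solution on the VM may differ in detail):

//…imports omitted for brevity…

// Mapper: for each word in the line, emit (word, filename).
public class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String fileName;

  @Override
  public void setup(Context context) {
    FileSplit split = (FileSplit) context.getInputSplit();
    fileName = split.getPath().getName();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().toLowerCase().split("\\W+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new Text(fileName));
      }
    }
  }
}

// Reducer: collect all filenames for a word into a comma-separated list.
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    StringBuilder fileList = new StringBuilder();
    for (Text value : values) {
      if (fileList.length() > 0) {
        fileList.append(",");
      }
      fileList.append(value.toString());
    }
    context.write(key, new Text(fileList.toString()));
  }
}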

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-12
Inverted Index: Dataflow

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-13
Aside: Word Count
▪ Recall the WordCount example we used earlier in the course
─ For each word, Mapper emitted (word, 1)
─ Very similar to the inverted index
▪ This is a common theme: reuse of existing Mappers, with minor modifications

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-14
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-15
Hands-On Exercise: Creating an Inverted Index
▪ In this Hands-On Exercise, you will write a MapReduce program to generate an
inverted index of a set of documents
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-16
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-17
Term Frequency – Inverse Document Frequency
▪ Term Frequency – Inverse Document Frequency (TF-IDF)
─ Answers the question “How important is this term in a document?”
▪ Known as a term weighting function
─ Assigns a score (weight) to each term (word) in a document
▪ Very commonly used in text processing and search
▪ Has many applications in data mining

TF-IDF might be useful for automatic suggestion of keywords (e.g. in a content management system),
ranking in a search engine results page or in natural language processing.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-18
TF-IDF: Motivation
▪ Merely counting the number of occurrences of a word in a document is not a
good enough measure of its relevance
─ If the word appears in many other documents, it is probably less relevant
─ Some words appear too frequently in all documents to be relevant
─ Known as ‘stopwords’
─ e.g. a, the, this, to, from, etc.
▪ TF-IDF considers both the frequency of a word in a given document and the
number of documents which contain the word

Stopwords will generally include words like a, the, this, that, these, those, to, from, etc.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-19
TF-IDF: Data Mining Example
▪ Consider a music recommendation system
─ Given many users’ music libraries, provide “you may also like” suggestions
▪ If user A and user B have similar libraries, user A may like an artist in user B’s
library
─ But some artists will appear in almost everyone’s library, and should
therefore be ignored when making recommendations
─ Almost everyone has The Beatles in their record collection!

The fact that you have a Beatles album does not indicate that you are a Beatles fan…most people have one.
So in this case, a Beatles album might be considered a stopword for a music collection.
If you have a Yoko Ono album, on the other hand, that is going to be more significant. Very few people have
a Yoko Ono album. Even fewer are likely to admit having a Yoko Ono album!

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-20
TF-IDF Formally Defined
▪ Term Frequency (TF)
─ Number of times a term appears in a document (i.e., the count)
▪ Inverse Document Frequency (IDF)
─ IDF = log(N / n)
─ N: total number of documents
─ n: number of documents that contain a term
▪ TF-IDF
─ TF × IDF

TF-IDF is the product of two values:

1. term frequency (TF)


2. inverse document frequency (IDF)
Multiply these together to calculate the TF-IDF.
An extended explanation of the TF-IDF calculation inspired by (but deviating slightly from) this training
material, with code samples and unit tests is available online in a three-part blog by Marcello de Sales
(http://marcellodesales.wordpress.com/2009/12/31/tf-idf-in-hadoop-part-1-
word-frequency-in-doc/).
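A small worked example with illustrative numbers: suppose a term appears tf = 3 times in a document, the
corpus contains N = 1,000 documents, and the term occurs in n = 10 of them. Using a base-10 logarithm,
IDF = log(N / n) = log(100) = 2, so TF-IDF = 3 × 2 = 6. By contrast, a stopword that appears in all 1,000
documents has IDF = log(1) = 0, so its TF-IDF is 0 no matter how often it appears in the document.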

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-21
Computing TF-IDF
▪ What we need:
─ Number of times t appears in a document
─ Different value for each document
─ Number of documents that contain t
─ One value for each term
─ Total number of documents
─ One value

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-22
Computing TF-IDF With MapReduce
▪ Overview of algorithm: 3 MapReduce jobs
─ Job 1: compute term frequencies
─ Job 2: compute number of documents each word occurs in
─ Job 3: compute TF-IDF
▪ Notation in following slides:
─ docid = a unique ID for each document
─ contents = the complete text of each document
─ N = total number of documents
─ term = a term (word) found in the document
─ tf = term frequency
─ n = number of documents a term appears in
▪ Note that real-world systems typically perform ‘stemming’ on terms
─ Removal of plurals, tenses, possessives, etc.

TF-IDF is a good example of how we chain jobs together such that the output of one job is used as input for
the next job.
The ‘docid’ uniquely identifies the document. It will typically be an absolute file path or a URL, since both of
these uniquely identify a document.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-23
Computing TF-IDF: Job 1 – Compute tf
▪ Mapper
─ Input: (docid, contents)
─ For each term in the document, generate a (term, docid) pair
─ i.e., we have seen this term in this document once
─ Output: ((term, docid), 1)
▪ Reducer
─ Sums counts for word in document
─ Outputs ((term, docid), tf)
─ i.e., the term frequency of term in docid is tf
▪ We can add a Combiner, which will use the same code as the Reducer

This is a practical example of where we create a “Pair” object (i.e. a tuple containing two distinct objects,
also known as a composite key) to use as the key. In our mapper, we have an output key which contains
both the term (word) and the document ID in which that word was found.
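A minimal Java sketch of the Job 1 Mapper, encoding the (term, docid) composite key as a single
tab-separated Text for simplicity (a custom WritableComparable would work equally well); the class name
and tokenization are illustrative. The Reducer (and optional Combiner) is the same summing Reducer used
in WordCount.

//…imports omitted for brevity…

// Job 1 Mapper: for each term, emit ((term, docid), 1), encoded as "term\tdocid".
public class TermFrequencyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private String docId;

  @Override
  public void setup(Context context) {
    // Use the input file name as the document ID.
    docId = ((FileSplit) context.getInputSplit()).getPath().getName();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String term : value.toString().toLowerCase().split("\\W+")) {
      if (!term.isEmpty()) {
        context.write(new Text(term + "\t" + docId), ONE);
      }
    }
  }
}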

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-24
Computing TF-IDF: Job 2 – Compute n
▪ Mapper
─ Input: ((term, docid), tf)
─ Output: (term, (docid, tf, 1))
▪ Reducer
─ Sums 1s to compute n (number of documents containing term)
─ Note: need to buffer (docid, tf) pairs while we are doing this (more later)
─ Outputs ((term, docid), (tf, n))

The value emitted by the mapper is a three-part composite value containing the document ID, the term
frequency, and the literal value 1.
This also illustrates the need for, and the general technique for, passing values calculated in one step of a
multi-job MapReduce chain to a later step in the chain.
The note about buffering the (docid, tf) pairs explains that we need to hang on to these pairs while
summing up the literal ‘1’ values in the composite value, in order to compute ‘n’ (the number of
documents the term appears in). As described on a later slide, there’s a chance that we might not be
able to hold all of those pairs in memory (given a sufficiently large data set).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-25
Computing TF-IDF: Job 3 – Compute TF-IDF
▪ Mapper
─ Input: ((term, docid), (tf, n))
─ Assume N is known (easy to find)
─ Output ((term, docid), TF × IDF)
▪ Reducer
─ The identity function

The astute student might note that step 3 is trivial (since it’s just multiplying TF and IDF) and point out
that since N is known all along, you could just as well have done this multiplication in the reducer of step
2. Indeed that would work (and would be more efficient) but would make the example somewhat more
complicated to understand (which is why we teach this as a three-step process).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-26
Computing TF-IDF: Working At Scale
▪ Job 2: We need to buffer (docid, tf) pairs while summing 1s (to
compute n)
─ Possible problem: pairs may not fit in memory!
─ In how many documents does the word “the” occur?
▪ Possible solutions
─ Ignore very-high-frequency words
─ Write out intermediate data to a file
─ Use another MapReduce pass

The notion with the extra pass is that in the existing job 2 you’d just calculate and emit (term, n) pairs, and
then you’d add a job 2a that joins (term, n) with the ((term, docid), tf) data you had previously calculated in
job 1, yielding ((term, docid), (tf, n)).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-27
TF-IDF: Final Thoughts
▪ Several small jobs add up to full algorithm
─ Thinking in MapReduce often means decomposing a complex algorithm into
a sequence of smaller jobs
▪ Beware of memory usage for large amounts of data!
─ Any time when you need to buffer data, there’s a potential scalability
bottleneck

Be sure to emphasize the point about the risks of buffering data. This will come up again during the
discussion of joining datasets later in the course, so you can quiz students about it then.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-28
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-29
Word Co-Occurrence: Motivation
▪ Word co-occurrence measures the frequency with which two words appear
close to each other in a corpus of documents
─ For some definition of ‘close’
▪ This is at the heart of many data-mining techniques
─ Provides results for “people who did this, also do that”
─ Examples:
─ Shopping recommendations
─ Credit risk analysis
─ Identifying ‘people of interest’

Identifying words which appear near one another allows you to add context. For example, someone
searching for “drug store” would likely not be of interest to law enforcement while someone searching for
“drug smuggling” probably would be. Similar examples include: “soldering gun” versus “machine gun” or
“bank account” versus “bank robbery.”

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-30
Word Co-Occurrence: Algorithm
▪ Mapper

map(docid a, doc d) {
foreach w in d do
foreach u near w do
emit(pair(w, u), 1)
}

▪ Reducer

reduce(pair p, Iterator counts) {


s = 0
foreach c in counts do
s += c
emit(p, s)
}

The mapper is similar to the mapper used in WordCount, only there is an extra inner loop to iterate over
words near the current word. The definition of “near” is up to the developer (you could look only at words
immediately adjacent to the current word, or you could analyze as many words “to the left of” or “to the
right of” the current word as you wish). For each word pair, you emit the literal value 1 just as you did in the
WordCount mapper.
And just as in WordCount, the iterator iterates over these (only in this case, they are word pairs rather than
individual words) and sums up the occurrences.
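A minimal Java sketch of the Mapper, using “immediately adjacent” as the definition of near (the class name
is illustrative); the Reducer is the same summing Reducer used in WordCount:

//…imports omitted for brevity…

// Mapper: emit each pair of adjacent words with a count of 1.
public class CoOccurrenceMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().toLowerCase().split("\\W+");
    for (int i = 0; i < words.length - 1; i++) {
      if (!words[i].isEmpty() && !words[i + 1].isEmpty()) {
        context.write(new Text(words[i] + "," + words[i + 1]), ONE);
      }
    }
  }
}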

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-31
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-32
Hands-On Exercises: Calculating Word Co-Occurrence
▪ In these Hands-On Exercises you will write an application that counts the
number of times words appear next to each other
▪ If you complete the first exercise, please attempt the bonus step, in which you
will rewrite your code to use a custom WritableComparable
▪ Please refer to the Hands-On Exercise Manual

*** NEW NOTE – CUR-886 ***


Word co-occurrence—the analysis of pairs of words that appear near each other—aims to find similarities
of meaning between word pairs and/or similarities in meaning in word patterns.
Technically, our word co-occurrence lab is actually counting bigrams:
• unigram = one word
• bigram = two words right next to each other
• n-gram = more than two
References:
http://www.soc.ucsb.edu/faculty/mohr/classes/soc4/summer_08/pages/
Resources/Readings/TheoryofMeaning.pdf
http://en.wikipedia.org/wiki/Bigram

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-33
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-34
Secondary Sort: Motivation (1)
▪ Recall that keys are passed to the Reducer in sorted order
▪ The list of values for a particular key is not sorted
─ Order may well change between different runs of the MapReduce job

Andrews Julie 1935-Oct-01


Jones Zeke 2001-Dec-12
Turing Alan 1912-Jun-23
Jones David 1947-Jan-08
Addams Jane 1960-Sep-06
Jones Asa 1901-Aug-08
Addams Gomez 1964-Sep-18
Jones David 1945-Dec-30

In this example, we start with a list of names and birthdates in random order. If the last name is the key, we
can easily sort using the identity mapper and reducer.
However, the order of the items within a particular key group is not specified. Notice that the Joneses in
this list aren’t sorted at all, and in fact, on different runs may be in a different order.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-35
Secondary Sort: Motivation (2)
▪ Sometimes a job needs to receive the values for a particular key in a sorted
order
─ This is known as a secondary sort
▪ Example: Sort by Last Name, then First Name
 

Secondary sort is being explained here so it can be referenced in the discussion of reduce-side joins in the
next chapter.
Sometimes this is as simple as wanting the final output to be ordered by two keys. In this example, the
primary key is the last name, but we also want to sort by first name. If we use the default partitioner, all
Joneses will go to the same reducer. If we pass them in sorted order, the reducer can simply output them,
guaranteeing that the final output will have all Joneses in a single file, ordered by first name.
The question may arise: why don’t we just make the key be the composite lastname,firstname, so that the
sorting happens automatically? The problem with that approach is that the partitioner will then treat
Jones,David as a different key than Jones,Asa, so they won’t be guaranteed to go to the same reducer, and
even if they are, they won’t be grouped into the same call.
We need to sort by last and first name, but partition by last name only.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-36
Secondary Sort: Motivation (3)
▪ Example: Find the latest birth year for each surname in a list
▪ Naïve solution
─ Reducer loops through all values, keeping track of the latest year
─ Finally, emit the latest year
▪ Better solution
─ Pass the values sorted by year in descending order to the Reducer, which can
then just emit the first value

Addams 1964
Andrews 1935
Jones 2001
Turing 1912

Another reason secondary sort would be useful is to make implementation of a min() or max() function
in your reducer easy. If you can ensure that the values are passed to your reducer in sorted order (which
normally you cannot, but by using secondary sort you can), then the smallest value is simply the first
value passed in. Likewise, you could flip the logic in your comparator (i.e. to return -1 when you previously
returned 1, and vice versa) to achieve descending order. In this case, the largest value would be the first
passed in. In either case, you need not iterate through all the values to find the largest one.
In this example, we do the same as in the name sort except that we choose to do a secondary sort by birth
year instead of first name. And we do it in descending order. (This might be useful for, say, genealogical
research)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-37
Implementing Secondary Sort: Composite Keys
▪ To implement a secondary sort, the intermediate key should be a composite
of the ‘actual’ (natural) key and the value
▪ Implement a mapper to construct composite keys

let map(k, v) =
emit(new Pair(v.getPrimaryKey(), v.getSecondaryKey()), v)

[Diagram: each input record is mapped to a composite key of the form LastName#Year, with the full record
as the value. For example, "Jones Zeke 2001-Dec-12" becomes (Jones#2001, Jones Zeke 2001-Dec-12) and
"Turing Alan 1912-Jun-23" becomes (Turing#1912, Turing Alan 1912-Jun-23).]

The secondary sort is described in TDG 3e on pages 277-283 (TDG 2e, 241-245). It is also discussed in Lin &
Dyer on pages 57-58.
This slide explains that you combine the original key and the value into a new composite key, and then
use this composite as the output key for your Mapper. But recall the problem in partitioning composite
keys described earlier in the course (which is that a tuple(a,b) does not yield the same hashcode as a tuple
(a,c) and therefore they don’t necessarily go to the same reducer even though they have the same first
element). We must work around this by creating our own Partitioner which examines just the first element
(the original or “natural” key) when selecting a reducer. (next slide)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-38
Implementing Secondary Sort: Partitioning Composite Keys
▪ Create a custom partitioner
─ Use natural key to determine which Reducer to send the key to

let getPartition(Pair k, Text v, int numReducers) =


return(k.getPrimaryKey().hashCode() % numReducers)

[Diagram: the partitioner routes composite keys by their natural key only, so every Jones#* key
(Jones#1947, Jones#1901, Jones#1945) goes to Partition 0, and every Addams#* key (Addams#1860,
Addams#1964) goes to Partition 1.]

If we used the default partitioner (Hash Partitioner) that would result in Addams#1860 possibly going to a
different reducer than Addams#1964. We need to make sure our partitioner uses only the “natural” part of
the key (last name).
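A minimal Java sketch of such a partitioner, assuming the composite key is carried as Text in the
lastname#year form shown above (with a custom Pair WritableComparable you would call getPrimaryKey()
instead); the class name is illustrative:

//…imports omitted for brevity…

// Partition on the natural key (the last name) only, so that Addams#1860 and
// Addams#1964 are guaranteed to go to the same Reducer.
public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReducers) {
    String naturalKey = key.toString().split("#")[0];
    return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}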

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-39
Implementing Secondary Sort: Sorting Composite Keys
▪ Comparator classes are classes that compare objects
─ compare(A,B) returns:
─ 1 if A>B
─ 0 if A=B
─ -1 if A<B
▪ Custom comparators can be used to sort composite keys
─ extend WritableComparator
─ override int compare()
▪ Two comparators are required:
─ Sort Comparator
─ Group Comparator

A secondary sort can be achieved in MapReduce by sorting and grouping the keys in a particular way.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-40
Implementing Secondary Sort: Sort Comparator
▪ Sort Comparator
─ Sorts the input to the Reducer
─ Uses the full composite key: compares natural key first; if equal, compares
secondary key

let compare(Pair k1, Pair k2) =


compare k1.getPrimaryKey(), k2.getPrimaryKey()
if equal
compare k1.getSecondaryKey(), k2.getSecondaryKey()

Addams#1860 > Addams#1964


Addams#1860 < Jones#1945

Note that the desired ordering has been achieved (within each key, the values appear in descending
numeric order).
Note: this is shown in pseudocode, because it’s not actually that simple.
When overriding WritableComparator, the primary “compare” method actually compares byte strings
rather than objects because it is much more efficient, and for most writables (such as Text or IntWritable),
this yields the correct result. For more complex objects (such as our hypothetical Pair example here), you
actually need to read the bytes and deserialize the Pair objects, and then call compare with the objects.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-41
Implementing Secondary Sort: Grouping Comparator
▪ Grouping Comparator
─ Uses ‘natural’ key only
─ Determines which keys and values are passed in a single call to the Reducer

let compare(Pair k1, Pair k2) =


compare k1.getPrimaryKey(), k2.getPrimaryKey()

Addams#1860 = Addams#1964
Addams#1860 < Jones#1945

Note that only the natural key is used, so Addams#1860 is considered “equal to” Addams#1964.
From the API documentation for the setGroupingComparatorClass method:
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed in a single call to the reduce function
if K1 and K2 compare as equal. Since setSortComparatorClass(Class) [NOTE: this was described on a
previous slide] can be used to control how keys are sorted, this can be used in conjunction to simulate
secondary sort on values.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-42
Implementing Secondary Sort: Setting Comparators
▪ Configure the job to use both comparators
 
 
 

public class MyDriver extends Configured implements Tool {

public int run(String[] args) throws Exception {



job.setSortComparatorClass(NameYearComparator.class);
job.setGroupingComparatorClass(NameComparator.class);

}
}
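A minimal sketch of what NameYearComparator and NameComparator might look like, again assuming the
composite key is a Text of the form lastname#year (with years sorted in descending order, as in the
latest-birth-year example). A real implementation built on a custom WritableComparable would deserialize
the key objects as described earlier.

//…imports omitted for brevity…

// Sort comparator: order by last name, then by year descending.
public class NameYearComparator extends WritableComparator {
  protected NameYearComparator() {
    super(Text.class, true);   // true = instantiate key objects for comparison
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String[] k1 = a.toString().split("#");
    String[] k2 = b.toString().split("#");
    int result = k1[0].compareTo(k2[0]);   // natural key (last name)
    if (result == 0) {
      result = k2[1].compareTo(k1[1]);     // year, descending
    }
    return result;
  }
}

// Grouping comparator: compare the natural key only, so all records for a
// given last name are passed to the Reducer in a single call.
public class NameComparator extends WritableComparator {
  protected NameComparator() {
    super(Text.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String name1 = a.toString().split("#")[0];
    String name2 = b.toString().split("#")[0];
    return name1.compareTo(name2);
  }
}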

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-43
Secondary Sort: Summary

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-44
Bonus Exercise: Exploring a Secondary Sort Example
▪ If you have time and want more depth
─ Bonus Exercise: explore the effects of different components in a secondary
sort job
▪ Please refer to the Bonus Exercises in the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-45
Chapter Topics
Common MapReduce Algorithms

▪ Sorting and Searching Large Data Sets


▪ Indexing Data
▪ Hands-On Exercise: Creating an Inverted Index
▪ Computing Term Frequency – Inverse Document Frequency (TF-IDF)
▪ Calculating Word Co-Occurrence
▪ Hands-On Exercise: Calculating Word Co-Occurrence
▪ Performing a Secondary Sort
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-46
Key Points (1)
Common MapReduce Algorithms
▪ Sorting
─ Simple for single-Reducer jobs, more complex with multiple Reducers
▪ Searching
─ Pass a match string parameter to a search mapper
─ Emit matching records, ignore non-matching records
▪ Indexing
─ Inverse Mapper: emit (term, file)
─ Identity Reducer
▪ Term frequency – inverse document frequency (TF-IDF)
─ Often used for recommendation engines and text analysis
─ Three sequential MapReduce jobs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-47
Key Points (2)
▪ Word co-occurrence
─ Mapper: emits pairs of “close” words as keys, their frequencies as values
─ Reducer: sum frequencies for each pair
▪ Secondary Sort
─ Define a composite key type with natural key and secondary key
─ Partition by natural key
─ Define comparators for sorting (by both keys) and grouping (by natural key)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-48
Joining Data Sets in MapReduce
Jobs
Chapter 14

Chapter Goal
Teach students how to join data sets within MapReduce jobs using Map-side and Reduce-side joins, and when to prefer higher-level tools instead.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-2
Joining Data Sets in MapReduce Jobs
In this chapter, you will learn
▪ How to write a Map-side join
▪ How to write a Reduce-side join

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-3
Introduction
▪ We frequently need to join data together from two sources as part of a
MapReduce job, such as
─ Lookup tables
─ Data from database tables
▪ There are two fundamental approaches: Map-side joins and Reduce-side joins
▪ Map-side joins are easier to write, but have potential scaling issues
▪ We will investigate both types of joins in this chapter

This is the same concept as used in relational databases. You have two distinct data sets which each contain
some common key and you want to relate this together in order to produce a new set which has data from
both input data sets. For example, you may have a customer list and a list of sales (which contains the ID for
the customer to which this order was sold). You might join these data sets (based on the customer ID field
which is common to each) in order to produce a report of what each customer bought.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-4
But First…
▪ But first…
▪ Avoid writing joins in Java MapReduce if you can!
▪ Tools such as Impala, Hive, and Pig are much easier to use
─ Save hours of programming
▪ If you are dealing with text-based data, there really is no reason not to use
Impala, Hive, or Pig

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-5
Chapter Topics
Joining Data Sets in MapReduce Jobs

▪ Writing a Map-side Join


▪ Writing a Reduce-side Join
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-6
Map-Side Joins: The Algorithm
▪ Basic idea for Map-side joins:
─ Load one set of data into memory, stored in a hash table
─ Key of the hash table is the join key
─ Map over the other set of data, and perform a lookup on the hash table
using the join key
─ If the join key is found, you have a successful join
─ Otherwise, do nothing

An associative array is commonly called a ‘map’ in Java (e.g. the HashMap class is an example of this). It is
a data structure that stores a value that can be quickly accessed given its key. The term ‘associative array’
is a little unusual to Java programmers (though widely used by perl programmers), but is preferred here so
as not to confuse map (the more common name for this data structure) with map (the method name in the
Mapper).
You should draw this on the whiteboard as you explain the process to produce a map-side join.
Using the customer/order example described previously, you could store the customer data file in the
DistributedCache, then read that file in your Mapper and store its contents in the associative array (in which
the key is the customer ID and the value is an object containing the details for that customer). Loading
this data would be done in the setup() method of the Mapper (which is called once, before the map
method is ever called). The data file containing the order information, conversely, would be supplied
as input to your job. As each order record is passed to your map method, you simply look up the customer
to which it relates in the associative array by using the customer ID. Thus, you are able to join order
and customer data and write out the report as needed.
The problem with this approach can be seen on the first sub-point (reading data into memory). This is
discussed on the next slide.
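A minimal Java sketch of this customer/order map-side join; the file layout, field positions, and the
customer.file property are illustrative assumptions, and in practice the lookup file would typically be
distributed to the worker nodes with the distributed cache.

//…imports omitted for brevity…

// Map-side join: load the (small) customer data set into a HashMap in setup(),
// then look up each order record's customer as it is mapped.
public class OrderJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Map<String, String> customers = new HashMap<String, String>();

  @Override
  public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // The driver is assumed to have set "customer.file" to the lookup file path.
    Path customerFile = new Path(conf.get("customer.file"));
    FileSystem fs = customerFile.getFileSystem(conf);
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(customerFile)));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split("\t");       // custId <tab> custName
      customers.put(fields[0], fields[1]);
    }
    reader.close();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] order = value.toString().split("\t");   // custId <tab> orderDetails
    String custName = customers.get(order[0]);
    if (custName != null) {                           // successful join
      context.write(new Text(custName), new Text(order[1]));
    }
  }
}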

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-7
Map-Side Joins: Problems, Possible Solutions
▪ Map-side joins have scalability issues
─ The associative array may become too large to fit in memory
▪ Possible solution: break one data set into smaller pieces
─ Load each piece into memory individually, mapping over the second data set
each time
─ Then combine the result sets together

One way of avoiding this limitation is to ensure you read the smaller of the two data sets into the
associative array. In the previous example, we could have produced the same result by reading the order
data into the associative array and iterating over customer records in the mapper instead. However,
matching customers to orders is a one-to-many join (each customer likely returns to order more things in
future visits, so we have more order records than customer records). Since the memory limitation is based
on the size of the associative array, reading the smaller data set into memory this way will make running
out of memory less likely.
However, it’s still possible (given a sufficiently large set of customers), so the other solution described on
this slide is a possible workaround. Reduce-side joins (to be discussed in a moment) are perhaps a better
solution, though more complicated to implement.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-8
Chapter Topics
Joining Data Sets in MapReduce Jobs

▪ Writing a Map-side Join


▪ Writing a Reduce-side Join
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-9
Reduce-Side Joins: The Basic Concept
▪ For a Reduce-side join, the basic concept is:
─ Map over both data sets
─ Emit a (key, value) pair for each record
─ Key is the join key, value is the entire record
─ In the Reducer, do the actual join
─ Because of the Shuffle and Sort, values with the same key are brought
together

The problem with map-side joins was that you read one data set into memory and iterated over the other
data set. If the first data set was too large, you’d run out of memory. What makes Hadoop scalable is that
you simply process key/value pairs one at a time and you tend to avoid maintaining any state between such
calls.
Reduce-side join takes advantage of this approach by simply reading both data sets simultaneously. In
your mapper, any given record you are passed could belong to the first data set or the second (yes, you
really are intentionally mixing them). You will find the common join key (e.g. customer ID) for the record
you have been given, and then output that ID as the key and output the record as the value. Because keys
are grouped together when they are passed to the reducer, the values passed to the reducer will be all
the records from both datasets for a given key. You will then simply need to merge them together. This is
described over the next several slides using the example of Human Resources data.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-10
Reduce-Side Joins: Example

You have two types of records (employee records and location records). The employee record contains a
reference to the location ID (i.e. a foreign key), so the location ID will be the field we join on.
The result we want to achieve is a single record which contains employee and location data for a given
employee.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-11
Example Record Data Structure
▪ A data structure to hold a record could look like this:

class Record {
enum RecType { emp, loc };
RecType type;

String empId;
String empName;
int locId;
String locName;
}

▪ Example records

type: emp type: loc


empId: 002 empId: <null>
empName: Levi Strauss empName: <null>
locId: 2 locId: 4
locName: <null> locName: London

Because we’re going to be mixing two record types, we need to define a data structure which can hold both
types of data.
The RecType field is an enum reference which will be used later to identify whether we have an employee
record or a location record.
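Instructor note (optional): if students ask what this pseudocode would look like in real Java, a minimal sketch such as the following can be shown. It is not part of the course code; it simply implements Hadoop's Writable interface so the record can be used as a map output value, and the class and field names mirror the pseudocode.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Record implements Writable {
  public enum RecType { EMP, LOC }

  public RecType type;
  public String empId = "";
  public String empName = "";
  public int locId;
  public String locName = "";

  // Serialize all fields; the record type is written first so readFields
  // can restore it before the rest of the data.
  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(type.name());
    out.writeUTF(empId);
    out.writeUTF(empName);
    out.writeInt(locId);
    out.writeUTF(locName);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    type = RecType.valueOf(in.readUTF());
    empId = in.readUTF();
    empName = in.readUTF();
    locId = in.readInt();
    locName = in.readUTF();
  }
}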

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-12
Reduce-Side Join: Mapper

void map(k, v) {
Record r = parse(v);
emit (r.locId, r);
}

In the mapper, we parse whichever kind of record we were given and emit the location ID as the key and
the record itself as the value.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-13
Reduce-Side Join: Shuffle and Sort

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-14
Reduce-Side Join: Reducer

void reduce(k, values) {


Record thisLocation;
List<Record> employees;

for (Record v in values) {


if (v.type == RecType.loc) {
thisLocation = v;
} else {
employees.add(v);
}
}
for (Record e in employees) {
e.locName = thisLocation.locName;
emit(e);
}
}

Now we iterate over the values to do the join. Since there are many employees per location, we will have
one location record among the many employee records. Because we need the location record to do the join,
and because we don't know where it will appear among the values, we iterate over all the records and buffer
the employee records for later. Once we have read them all, we are certain to have seen the location record,
so we can do the join and emit the desired output.
But wait, doesn't reading them all into a list give us the same problem we had in map-side joins, namely
that we could have more employee records than will fit into memory? Yes!

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-15
Reduce-Side Join: Reducer Grouping

The dotted lines represent data grouped by key for a single call to reduce(). (This becomes relevant in a few
slides, when we have to write a custom grouping comparator.)

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-16
Scalability Problems With Our Reducer
▪ All employees for a given location are buffered in the Reducer
─ Could result in out-of-memory errors for large data sets

…
for (Record v in values) {
  if (v.type == RecType.loc) {
    thisLocation = v;
  } else {
    employees.add(v);
  }
}

▪ Solution: Ensure the location record is the first one to arrive at the Reducer
─ Using a Secondary Sort

And now it becomes clear why we talked about secondary sort previously. We are only buffering all the
employee records into a list because we don't know where the location record may occur among the values.
By using the secondary sort technique to make sure the location record appears first, before any employee
records are received, we eliminate the need to buffer any records at all and therefore eliminate our memory
problem.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-17
A Better Intermediate Key (1)

class LocKey {
int locId;
boolean isLocation;

public int compareTo(LocKey k) {


if (locId != k.locId) {
return Integer.compare(locId, k.locId);
} else {
return Boolean.compare(k.isLocation, isLocation);
}
}

public int hashCode() {


return locId;
}
}

This key is described on the next two slides.


Note that a Java class that implemented the pseudocode on the slide would implement the
WritableComparable interface.
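If you want to show what that might look like, here is a minimal sketch (not part of the course materials) of the key as a WritableComparable. The equals() override is included only to keep object equality consistent with compareTo().

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class LocKey implements WritableComparable<LocKey> {
  public int locId;
  public boolean isLocation;

  @Override
  public int compareTo(LocKey k) {
    if (locId != k.locId) {
      return Integer.compare(locId, k.locId);
    }
    // Arguments are reversed so a key with isLocation == true compares as
    // "less than" one with false, making location records sort first.
    return Boolean.compare(k.isLocation, isLocation);
  }

  @Override
  public int hashCode() {
    // Hash on the location ID only, so both record types for a location
    // are sent to the same Reducer by the default HashPartitioner.
    return locId;
  }

  @Override
  public boolean equals(Object o) {
    return (o instanceof LocKey) && compareTo((LocKey) o) == 0;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(locId);
    out.writeBoolean(isLocation);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    locId = in.readInt();
    isLocation = in.readBoolean();
  }
}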

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-18
A Better Intermediate Key (2)
class LocKey {
int locId; 1
boolean isLocation;

public int compareTo(LocKey k) {


if (locId != k.locId) {
return Integer.compare(locId, k.locId);
} else {
return Boolean.compare(k.isLocation, isLocation);
}
}

public int hashCode() {


return locId;
}
}

1 Example Keys:

locId: 4 locId: 4
isLocation: true isLocation: false

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-19
A Better Intermediate Key (3)
class LocKey {
int locId;
boolean isLocation;

public int compareTo(LocKey k) { 1


if (locId != k.locId) {
return Integer.compare(locId, k.locId);
} else {
return Boolean.compare(k.isLocation, isLocation);
}
}

public int hashCode() {


return locId;
}
}

1 The compareTo method ensures that location keys will sort earlier than
employee keys for the same location.

 
locId: 4, isLocation: true   sorts before   locId: 4, isLocation: false

Boolean.compare(x, y) returns:


0 if x == y; a value less than 0 if !x && y; and a value greater than 0 if x && !y.
That is, true is considered "greater than" false; if you think of false=0 and true=1, this is the same as an
arithmetic comparison. Note that the arguments are deliberately reversed in compareTo (k.isLocation is
passed first), so a key with isLocation set to true compares as less than one where it is false. That is what
makes location keys sort ahead of employee keys for the same location.
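If students get tangled up in the argument order, a tiny throwaway example (not course code) makes the point quickly:

public class BooleanCompareDemo {
  public static void main(String[] args) {
    boolean locationKey = true;   // isLocation for a location record
    boolean employeeKey = false;  // isLocation for an employee record

    // Passing the other key's flag first (as the slide's compareTo does)
    // yields a negative result, so the location key sorts ahead.
    System.out.println(Boolean.compare(employeeKey, locationKey)); // -1
    // Passing them in the "natural" order would sort the location key last.
    System.out.println(Boolean.compare(locationKey, employeeKey)); // 1
  }
}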

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-20
A Better Intermediate Key (4)
class LocKey {
int locId;
boolean isLocation;

public int compareTo(LocKey k) {


if (locId != k.locId) {
... code removed for space ...
}
}

public int hashCode() { 1

return locId;
}
}

1 The hashCode method only looks at the location ID portion of the key.
This ensures that all records with the same location ID will go to the same Reducer.
This is an alternative to providing a custom Partitioner.

 
locId: 4 == locId: 4
isLocation: true isLocation: false

By default, Hadoop uses the key's hashCode() (via HashPartitioner) to decide which Reducer each record is
sent to. This code makes sure the hashCode is based only on locId, not both parts of the key, so that all
records associated with a given location (that is, both the location record and its employee records) are sent
to the same Reducer; the grouping comparator shown shortly then ensures they are grouped into the same
call to reduce().
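If you prefer to demonstrate the custom Partitioner alternative mentioned on the slide, a minimal sketch (not course code, assuming the LocKey and Record classes sketched earlier) might look like the following; it would be registered in the driver with job.setPartitionerClass(LocIdPartitioner.class).

import org.apache.hadoop.mapreduce.Partitioner;

public class LocIdPartitioner extends Partitioner<LocKey, Record> {
  @Override
  public int getPartition(LocKey key, Record value, int numPartitions) {
    // Partition on the location ID alone, ignoring the isLocation flag,
    // so both record types for a location land on the same Reducer.
    return (key.locId & Integer.MAX_VALUE) % numPartitions;
  }
}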

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-21
A Better Mapper

void map(k, v) {
Record r = parse(v);
LocKey newkey = new LocKey;
newkey.locId = r.locId;

if (r.type == RecordType.emp) {
newkey.isLocation = false;
} else {
newkey.isLocation = true;
}
emit (newkey, r);
}

The # sign is not literally a part of the key, it’s just shown as a visual representation of a multi-part key.
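A hedged Java sketch of this mapper (not course code) is shown below. It assumes the Record and LocKey classes sketched earlier in these notes, and the parseRecord() helper is a hypothetical stand-in for whatever parsing the actual input format requires (here, comma-delimited text with the record type as the first field).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<LongWritable, Text, LocKey, Record> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Record r = parseRecord(value.toString());
    LocKey newKey = new LocKey();
    newKey.locId = r.locId;
    newKey.isLocation = (r.type == Record.RecType.LOC);
    context.write(newKey, r);
  }

  // Hypothetical parser: assumes comma-delimited lines whose first field
  // identifies the record type ("emp" or "loc").
  private Record parseRecord(String line) {
    String[] f = line.split(",");
    Record r = new Record();
    if (f[0].equals("emp")) {
      r.type = Record.RecType.EMP;
      r.empId = f[1];
      r.empName = f[2];
      r.locId = Integer.parseInt(f[3]);
    } else {
      r.type = Record.RecType.LOC;
      r.locId = Integer.parseInt(f[1]);
      r.locName = f[2];
    }
    return r;
  }
}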

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-22
Create a Sort Comparator…
▪ Create a sort comparator to ensure that the location record is the first one in
the list of records passed in each Reducer call

class LocKeySortComparator {
  boolean compare(k1, k2) {
    return k1.compareTo(k2);
  }
}

Note: this slide used to have real code but is now pseudocode, because writing an actual comparator is
more complicated than this: it needs to have both an object compare method like the one here, and a
bitwise compare method, which either compares the bits directly or deserializes the object for comparison.
We didn’t cover the details about how to implement comparators previously, so we don’t try here either.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-23
…And a Grouping Comparator…
▪ Create a Grouping Comparator to ensure that all records for a given location
are passed in a single call to the reduce() method

class LocKeyGroupingComparator {
  boolean compare(k1, k2) {
    return Integer.compare(k1.locId, k2.locId);
  }
}
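As with the sort comparator, a real grouping comparator could extend WritableComparator. A hedged sketch (not course code, assuming the LocKey class from earlier):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class LocKeyGroupingComparator extends WritableComparator {

  protected LocKeyGroupingComparator() {
    super(LocKey.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // Group on the location ID only, so the location record and all of its
    // employee records arrive in a single reduce() call.
    return Integer.compare(((LocKey) a).locId, ((LocKey) b).locId);
  }
}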

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-24
…And Configure Hadoop To Use It In The Driver

job.setSortComparatorClass(LocKeySortComparator.class);
job.setGroupingComparatorClass(LocKeyGroupingComparator.class);

Remember that, by default, Hadoop sends records to Reducers based on the key's hashCode (via
HashPartitioner) and groups values into reduce() calls by comparing keys. Both need attention here: the
hashCode method we showed a few slides ago is based on the location ID alone, so all records for a given
location (location and employee records alike) reach the same Reducer. The grouping comparator, like the
one we saw in the last chapter, compares only the location ID, so those records are all passed in a single
call to reduce(), having already been ordered by the sort comparator so that the location record comes first.
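A complete driver is not shown on the slides; the following is a hedged sketch (not course code) of where these two lines fit, assuming the JoinMapper, JoinReducer, LocKey, and Record classes sketched in these notes and the new (org.apache.hadoop.mapreduce) API used throughout the course.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side join");
    job.setJarByClass(JoinDriver.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(JoinMapper.class);
    job.setReducerClass(JoinReducer.class);

    // Intermediate (map output) types are the composite key and the record
    job.setMapOutputKeyClass(LocKey.class);
    job.setMapOutputValueClass(Record.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Secondary sort: location record first, then group by location ID
    job.setSortComparatorClass(LocKeySortComparator.class);
    job.setGroupingComparatorClass(LocKeyGroupingComparator.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}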

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-25
A Better Reducer

Record thisLoc;

void reduce(k, values) {


for (Record v in values) {
if (v.type == RecordType.loc) {
thisLoc = v;
} else {
v.locName = thisLoc.locName;
emit(v);
}
}
}
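A hedged Java sketch of this reducer (not course code, assuming the LocKey and Record classes from earlier and tab-separated text output) might look like the following. Copying the location name into a local String is safe even though Hadoop reuses the value object between iterations.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<LocKey, Record, Text, Text> {

  @Override
  protected void reduce(LocKey key, Iterable<Record> values, Context context)
      throws IOException, InterruptedException {
    String locationName = null;

    for (Record v : values) {
      if (v.type == Record.RecType.LOC) {
        // Thanks to the sort comparator, this arrives before any employees
        locationName = v.locName;
      } else {
        // Every employee record can be joined and emitted immediately
        context.write(new Text(v.empId),
            new Text(v.empName + "\t" + locationName));
      }
    }
  }
}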

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-26
A Better Reducer: Output with Correct Sorting and Grouping

The notes for the "…And Configure Hadoop To Use It In The Driver" slide explain how the hashCode method,
the sort comparator, and the grouping comparator combine to produce this result.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-27
Chapter Topics
Joining Data Sets in MapReduce Jobs

▪ Writing a Map-side Join


▪ Writing a Reduce-side Join
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-28
Key Points
▪ Joins are usually best done using Impala, Hive, or Pig
▪ Map-side joins are simple but don’t scale well
▪ Use reduce-side joins when both datasets are large
─ Mapper:
─ Merges both data sets into a common record type
─ Use a composite key (custom WritableComparable) with join
key/record type
─ Shuffle and sort:
─ Secondary sort so that ‘primary’ records are processed first
─ Custom Partitioner to ensure records are sent to the correct Reducer (or
hack the hashCode of the composite key)
─ Reducer:
─ Group by join key (custom grouping comparator)
─ Write out ‘secondary’ records joined with ‘primary’ record data

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-29
Integrating Hadoop into the
Enterprise Workflow
Chapter 15

Chapter Goal
Explain how Hadoop complements existing enterprise systems (RDBMSs, data warehouses, and file servers) and show how to move data in and out of HDFS using Sqoop, Flume, FuseDFS, and HttpFS.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-2
Integrating Hadoop Into The Enterprise Workflow
In this chapter, you will learn
▪ How Hadoop can be integrated into an existing enterprise
▪ How to load data from an existing RDBMS into HDFS using Sqoop
▪ How to manage real-time data such as log files using Flume
▪ How to access HDFS from legacy systems with FuseDFS and HttpFS

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-3
Chapter Topics
Integrating Hadoop into the
Enterprise Workflow

▪ Integrating Hadoop into an Existing Enterprise


▪ Loading Data into HDFS from an RDBMS Using Sqoop
▪ Hands-On Exercise: Importing Data With Sqoop
▪ Managing Real-Time Data Using Flume
▪ Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-4
Introduction
▪ Your data center already has a lot of components
─ Database servers
─ Data warehouses
─ File servers
─ Backup systems
▪ How does Hadoop fit into this ecosystem?

Additionally, your data center probably has lots of other servers (Web, mail, etc.) which are generating log
files containing data you want to analyze.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-5
RDBMS Strengths
▪ Relational Database Management Systems (RDBMSs) have many strengths
─ Ability to handle complex transactions
─ Ability to process hundreds or thousands of queries per second
─ Real-time delivery of results
─ Simple but powerful query language

These types of databases are, generally speaking, able to store, retrieve and process relatively small
amounts of data very quickly. In contrast, Hadoop is optimized for processing large amounts of data, but
doesn’t do so in real time.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-6
RDBMS Weaknesses
▪ There are some areas where RDBMSs are less ideal
─ Data schema is determined before data is ingested
─ Can make ad-hoc data collection difficult
─ Upper bound on data storage of 100s of terabytes
─ Practical upper bound on data in a single query of 10s of terabytes

With Hadoop, you don’t need to define a formal schema up front. This means you can store the data now
and worry about how to process it later.
You may also find that you cannot afford to reach the technical upper limit on how much data an RDBMS
can handle. Many commercial databases (Oracle, DB2, SQL Server, Sybase, etc.) can be quite expensive as
their licensing costs are often tied to the machine specifications (e.g. per processor rather than just per-
machine.) For large installations, it’s not unusual for complete licensing costs to reach into the millions of
dollars. Additionally, they may require (or simply work best with) specialized hardware that has expensive
reliability features.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-7
Typical RDBMS Scenario
▪ Typical scenario:
─ Interactive RDBMS serves queries from a Web site
─ Data is extracted and loaded into a DW for processing and archiving

 
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing

A data warehouse is used for analyzing large amounts of data, often to forecast trends or produce reports,
rather than real-time transaction processing. OLAP stands for Online Analytical Processing, which is what
data warehouses are for (as contrasted to OLTP, which is what RDBMSs do, as explained earlier).
Examples include Netezza, Oracle Exadata, and Teradata.
Data from the transactional RDBMS is typically denormalized into an OLAP cube, which represents a
multi-dimensional data set; for example, a cube might represent sales data like "products by region by
salesperson" (a three-dimensional data set). Denormalization refers to the process of adding
information to records in a data warehouse system in an effort to reduce joins and improve query speed.
1. User visits website.
2. Web logs containing valuable information about user behavior are discarded
3. User buys product, which writes to the database (Transactional)
4. Data is extracted/transformed/loaded into a DW
5. BI tools analyze from DW
6. Data is too big/expensive to store long term so it is archived to tape

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-8
OLAP Database Limitations
▪ All dimensions must be prematerialized
─ Re-materialization can be very time consuming
▪ Daily data load-in times can increase
─ Typically this leads to some data being discarded

“Pre-materialized” means pre-computed in order to speed execution times for anticipated queries. When
putting things into a data warehouse, you go from a highly-normalized database to something with a much
flatter structure. That is:
• RDBMS = customers table, orders table, etc.
• Data Warehouse = big list of customers’ orders
For example, if you have a people table and a television shows table and a favorite_shows table that relates
people with their favorite shows, you have to decide to create that de-normalized view in your warehouse.
If someone hasn’t decided to create the flat, big list of customers orders or peoples favorite TV shows, you
can’t query that info from a data warehouse.
Warehouse "star schemas" are highly denormalized. Imagine an ER diagram that has a 5-way intersect
table that joins 5 other denormalized tables. That image of a single table, with joins radiating out to 5
other tables, is the "star". The center, intersect table is your "Fact" table. Each of the 5 tables with all the
properties is a "Dimension" table. "All dimensions must be materialized" means "You must run all the bulk
reporting on whatever (possibly normalized) source tables to generate these big denormalized tables for
your schema." This kind of bulk operation takes time to prepare your data warehouse before you can use it
for analytic queries, and it takes more and more time as your data grows.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-9
Using Hadoop to Augment Existing Databases
▪ With Hadoop you can store and process all your data
─ The ‘Enterprise Data Hub’
▪ Reserve DW space for high value data

Hadoop doesn’t necessarily replace your OLTP or OLAP systems, it can be integrated between them. In
this example, data generated and stored in your interactive database (OLTP) can be offloaded periodically
into Hadoop for storage. That data can then be analyzed in Hadoop and the results fed back into your
relational database (i.e. the result of analysis might be product recommendations for customers). Likewise,
the data can also be exported from Hadoop and brought into your data warehouse system so you can do
the business intelligence (BI) activities you're used to. But because a growing number of BI tools support
Hadoop, you might be able to have them query Hadoop directly, thereby reducing load on your data
warehouse system.
1. User visits website
2. Web logs get Flumed into Hadoop
3. User buys product, which writes to the database (Transactional)
4. Order database records get Sqooped into Hadoop (nightly)
5. MR jobs join the purchases to the web logs to figure out what people’s tastes are (i.e., recommendation
engine)
6. MR/Hive/Pig jobs perform some ETL on the data for future load into EDW
7. Recommendations are Sqooped back to the database for real-time use
in the web app
8. Sqoop moves some summarized data to the EDW

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-10
Benefits of Hadoop
▪ Processing power scales with data storage
─ As you add more nodes for storage, you get more processing power ‘for free’
▪ Views do not need prematerialization
─ Ad-hoc full or partial dataset queries are possible
▪ Total query size can be multiple petabytes

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-11
Hadoop Tradeoffs
▪ Cannot serve interactive queries
─ The fastest MapReduce job will still take several seconds to run
─ Cloudera Impala provides near real-time ad hoc queries
▪ Less powerful updates
─ No transactions
─ No modification of existing records

Your Web application shouldn’t be getting its data from Hadoop, it will be much too slow. Even a trivial
Hadoop job will usually take at least 10 seconds to run.
“No modification of existing records” is a reference to the fact that HDFS does not support random access
writes, as explained earlier in the course.
Impala (new in the summer of 2013) provides near real-time query response: seconds instead of minutes.
However, it is still not intended for the high-volume, real-time querying required to serve as the backend for
an interactive application such as a website.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-12
Traditional High-Performance File Servers
▪ Enterprise data is often held on large fileservers, such as products from
─ NetApp
─ EMC
▪ Advantages:
─ Fast random access
─ Many concurrent clients
▪ Disadvantages
─ High cost per terabyte of storage

We’re talking about storage arrays here, or more broadly, NAS (Network Attached Storage) and SAN
(Storage Area Networks). These fileservers are meant to store data, not process it.
The cost per terabyte is probably on the order of 10 times higher with these systems than with HDFS, even
after taking into account the loss of usable storage space in HDFS caused by replicating data three times.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-13
File Servers and Hadoop
▪ Choice of destination medium depends on the expected access patterns
─ Sequentially read, append-only data: HDFS
─ Random access: file server
▪ HDFS can crunch sequential data faster
▪ Offloading data to HDFS leaves more room on file servers for ‘interactive’ data
▪ Use the right tool for the job!

The third point (about offloading) is saying that you can save money overall by moving certain bulk data (for
example, log files) from your storage array to HDFS. Since HDFS has a lower cost per terabyte of storage,
this saves money by freeing space on your storage array for things that would really benefit from being
housed there (and therefore, you can delay buying an additional storage array to house more data).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-14
Chapter Topics
Integrating Hadoop into the
Enterprise Workflow

▪ Integrating Hadoop into an Existing Enterprise


▪ Loading Data into HDFS from an RDBMS Using Sqoop
▪ Hands-On Exercise: Importing Data With Sqoop
▪ Managing Real-Time Data Using Flume
▪ Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-15
Importing Data From an RDBMS to HDFS
▪ Typical scenario: data stored in an RDBMS is needed in a MapReduce job
─ Lookup tables
─ Legacy data
▪ Possible to read directly from an RDBMS in your Mapper
─ Can lead to the equivalent of a distributed denial of service
(DDoS) attack on your RDBMS
─ In practice – don’t do it!
▪ Better idea: use Sqoop to import the data into HDFS beforehand
 

And aside from making you unpopular with the DBAs when you do this, it’s also not necessary. The Sqoop
tool lets you import data from your RDBMS into HDFS easily, as we’ll see next.
Mention example databases such as Oracle Database, MySQL, or Teradata.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-16
Sqoop: SQL to Hadoop (1)
▪ Sqoop: open source tool originally written at Cloudera
─ Now a top-level Apache Software Foundation project
▪ Imports tables from an RDBMS into HDFS
─ Just one table
─ All tables in a database
─ Just portions of a table
─ Sqoop supports a WHERE clause
▪ Uses MapReduce to actually import the data
─ ‘Throttles’ the number of Mappers to avoid DDoS scenarios
─ Uses four Mappers by default
─ Value is configurable
▪ Uses a JDBC interface
─ Should work with virtually any JDBC-compatible database

Any relational database a developer is likely to be using in a production system has a JDBC (Java Database
Connectivity; basically Java’s version of Microsoft’s ODBC) driver available and will therefore probably work
with Sqoop.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-17
Sqoop: SQL to Hadoop (2)
▪ Imports data to HDFS as delimited text files or SequenceFiles
─ Default is a comma-delimited text file
▪ Can be used for incremental data imports
─ First import retrieves all rows in a table
─ Subsequent imports retrieve just rows created since the last import
▪ Generates a class file which can encapsulate a row of the imported data
─ Useful for serializing and deserializing data in subsequent MapReduce jobs

Point out that the default comma-delimited format could be easily processed using the
KeyValueTextInputFormat discussed earlier in class.
Incremental importing is described in the Sqoop documentation (http://archive.cloudera.com/
cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports). Students sometimes ask
how Sqoop knows which records are newer than the ones it has previously imported. The brief answer is
that Sqoop can check rows for a timestamp or for an incrementing row ID (i.e. a primary key defined as
‘autoincrement’).
The third point can be misleading. Sqoop generates a binary .class file, but more importantly it also
generates the .java source file for that Java class. That class models a given row of data, so an Employees
table import will generate a Java class which represents an employee based on data in that table. This is
particularly helpful if you plan to read or write SequenceFiles for that data later.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-18
Custom Sqoop Connectors
▪ Cloudera has partnered with other organizations to create custom Sqoop
connectors
─ Use a database’s native protocols rather than JDBC
─ Provides much faster performance
▪ Current systems supported by custom connectors include:
─ Netezza
─ Teradata
─ Oracle Database (connector developed with Quest Software)
▪ Others are in development
▪ Custom connectors are not open source, but are free
─ Available from the Cloudera Web site

Although you can use JDBC to connect Sqoop to nearly any database, if you’re using a database that
has a custom Sqoop connector available, you’ll get much better performance by using it because these
connectors are highly optimized for each specific database.
When you go to the Downloads page on the Cloudera web site, the Connectors section shows connectors
for MicroStrategy and Tableau in addition to the three connectors listed on the slide. The Quest, Teradata,
and Netezza connectors enable Sqoop to use the native functionality of a DB product as described in the
slide. The MicroStrategy and Tableau connectors are different - they use similar technology to integrate
their BI products with Hadoop.
In addition to the connectors mentioned here, Microsoft makes a connector for their SQL Server database,
but this is available at Microsoft’s Web site.
NOTE: there is a “direct mode” (via the --direct option to the sqoop command) which may give better
performance than straight JDBC for databases for which no custom Sqoop connector is available. MySQL is
one such database, so you might see warnings about the direct mode being faster when you run the Sqoop
lab. See the Sqoop documentation for more information on direct mode.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-19
Sqoop: Basic Syntax
▪ Standard syntax:

sqoop tool-name [tool-options]

▪ Tools include:

import
import-all-tables
list-tables

▪ Options include:

--connect
--username
--password

The problem with specifying a password on the command line is that on a multi-user UNIX or Linux system,
anyone can use the “ps” command (with certain options, depending on the type of system used) to see all
processes running on the system, complete with all command line options. Thus, the database credentials
would be visible to others, which is clearly bad for security. A good workaround is to use the -P (capital P)
option instead of --password, as this will prompt you to type the password interactively (and thus it will not
be part of the command line).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-20
Sqoop: Example
▪ Example: import a table called employees from a database called
personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \


--connect jdbc:mysql://database.example.com/personnel \
--table employees

▪ Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \


--connect jdbc:mysql://database.example.com/personnel \
--table employees \
--where "id > 1000"

The string following the --connect option is a JDBC connection string. It is a database-specific way of stating
which database you want to connect to (this example is based on MySQL). Information on the format of the
connection string will be something their database vendor (rather than Cloudera or the Sqoop community)
will provide.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-21
Sqoop: Other Options
▪ Sqoop can take data from HDFS and insert it into an already-existing table in
an RDBMS with the command

$ sqoop export [options]

▪ For general Sqoop help:

$ sqoop help

▪ For help on a particular command:

$ sqoop help command

The “sqoop help” command just lists the available Sqoop tools (like import, export, list-tables, etc.)
The “sqoop help export” command, for example, tells you about the options available when exporting data.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-22
Chapter Topics
Integrating Hadoop into the
Enterprise Workflow

▪ Integrating Hadoop into an Existing Enterprise


▪ Loading Data into HDFS from an RDBMS Using Sqoop
▪ Hands-On Exercise: Importing Data With Sqoop
▪ Managing Real-Time Data Using Flume
▪ Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-23
Hands-On Exercise: Importing Data With Sqoop
▪ In this Hands-On Exercise, you will import data into HDFS from MySQL
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-24
Chapter Topics
Integrating Hadoop into the
Enterprise Workflow

▪ Integrating Hadoop into an Existing Enterprise


▪ Loading Data into HDFS from an RDBMS Using Sqoop
▪ Hands-On Exercise: Importing Data With Sqoop
▪ Managing Real-Time Data Using Flume
▪ Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-25
Flume: Basics
▪ Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced
─ Ideally suited to gathering logs from multiple systems and
inserting them into HDFS as they are generated
▪ Flume is Open Source
─ Initially developed by Cloudera
▪ Flume’s design goals:
─ Reliability
─ Scalability
─ Extensibility

Although Cloudera employs many Flume committers, there are also several from other companies including
Intuit and Trend Micro.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-26
Flume: High-Level Overview

On this slide, you should point out that you have various kinds of systems generating data (such as Web
servers generating log files) and the agents collect this information. There can be thousands of agents in a
Flume system. This is passed through whatever processing you define, so you might compress or encrypt
data. Ultimately, this information is collected and written out to your Hadoop cluster.
The Flume agents are separate from your Hadoop cluster (i.e. you don’t run them on your Hadoop worker
nodes).
Instructors wanting to know more about Flume are advised to watch Henry Robinson’s “Inside Flume”
presentation (http://www.slideshare.net/cloudera/inside-flume), in particular slide
#10. However, this relates to the older version of Flume and is somewhat obsoleted by architectural
changes coming in Flume NG.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-27
Flume Agent Characteristics
▪ Each Flume agent has a source, a sink and a channel
▪ Source
─ Tells the node where to receive data from
▪ Sink
─ Tells the node where to send data to
▪ Channel
─ A queue between the Source and Sink
─ Can be in-memory only or ‘Durable’
─ Durable channels will not lose data if power is lost

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-28
Flume’s Design Goals: Reliability
▪ Channels provide Flume’s reliability
▪ Memory Channel
─ Data will be lost if power is lost
▪ File Channel
─ Data stored on disk
─ Guarantees durability of data in face of a power loss
▪ Data transfer between Agents and Channels is transactional
─ A failed data transfer to a downstream agent rolls back and retries
▪ Can configure multiple Agents with the same task
─ e.g., two Agents doing the job of one “collector” – if one agent fails then
upstream agents would fail over

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-29
Flume’s Design Goals: Scalability
▪ Scalability
─ The ability to increase system performance linearly by adding more
resources to the system
─ Flume scales horizontally
─ As load increases, more machines can be added to the configuration

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-30
Flume’s Design Goals: Extensibility
▪ Extensibility
─ The ability to add new functionality to a system
▪ Flume can be extended by adding Sources and Sinks to existing storage layers
or data platforms
─ General Sources include data from files, syslog, and standard output from a
process
─ General Sinks include files on the local filesystem or HDFS
─ Developers can write their own Sources or Sinks

Reading data from Twitter streams may seem silly at first, but it’s widely used by marketers, financial
analysts and political scientists for “sentiment analysis” and to determine trending topics.
You might write your own connector to connect Flume up to some legacy system inside your company.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-31
Flume: Usage Patterns
▪ Flume is typically used to ingest log files from real-time systems such as Web
servers, firewalls and mailservers into HDFS
▪ Currently in use in many large organizations, ingesting millions of events per
day
─ At least one organization is using Flume to ingest over 200 million events per
day
▪ Flume is typically installed and configured by a system administrator
─ Check the Flume documentation if you intend to install it yourself

An “event” is a unit of data in Flume. It consists of a body (such as a line from a log) and metadata (key/
value pairs which might include things like the date, time, hostname and user ID for which that log line was
generated).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-32
Chapter Topics
Integrating Hadoop into the
Enterprise Workflow

▪ Integrating Hadoop into an Existing Enterprise


▪ Loading Data into HDFS from an RDBMS Using Sqoop
▪ Hands-On Exercise: Importing Data With Sqoop
▪ Managing Real-Time Data Using Flume
▪ Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-33
FuseDFS and HttpFS: Motivation
▪ Many applications generate data which will ultimately reside in HDFS
▪ If Flume is not an appropriate solution for ingesting the data, some other
method must be used
▪ Typically this is done as a batch process
▪ Problem: many legacy systems do not ‘understand’ HDFS
─ Difficult to write to HDFS if the application is not written in Java
─ May not have Hadoop installed on the system generating the data
▪ We need some way for these systems to access HDFS

“Many legacy systems do not ‘understand’ HDFS” – A good way of explaining this is that you cannot click
File -> Open in Excel and read in the output from your MapReduce job stored in HDFS.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-34
FuseDFS
▪ FuseDFS is based on FUSE (Filesystem in USEr space)
▪ Allows you to mount HDFS as a ‘regular’ filesystem
▪ Note: HDFS limitations still exist!
─ Not intended as a general-purpose filesystem
─ Files are write-once
─ Not optimized for low latency
▪ FuseDFS is included as part of the Hadoop distribution

FUSE is available for Linux, Mac OS X, NetBSD and OpenSolaris operating systems.
FUSE is not specific to Hadoop. There are many other interesting FUSE filesystems available, including
one that lets you “mount” a remote FTP server so you can access it like a local filesystem and another
which lets you browse a ZIP file like a local filesystem. There is a long list of FUSE filesystems (http://
sourceforge.net/apps/mediawiki/fuse/index.php?title=FileSystems).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-35
HttpFS
▪ Provides an HTTP/HTTPS REST interface to HDFS
─ Supports both reads and writes from/to HDFS
─ Can be accessed from within a program
─ Can be used via command-line tools such as curl or wget
▪ Client accesses the HttpFS server
─ HttpFS server then accesses HDFS
▪ Example:
curl -i -L http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt?op=OPEN
returns the contents of the HDFS /user/foo/README.txt file
REST: REpresentational State Transfer

The ability to use command-line tools like curl or wget makes it easy to access HDFS files from UNIX shell
scripts.
Hoop has been renamed HttpFS. It is available in CDH4 and has been backported to CDH3.
For more information, refer to http://www.cloudera.com/blog/2012/08/httpfs-for-
cdh3-the-hadoop-filesystem-over-http.
The following blog entry about Hoop might still be useful to help you prepare to teach this slide: http://
www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http.
You might also get questions about WebHDFS when you teach this slide. WebHDFS is a Hadoop file system
that provides secure read/write access to HDFS over HTTP using a REST interface. HttpFS uses the same
REST interface that WebHDFS uses. However, when accessing HDFS using WebHDFS, you directly access
machines in the cluster; with HttpFS, you use a proxy server.
Note that HttpFS is a Cloudera initiative; WebHDFS is a Hortonworks initiative.
WebHDFS REST API: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html
From the curl man page: curl is a tool to transfer data from or to a server, using one of the supported
protocols (HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, DICT, TELNET, LDAP or FILE). The command is designed
to work without user interaction.
From wget man page: a free utility for non-interactive download of files from the Web. It supports HTTP,
HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-36
Chapter Topics
Integrating Hadoop into the
Enterprise Workflow

▪ Integrating Hadoop into an Existing Enterprise


▪ Loading Data into HDFS from an RDBMS Using Sqoop
▪ Hands-On Exercise: Importing Data With Sqoop
▪ Managing Real-Time Data Using Flume
▪ Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-37
Key Points
▪ Hadoop augments data center components such as databases and data
warehouses
▪ Sqoop is a tool to load data from a database into HDFS
▪ Flume is a tool for managing real-time data
─ e.g. importing data from log files into HDFS
▪ FuseDFS and HttpFS provide access to HDFS from legacy systems

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-38
An Introduction to Hive, Impala,
and Pig
Chapter 16

Chapter Goal
Introduce Hive, Impala, and Pig as higher-level abstractions over MapReduce, and explain how to choose among them for a given task.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-2
An Introduction to Hive, Impala, and Pig
In this chapter, you will learn
▪ What features Hive provides
▪ How Impala compares to Hive
▪ How a typical Pig script works
▪ How to choose between Impala, Hive, and Pig

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-3
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-4
Hive and Pig: Motivation (1)
▪ MapReduce code is typically written in Java
─ Although it can be written in other languages using Hadoop Streaming
▪ Requires:
─ A programmer
─ Who is a good Java programmer
─ Who understands how to think in terms of MapReduce
─ Who understands the problem they’re trying to solve
─ Who has enough time to write and test the code
─ Who will be available to maintain and update the code in the future as
requirements change

We briefly covered both Hive and Pig earlier in this course, but we'll cover them a bit more now. However,
Cloudera also offers a two-day course that goes into depth on both Hive and Pig (mention dates and
locations of upcoming Hive/Pig classes, including the next date it will be offered in the current location, if
any).
Hadoop Streaming is convenient for certain kinds of analysis, but it has limitations (such as performance) as
discussed earlier.
The type of programmer described here is hard to find. Even if you find one, they are likely to be in demand
and hard to retain (hence the last point, they may have time to initially write the MapReduce code but too
busy with other things to maintain it).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-5
Hive and Pig: Motivation (2)
▪ Many organizations have only a few developers who can write good
MapReduce code
▪ Meanwhile, many other people want to analyze data
─ Business analysts
─ Data scientists
─ Statisticians
─ Data analysts
▪ We need a higher-level abstraction on top of MapReduce
─ Providing the ability to query the data without needing to know MapReduce
intimately
─ Hive, Pig, and Impala address these needs

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-6
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-7
Hive: Introduction
▪ Apache Hive is a high-level abstraction on top of MapReduce
─ Uses an SQL-like language called HiveQL
─ Generates MapReduce jobs that run on the Hadoop cluster
─ Originally developed by Facebook for data warehousing
─ Now an open-source Apache project

[Slide 9-5 copied from DA 201306]


This slide attempts to ease the transition from Pig to Hive. We want to emphasize that Hive and Pig share
similar high-level goals (i.e. making analysis of data stored in Hadoop easier and more productive than
by writing MapReduce code), but have somewhat different approaches for achieving them (one notable
difference is that HiveQL is declarative and generally expressed as a single operation, while Pig Latin is
procedural and is expressed as a series of distinct processing steps). Both were originally developed as
internal projects at two different companies (Pig came from Yahoo while Hive came from Facebook). The
fact that Hive’s interpreter runs on the client machine, generates MapReduce jobs, and then submits them
to a Hadoop cluster for execution means that Hive’s high-level architecture is fairly similar to Pig’s. We will
explore this in a bit more detail later in the chapter.
The HiveQL example shown here joins the customers and orders table in order to calculate the total cost
of all orders from customers in each ZIP (postal) code, where that ZIP code begins with “63” and then sorts
them in descending order of cost. The syntax should be very familiar to anyone who knows SQL (but note
that while it’s a perfectly legal query, this example won’t work as shown on our VM simply because our
orders table doesn’t actually have a cost column).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-8
High-Level Overview for Hive Users
▪ Hive runs on the client machine
─ Turns HiveQL queries into MapReduce jobs
─ Submits those jobs to the cluster
 

[Slide 9-6 copied from DA 201306]


The high-level architecture for Hive, at least in the level of detail appropriate for this audience (i.e.
end users rather than developers or system administrators), is quite similar to that of Pig. Your HiveQL
statements are interpreted by Hive. Hive then produces one or more MapReduce jobs, and then submits
them for execution on the Hadoop cluster. A more detailed illustration of the Hive architecture (more
appropriate for a technical audience) is shown in PH1e p. 7 or TDG p. 420). There isn’t a lot of detailed
documentation on Hive’s architecture, nor is an in-depth discussion of such details relevant for our
target audience. But if you’d like to know more as an instructor, you might read the slides from this 2011
presentation http://www.slideshare.net/recruitcojp/internal-hive#btnNext or
this one from 2009 http://www.slideshare.net/nzhang/hive-anatomy. The Hive wiki has
a developer guide https://cwiki.apache.org/Hive/developerguide.html, but this page
hasn’t been updated since 2011 and may be outdated.
Students (at least those with software development experience) often ask if they can see the code that Hive
created. As with Pig, Hive does not work by translating HiveQL to Java MapReduce code in order to compile
and submit this for execution on the cluster. Instead, it interprets the HiveQL code and creates an execution
plan that ultimately runs some set of built-in MapReduce jobs. While it’s possible to see the execution plan
(via the EXPLAIN keyword, to be discussed later), there’s really nothing else to see.
If people ask about how Hive compiles its SQL statements and converts them into MR jobs, point them
towards the Hive Developer Guide:
https://cwiki.apache.org/Hive/developerguide.html
and
https://cwiki.apache.org/Hive/developerguide.html#DeveloperGuide-
QueryProcessor

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-9
How Hive Loads and Stores Data
▪ Hive queries operate on tables, just like in an RDBMS
─ A table is simply an HDFS directory containing one or more files
─ Hive supports many formats for data storage and retrieval
▪ How does Hive know the structure and location of tables?
─ These are specified when tables are created
─ This metadata is stored in Hive’s metastore
─ Contained in an RDBMS such as MySQL

[Slide 9-10 copied from DA 201306]


HiveQL queries don’t specify the data path, format, or column order (i.e. equivalent to Pig’s LOAD
statement). In Hive, as in a relational database management system (RDBMS), this information is provided
when you create the tables. We will cover the technique for creating Hive tables in the next chapter; in
this chapter (and associated lab) we’ll simply work with tables that have been created ahead of time. The
data for a table named customers, by default, will be /user/hive/warehouse/customers (the
/user/hive/warehouse/ path is known as Hive’s warehouse directory). The data can be delimited
textfiles or one of many other formats we’ll discuss later. Note that this is a distinction from Pig, as it
supports LOAD of individual files, while Hive is less granular because it loads all data in the directory.
Hive maintains table metadata via the metastore, a service which, by default, is backed by a small Apache
Derby http://db.apache.org/derby/ embedded database. This makes it easy to get started
with Hive because it’s not necessary to set up a full-featured multi-user database, but this approach
doesn’t scale well beyond a single user machine. The metastore in production deployments typically
use MySQL, though Oracle is also supported in CDH 4. The details of metastore setup is covered in our
Admin course, but if you’re interested in the basics as an instructor, you can read more here: http://
www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-
Installation-Guide/cdh4ig_topic_18_4.html and here https://cwiki.apache.org/
Hive/adminmanual-metastoreadmin.html.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-10
How Hive Loads and Stores Data (cont’d)
▪ Hive consults the metastore to determine data format and location
─ The query itself operates on data stored on a filesystem (typically HDFS)
 

This depicts the basic data operations that occur when you query data in Hive and is meant to reinforce
the concept presented in the previous slide that only the metadata is stored in a relational database – the
actual data itself (i.e. what is being analyzed or processed by the queries specified the user, and from which
results are ultimately produced) comes from data stored in a filesystem (typically HDFS, but possibly local or
a remote filesystem supported by Hadoop such as S3).
Although Hive shares many superficial similarities with a relational database (syntax, concept of tables,
etc.), this is a good point to emphasize one important distinction between Hive and RDBMS: the manner in
which schema is applied. In an RDBMS, you create the table with rigid structure (e.g. a last name column
might be predefined to hold a maximum of 15 characters) that must be specified before any data is added
to the table. Conversely, with Hadoop you can store the data in HDFS without knowing its format at all.
You can examine it after you’ve already stored the data, determine the best schema, and then use that
information when you create the Hive table. In other words, Hadoop and Hive require you to know the
format of the data only when you need to analyze it (called “schema on read”) rather than when you need
to store it (called “schema on write”) as with an RDBMS. This provides far more flexibility, although the
side effect is that conflicts between the expected and actual data formats won’t be detected at the time
records are added as with an RDBMS; they’ll be detected at the time the query is executed. We’ll further
cover table creation in the next chapter.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-11
Hive Data: Physical Layout
▪ Hive data is stored in Hive’s warehouse directory in HDFS
─ Default path: /user/hive/warehouse/<table_name>
▪ Tables represent subdirectories of the warehouse directory
▪ Possible to create external tables if the data is already in HDFS and should not
be moved from its current location
▪ Actual data is stored in flat files
─ Control character-delimited text, or SequenceFiles
─ Can be in arbitrary format with the use of a custom Serializer/Deserializer
(‘SerDe’)
─ All data in a directory is considered to be part of the table data

Data is stored in flat files in which fields are delimited (by default) by control-A characters (and individual
items in complex types like arrays, structs and maps are delimited by control-B or control-C characters; see
TDG 3e pages 435-436 (TDG 2e, 387) for details).
SerDe is pronounced “SURR-dee” (rhymes somewhat with ‘dirty’). A table describing several available Hive
SerDes can be found in TDG 3e on page 437 (TDG 2e, 389).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-12
Using the Hive Shell
▪ You can execute HiveQL statements in the Hive Shell
─ This interactive tool is similar to the MySQL shell
▪ Run the hive command to start the shell
─ Each statement must be terminated with a semicolon
─ Use the quit command to exit the shell

$ hive

hive> SELECT cust_id, fname, lname


FROM customers WHERE zipcode=20525;

1000000 Quentin Shepard


1000001 Brandon Louis
1000002 Marilyn Ham

hive> quit;

The need to terminate statements with a semicolon is familiar to those who’ve used a database shell (or
Grunt), and as in either you must hit the Enter key after typing a statement in order to execute it (i.e. simply
adding a semicolon at the end isn’t enough since statements may span lines). This example shows a session
in which we start the Hive shell from the UNIX command line, run a query, see the results displayed as tab-
separated columns on the terminal’s standard output, then quit Hive and return to the UNIX shell. In reality,
this would also usually display several log messages (depending on configuration), but I have omitted them
here for brevity. The -S option, discussed on the next slide, can be used to suppress these messages. You
can also use the “SOURCE path” command from within Hive shell to execute HiveQL statements in the
file referenced by ‘path’.
Hive’s configuration is stored in XML files, typically in /etc/hive/conf, but you can specify an alternate
configuration directory via the HIVE_CONF_DIR environment variable. Those settings apply to everyone
who uses Hive on that machine (i.e. not much of a concern if you run Hive on your laptop, since you’re likely
the only user, but be careful if changing settings on server used by others). This is more an area for system
administrators, so we don’t cover configuration in depth in this class. We will, however, discuss how to set a
few important properties via a per-user configuration file ($HOME/.hiverc) in a moment. See TDG3e pp.
417-419 or PH1e pp. 24-34 for details on configuration.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-13
Accessing Hive from the Command Line
▪ You can also execute a file containing HiveQL code using the -f option

$ hive -f myquery.hql

▪ Or use HiveQL directly from the command line using the -e option

$ hive -e 'SELECT * FROM customers'

▪ Use the -S (silent) option to suppress informational messages


─ Can also be used with -e or -f options

$ hive -S

The file containing HiveQL is simply a text file and it’s often referred to as a “script” (just as with Pig). It’s
customary to use the .hql (HiveQL) file extension, but .q (query) is also common. We will use the former
in this class. If you’re only executing a single statement with the -e option, it’s not necessary to terminate
it with a semicolon.
Hive frequently displays informational messages in normal usage; for example, when starting Hive, you
might see information about the configuration files it has loaded or where it will store logs for the current
session (depending on how Log4J is configured), and you’ll see MapReduce status messages while your
query runs. Using the -S option (note that this is a capital S, and it is case-sensitive) enables silent mode, which
suppresses all non-essential output. This is very handy when you want to run a one-off query and collect
the results to a local file:
$ hive -e 'SELECT DISTINCT email FROM users' > emails.txt

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-14
Hive Data Types
▪ Primitive types:
─ TINYINT
─ SMALLINT
─ INT
─ BIGINT
─ FLOAT
─ BOOLEAN
─ DOUBLE
─ STRING
─ BINARY (available starting in CDH4)
─ TIMESTAMP (available starting in CDH4)
▪ Type constructors:
─ ARRAY < primitive-type >
─ MAP < primitive-type, data-type >
─ STRUCT < col-name : data-type, ... >

A complete list of data types and descriptions can be found in TDG 3e, pages 426-428 (TDG 2e, 378-380).
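To illustrate how the type constructors are combined in a table definition, here is a hedged sketch; the table and column names are invented for this example rather than taken from the course dataset:

CREATE TABLE employees
  (name          STRING,
   salary        FLOAT,
   subordinates  ARRAY<STRING>,
   deductions    MAP<STRING, FLOAT>,
   address       STRUCT<street:STRING, city:STRING, zip:STRING>);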

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-15
Hive Basics: Creating Tables

SHOW TABLES;

CREATE TABLE customers
  (cust_id INT, fname STRING, lname STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

DESCRIBE customers;

These commands should be familiar to anyone who has RDBMS experience, particularly with MySQL. As with
MySQL, the “DESCRIBE” command also supports the “DESC” abbreviation.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-16
Loading Data Into Hive
▪ Data is loaded into Hive with the LOAD DATA INPATH statement
─ Assumes that the data is already in HDFS

LOAD DATA INPATH "cust_data"
INTO TABLE customers;

▪ If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
─ Automatically loads it into HDFS in the correct directory

Loading the data assumes not only that the data to be loaded exists in HDFS, but also that the table into
which it will be loaded has been created previously. Adding the “LOCAL” keyword to the load statement
tells Hive to load from the local filesystem (e.g. the ext3 filesystem on your Linux box), therefore saving you
the step of moving this data into HDFS first.
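For comparison, a sketch of the LOCAL variant described above (the local path is hypothetical):

LOAD DATA LOCAL INPATH '/home/training/cust_data'
INTO TABLE customers;

Note that with LOCAL the file is copied from the local filesystem into the table’s directory in HDFS, whereas the plain LOAD DATA INPATH form moves the file from its existing HDFS location.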

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-17
Using Sqoop to Import Data into Hive Tables
▪ The Sqoop option --hive-import will automatically create a Hive table
from the imported data
─ Imports the data
─ Generates the Hive CREATE TABLE statement based on the table
definition in the RDBMS
─ Runs the statement
─ Note: This will move the imported table into Hive’s warehouse directory
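A hedged example of what such an import command might look like; the JDBC connection string, credentials, and table name below are placeholders rather than values used in the exercises:

$ sqoop import \
    --connect jdbc:mysql://dbhost/example_db \
    --username dbuser --password dbpasswd \
    --table customers \
    --hive-import

After the import completes, the new table can be queried from the Hive shell like any other Hive table.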

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-18
Basic SELECT Queries
▪ Hive supports most familiar SELECT syntax

SELECT * FROM customers LIMIT 10;

SELECT lname,fname FROM customers
WHERE zipcode LIKE '63%'
ORDER BY lname DESC;

The first query selects the first ten records from the customer table.
The second query is similar, but demonstrates we can filter the results (with a WHERE clause just like SQL)
and also specify ordering.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-19
Joining Tables
▪ Joining datasets is a complex operation in standard Java MapReduce
─ We saw this earlier in the course
▪ In Hive, it’s easy!

SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders
ON customers.cust_id = orders.cust_id
GROUP BY zipcode;

Here we’re doing a simple join on two tables.


Hive is an excellent choice when you need to join data like this (as is Pig). Doing the equivalent by writing
MapReduce is certainly possible, but far more difficult and time consuming.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-20
Storing Output Results
▪ The SELECT statement on the previous slide would write the data to the
console
▪ To store the results in HDFS, create a new table then write, for example:

INSERT OVERWRITE TABLE regiontotals
SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders
ON customers.cust_id = orders.cust_id
GROUP BY zipcode
ORDER BY zipcode;

─ Results are stored in the table


─ Results are just files within the regiontotals directory
─ Data can be used in subsequent queries, or in MapReduce jobs

The last point on the slide is important. Not only can you use the output from one query as the input
to a new query, but you can have Hive (or Pig) fit in nicely to a multi-part workflow. For example, your
MapReduce jobs can produce data that you import and analyze in Hive. Likewise, you can export data from
Hive queries for subsequent processing in your MapReduce jobs.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-21
Using User-Defined Code
▪ Hive supports manipulation of data via User-Defined Functions (UDFs)
─ Written in Java
▪ Also supports user-created scripts written in any language via the
TRANSFORM operator
─ Essentially leverages Hadoop Streaming

User-defined functions are commonly called “UDFs.” A typical example is translating a UNIX timestamp (the
number of seconds elapsed since January 1, 1970; easy for computers to store) into a weekday value (which is
more human-readable). Such a transformation could be written as a UDF in Java, or implemented as an external
script (in Python, for example) invoked via the TRANSFORM operator.
See TDG 3e pages 451-458 (TDG 2e, 402-409) for more information on UDFs. A list of built-in operators
and functions for Hive can be found in the Hive Wiki (https://cwiki.apache.org/confluence/
display/Hive/LanguageManual+UDF)
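A minimal sketch of the TRANSFORM approach (the script name, columns, and table are hypothetical); the script reads tab-separated fields on standard input and writes tab-separated fields on standard output, just as a Hadoop Streaming mapper would:

ADD FILE /home/training/to_weekday.py;

SELECT TRANSFORM(cust_id, order_date)
       USING 'python to_weekday.py'
       AS (cust_id, weekday)
FROM orders;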

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-22
Hive Limitations
▪ Not all ‘standard’ SQL is supported
─ Subqueries are only supported in the FROM clause
─ No correlated subqueries
▪ No support for UPDATE or DELETE
▪ No support for INSERTing single rows

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-23
Hive: Where To Learn More
▪ Main Web site is at http://hive.apache.org/
▪ Cloudera training course: Cloudera Training for Data Analysts: Using Pig, Hive,
and Impala with Hadoop

Mention locations and dates for upcoming Hive/Pig classes.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-24
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-25
Hands-On Exercise: Manipulating Data With Hive
▪ In this Hands-On Exercise, you will manipulate a dataset using Hive
▪ Please refer to the Hands-On Exercise Manual

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-26
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-27
Impala Overview
▪ High-performance SQL engine for vast amounts of data
─ Similar query language to HiveQL
─ 10 to 50+ times faster than Hive, Pig, or MapReduce
▪ Developed by Cloudera
─ 100% open source, released under the Apache software
license

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-28
Impala Overview
▪ Impala runs on Hadoop
clusters
─ Data stored in HDFS
─ Does not use
MapReduce
─ Uses the same
Metastore as Hive

For more information on Impala, look at http://blog.cloudera.com/blog/2012/10/


cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
It *does not* use MapReduce. The impalad daemons run on each slave node. You access it from a
command-line tool on a client machine. It leverages Hive’s metadata: you create Hive tables, and then
query those tables using Impala.
Impala was announced at the Strata + Hadoop World conference in New York City on October 24, 2012,
after which the beta version that had been tested by many of Cloudera’s customers during the previous months
became available to the general public. Several additional beta versions followed until the GA (General
Availability; i.e. 1.0 production version) was released on May 1, 2013.
“Inspired by Google’s Dremel database” – Dremel is a distributed system for interactive ad-hoc queries
that was created by Google. Although it’s not open source, the Google team described it in a published
paper http://research.google.com/pubs/archive/36632.pdf. Impala is even more
ambitious than Dremel in some ways; for example, the published description of Dremel says that joins
are not implemented at all, while Impala supports the same inner, outer, and semi-joins that Hive does.
Impala development is led by Marcel Kornacker, who joined Cloudera to work on Impala in 2010 after
serving as tech lead for the distributed query engine component of Google’s F1 database http://
tiny.cloudera.com/dac15b.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-29
How is Impala so Fast?
▪ MapReduce is not optimized for interactive queries
─ High latency – even trivial queries can take 10 seconds or more
▪ Impala does not use MapReduce
─ Uses a custom execution engine built specifically for Impala
─ Queries can complete in a fraction of a second

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-30
Using the Impala Shell
▪ The Impala shell is very similar to the Hive shell

$ impala-shell

> SELECT cust_id, fname, lname FROM customers
  WHERE zipcode='20525';

+---------+--------+-----------+
| cust_id | fname  | lname     |
+---------+--------+-----------+
| 1133567 | Steven | Robertson |
| 1171826 | Robert | Gillis    |
+---------+--------+-----------+

> exit;

 
 
 
 
 
Note: shell prompt abbreviated as >

Impala’s shell is similar to Hive’s shell (or Grunt, Pig’s shell). However, one difference you’ll find obvious
after using Impala’s shell for a few minutes is that line editing works very well (unlike Hive’s shell, which
gets confused when you try to edit a previous command that spans multiple lines).
Beta versions of Impala didn’t require you to terminate commands with a semicolon, but Impala 1.0 and
later versions require this just like Hive shell (or Grunt).
You can also use quit to terminate the shell (as in Hive’s shell or Grunt), but exit is the preferred
command (and the one shown in the documentation).
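As with hive -e, a single query can also be run non-interactively using the -q option; a hedged example (the query and table name are placeholders):

$ impala-shell -q 'SELECT COUNT(*) FROM customers'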

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-31
Impala Limitations
▪ Impala does not currently support some features in Hive, e.g.
─ Complex data types (ARRAY, MAP, or STRUCT)
─ Support for BINARY data type
─ Custom file and row format support (SerDes)
─ SQL-style authorization (privileges and roles)
─ These limitations will all be addressed in future versions of Impala

“Many of these are being considered for future releases” – based on public statements by Cloudera
engineers (e.g. in presentations, on mailing lists, in blog entries, product documentation, etc.). However,
there is generally no commitment for a specific timeline by which these features will be implemented. The
list of features that follow are unsupported in Impala, at least as of the 1.0.1 release. Anything on this list is
a possibility for future inclusion in Impala, though the ones on the “Post-GA Top Asks” section of this blog
http://blog.cloudera.com/blog/2012/12/whats-next-for-cloudera-impala/ are
probably the most likely to be implemented in the near term.
• “No support yet for array, map, or struct” – nor is there support for UNION, but we never covered this
type.
• “external transformations” means there is no equivalent to Hive’s TRANSFORM … USING clause that
allows you process data using external scripts.
A more complete list of unsupported features can be found here http://www.cloudera.com/
content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-
Using-Impala/ciiu_langref_unsupported.html and here http://www.cloudera.com/
content/cloudera-content/cloudera-docs/Impala/latest/Cloudera-Impala-
Release-Notes/cirn_known_issues.html.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-32
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-33
Apache Pig: Introduction
▪ Apache Pig is a platform for data analysis and processing on Hadoop
─ It offers an alternative to writing MapReduce code directly
▪ Originally developed as a research project at Yahoo
─ Goals: flexibility, productivity, and maintainability
─ Now an open-source Apache project

[Slide 3-6 copied from DA 201306]


Pig and Hive basically have the same goals, but represent two different ways of achieving them (as they
were created by different groups at different companies).
Pig Latin is to Pig what HiveQL is to Hive.
Instructors who want a brief overview of Pig should watch this 15-minute video (http://
www.cloudera.com/resource/introduction-to-apache-pig/). Chapter 11 of TDG 3e (and
TDG 2e) describes Pig in great detail, although unlike with Hive there is an entire book available about Pig
(“Programming Pig” by Alan Gates, published by O’Reilly).
PP1e page 10 gives a history of Pig. The original paper from Yahoo Research (Pig Latin: A Not-So-Foreign
Language for Data Processing, presented at the SIGMOD ‘08 Conference) gives even more background on
the early history and design of Pig. A key quote about the Yahoo Research team’s motivation for creating it
is that they felt MapReduce is “too low-level and rigid, and leads to a great deal of custom user code that is
hard to maintain and reuse.” They also wanted to create a data flow language, which would “fit in a sweet
spot between the declarative style of SQL, and the low-level, procedural style of MapReduce.” As described
on PP1e p.10, Pig is not an acronym and the name was chosen without much thought simply because the
researchers at Yahoo wanted to give a name to what they’d previously just called “the language.”
Yahoo contributed Pig to the Apache project (via the Incubator) in 2007, with the first official release
following about one year later. Pig graduated to become a Hadoop subproject in 2008 and became a
top-level Apache project in 2009. A 2011 presentation http://www.slideshare.net/ydn/
ahis2011-platform-pigmaking-hadoop-easy from Alan Gates (a Pig committer and author of
Programming Pig) stated that “70% of production grid [cluster] jobs at Yahoo” use Pig.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-34
The Anatomy of Pig
▪ Main components of Pig
─ The data flow language (Pig Latin)
─ The interactive shell where you can type Pig Latin statements (Grunt)
─ The Pig interpreter and execution engine
 

[Slide 3-7 copied from DA 201306]


Yes, the title and first bullet point is a bit of intentional porcine humor. If your students laugh, it’s an
indication that they’re paying attention (and are easily amused). These are the core components that
students needs to be aware of at this point:
• The language itself (example shown in previous chapter) is called Pig Latin, but the entire system is
called Pig.
• Pig’s shell (i.e. the main way you use Pig interactively) is called Grunt. Note that you can also execute
Pig Latin in other ways, such as batch, as we’ll see later.
• As explained in the previous chapter, Pig interprets the Pig Latin statements and turns them into
MapReduce jobs, which it then submits to the cluster for execution.
Pig is a client-side application, which means that Pig is running on the machine that the user uses, not
necessarily a machine within the cluster (e.g. it might be that user’s laptop), although the machine running
Pig must have access to the cluster so that it can submit MapReduce jobs to it. It’s not necessary to install
it on all the nodes in your cluster (well, unless it’s possible that people might log into any node to run the
“pig” command).
But note, as shown on the next slide, that this is a course on how to use these tools, not on how to install
them (that’s covered in the Admin course).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-35
Using the Grunt Shell to Run Pig Latin
▪ You can use Pig interactively, via the Grunt shell
─ Pig interprets each Pig Latin statement as you type it
─ Execution is delayed until output is required
─ Very useful for ad hoc data inspection
▪ Starting Grunt

$ pig
grunt>

▪ Useful commands:

$ pig -help (or -h)


$ pig -version (-i)
$ pig -execute (-e)
$ pig script.pig

Invoking pig in a terminal will start the grunt shell. This is an interactive shell that takes commands from
the user. Other useful commands from the terminal are:
$ pig -help (or for short, $ pig -h)
$ pig -version
$ pig -execute
The last one takes a command or commands to execute. They should be in quotes like this:
$ pig -e "fs -ls"
Pig can also execute a script containing pig commands:
$ pig script.pig

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-36
Pig Features
▪ Pig supports many features which allow developers to perform sophisticated
data analysis without having to write Java MapReduce code
─ Joining datasets
─ Grouping data
─ Referring to elements by position rather than name
─ Useful for datasets with many elements
─ Loading non-delimited data using a custom load function
─ Creation of user-defined functions, written in Java
─ And more

These features generally apply to Hive as well.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-37
Key Data Concepts in Pig
▪ Relational databases have tables, rows, columns, and fields
▪ Example: Sales data

name price country

Alice 2999 us

Bob 3625 ca

Carlos 2764 mx

Dieter 1749 de

Étienne 2368 fr

Fredo 5637 it

Étienne (the French equivalent of Stephen) is pronounced like “eh-TYEN” and was chosen to gently
illustrate that the data can contain special characters like accents. The country codes http://
en.wikipedia.org/wiki/ISO_3166-1 correspond to the United States, Canada, Mexico, Germany,
France, and Italy. This column was added to support the distinction between “all values in a row” versus
just some values in a row.
The column headers (name, price, country) wouldn’t really be in the data file we loaded; they’re just shown
here for readability. The upcoming series of slides will relate key terms like tuple, and bag to the data
shown here.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-38
Pig Data Concepts: Fields
▪ A single element of data is called a field

name price country

Alice 2999 us

Bob 3625 ca

Carlos 2764 mx

Dieter 1749 de

Étienne 2368 fr

Fredo 5637 it

I’ve highlighted a few examples of fields here, one in each column, selected arbitrarily. Had I highlighted
all of them, it would have been harder to see the pattern (i.e. that any intersection of a row and column
shown here is a field). A field may or may not have data (in the latter case, the value is NULL).
It’s perhaps noteworthy that PP1e uses the terms ‘scalar’ when referring to a single value, while the original
Pig Latin paper mostly uses “atomic value” or “atom” (the latter is defined in section 3.1 “Data Model”). The
Pig Web site uses the term ‘field’ and since that’s what is likely familiar to the audience, that’s what I will
use here too.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-39
Pig Data Concepts: Tuples
▪ A collection of values is called a tuple
─ Fields within a tuple are ordered, but need not all be of the same type

name price country

Alice 2999 us

Bob 3625 ca

Carlos 2764 mx

Dieter 1749 de

Étienne 2368 fr

Fredo 5637 it

Notice that a tuple is a collection of values, rather than a collection of atoms (scalar values). This subtle
distinction will become important in the next chapter when we discuss nested data structures (i.e. tuples
can contain atoms, but also complex types like maps or other tuples). A tuple is analogous to a row (or
partial row) in SQL.
The pronunciation of “tuple” isn’t universally accepted. The word shares its origins with similar words like
quintuplet and sextuplet, and most of the Cloudera instructors pronounce it accordingly (i.e. like TUPP-leht,
which almost rhymes with the word “puppet”). However, consider the pronunciation of another similar
word, quadruplet, implying a pronunciation like “TOO-pull.” Which way a person pronounces the word
seems to vary by academic background (e.g. whether they studied mathematics, engineering, computer
science, or music) and region. Regardless of how you pronounce it, you’re bound to find people who agree
with you and others who don’t.
I arbitrarily selected the row with Bob as the tuple, but any of the rows shown here is a tuple. The concept
of a tuple should be readily familiar to Python programmers, because this is a core data structure in that
language.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-40
Pig Data Concepts: Bags
▪ A collection of tuples is called a bag
▪ Tuples within a bag are unordered by default
─ The field count and types may vary between tuples in a bag

name price country

Alice 2999 us

Bob 3625 ca

Carlos 2764 mx

Dieter 1749 de

Étienne 2368 fr

Fredo 5637 it

A bag is analogous to a table or resultset (or portion thereof) in SQL. While I highlighted all the tuples here,
any collection of tuples (e.g. just rows 1-3) could also be a bag. The bag we’re looking at here is an “outer
bag” but as we’ll see later, a field could also contain a bag (in which case that bag would be an “inner bag”).
“Tuples within a bag are unordered” – while the columns always appear in the same order (i.e. name is
always position 0, price is position 1, etc.), the order of the tuples won’t necessarily remain the same.
If running some code produced this bag, then running the same code again might put Dieter as the first
item and Alice as the last. One exception is when we explicitly set the ordering using ORDER BY as will be
discussed later. As explained in PP1e, “A bag is an unordered collection of tuples. Since it has no order, it is
not possible to reference tuples in a bag by position.” However, we haven’t yet covered how to reference
items in complex data structures, so we’re simply setting the groundwork here for when we do.
“The field count and types may vary between tuples in a bag” – as explained in the Pig documentation
http://pig.apache.org/docs/r0.10.0/basic.html#relations, “relations [i.e. named
bags, as explained on the next slide] don’t require that every tuple contain the same number of fields or
that the fields in the same position (column) have the same type.” This is a major deviation from SQL,
where each row must have the same number of fields, and all fields at a given position must have the same
data type.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-41
Pig Latin Scripts
▪ Pig Latin is a data flow language
─ The flow of data is expressed as a sequence of statements
▪ Typically, a Pig Latin script starts by loading one or more datasets into bags,
and then creates new bags by modifying those it already has

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-42
Example Pig Latin Script (1)
▪ Example: a script to read in sales data (sales rep name, sale total, country)
from a file and find all sales over $999, with highest order first

HDFS file: sales             HDFS file: topsales

Alice   2999   us            Bob     7001   fr
Bob     3625   uk            Bob     3625   uk
Carlos  2764   mx            Alice   2999   us
Alice   355    ca            Carlos  2764   mx
Carlos  998    mx
Bob     7001   fr

By default, Pig expects input from tab-delimited HDFS files. This can be changed though.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-43
Example Pig Latin Script (2)
 
 
 

allsales = LOAD 'sales' AS (name, price, country); 1

bigsales = FILTER allsales BY price > 999;

sortedbigsales = ORDER bigsales BY price DESC;

STORE sortedbigsales INTO 'topsales';

1 Load the data from the file into a bag.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-44
Example Pig Latin Script (3)
 
 
 

allsales = LOAD 'sales' AS (name, price, country);

bigsales = FILTER allsales BY price > 999; 1

sortedbigsales = ORDER bigsales BY price DESC;

STORE sortedbigsales INTO 'topsales';

1 Create a new bag with sales over 999.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-45
Example Pig Latin Script (4)
 
 
 

allsales = LOAD 'sales' AS (name, price, country);

bigsales = FILTER allsales BY price > 999;

sortedbigsales = ORDER bigsales BY price DESC; 1

STORE sortedbigsales INTO 'topsales';

1 Create a new bag with filtered data sorted by price (highest first).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-46
Example Pig Latin Script (5)
 
 
 

allsales = LOAD 'sales' AS (name, price, country);

bigsales = FILTER allsales BY price > 999;

sortedbigsales = ORDER bigsales BY price DESC;

STORE sortedbigsales INTO 'topsales'; 1

1 Output sorted data into a new directory.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-47
Pig: Where To Learn More
▪ Main Web site: pig.apache.org
▪ To locate the Pig documentation:
─ For CDH4.5, select the Release 0.11 link under documentation on the left
side of the page
▪ Cloudera training course: Cloudera Training for Data Analysts: Using Pig, Hive,
and Impala with Hadoop

And of course you can learn more by taking the Data Analyst course (mention relevant dates and locations).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-48
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-49
Which to Choose? (1)
▪ Choose the best solution for the given task
─ Mix and match as needed
▪ MapReduce
─ Low-level approach offers flexibility, control, and performance
─ More time-consuming and error-prone to write
─ Choose when control and performance are most important
▪ Pig, Hive, and Impala
─ Faster to write, test, and deploy than MapReduce
─ Better choice for most analysis and processing tasks

[Slide from Data Analyst course 17-11 as of 8/5/2013]


MapReduce is usually too low-level for most analytical tasks students will likely need to perform in their
jobs, but another case where MapReduce is a better choice than Pig or Hive is when you need to process
input data in binary formats (such as audio or video files). Although Pig (via bytearray) and Hive (via
BINARY) have data types to store and retrieve binary data, neither Pig Latin nor HiveQL have any real
support for processing it (at least when compared to the flexibility you would have doing this processing in
Java or another general purpose programming language).
Using MapReduce in Java has better performance than, and can allow for optimizations not available in,
Hadoop Streaming (i.e. custom combiners and partitioners) but requires you to write even more code.
Using Streaming with a scripting language like Python allows you to use external libraries available in that
language (e.g. libraries for parsing XML, genome sequencing, statistical packages, etc.) but you can also
process data through external scripts in Pig or Hive.
“productivity” means “human labor” in this context. In other words, it takes less time to write the code to
do typical analytical tasks in Pig, Hive, or Impala than it would to write the equivalent MapReduce code.
However, writing MapReduce code can sometimes be more efficient in terms of total runtime, since you
could do several operations (e.g. filtering, joining, and aggregating) in a single MapReduce job rather than
running several separate jobs.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-50
Which to Choose? (2)
▪ Use Impala when…
─ You need near real-time responses to ad hoc queries
─ You have structured data with a defined schema
▪ Use Hive or Pig when…
─ You need support for custom file types, or complex data types
▪ Use Pig when…
─ You have developers experienced with writing scripts
─ Your data is unstructured/semi-structured
▪ Use Hive When…
─ You have very complex long-running queries

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-51
Comparing Pig, Hive, and Impala

Description of Feature                 Pig        Hive       Impala
SQL-based query language               No         Yes        Yes
Schema                                 Optional   Required   Required
Process data with external scripts     Yes        Yes        Yes
Custom file format support             Yes        Yes        No
Query speed                            Slow       Slow       Fast
Accessible via ODBC/JDBC               No         Yes        Yes

Line by line comparison:


Hive and Impala use a similar SQL-like language for queries, which is familiar to data analysts, whereas Pig
uses a data flow language that will be more accessible to developers.
Hive and Impala both require a schema, which means you have to understand the structure of the data up
front, whereas Pig doesn’t, and therefore can be more flexible.
Both Pig and Hive allow you to create user-defined functions and/or process data with external scripts,
which allows for more flexibility.
Both also support complex data types and custom file types, again, more flexible.
Compared to Impala, Hive and Pig are very slow. (Still faster than traditional approaches to big data,
though.)
Hive and Impala allow access through industry standard database connectors via ODBC and JDBC.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-52
Do These Replace an RDBMS?
▪ Probably not if the RDBMS is used for its intended purpose
▪ Relational databases are optimized for:
─ Relatively small amounts of data
─ Immediate results
─ In-place modification of data
▪ Pig, Hive, and Impala are optimized for:
─ Large amounts of read-only data
─ Extensive scalability at low cost
▪ Pig and Hive are better suited for batch processing
─ Impala and RDBMSs are better for interactive use

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-53
Analysis Workflow Example

[slide from Data Analyst 17-12 July ‘13]


This workflow emphasizes key points we’ve made in the course. This illustrates how we might use all of the
tools we’ve studied. Key point: we create value by bringing together lots of diverse data and analyzing it
to find insight. In this example, we might bring in Web log data using an ETL process in Pig that sessionizes
the data (use case mentioned in chapter 3), import information about retail transactions from the sales
database using Sqoop (we did this in the first lab), and perform sentiment analysis on social media data
using Hive (we covered sentiment analysis in chapter 12 and did some work with product rating data in the
lab that followed it). All of this yields data in our Hadoop cluster, and we could analyze it interactively using
Impala and perhaps use Hive or Pig to do more batch-oriented processing such as producing reports that
we could push to our corporate intranet (this implies a batch process, so it could be done with either Pig or
Hive, but could likely also be done with Impala).
This ties together pieces of hands-on exercises (or tasks similar to them) – this workflow’s goal is to analyze
social media data to see which products people talk about, analyze Web logs to see which they look at,
and analyze sales transactions to see which they actually bought. This is known in the marketing world as
a conversion funnel http://en.wikipedia.org/wiki/Conversion_funnel, and improving it
by analyzing what works and what doesn’t might mean millions of additional dollars in revenue. This is also
a topic we explore further during Cloudera’s Introduction to Data Science: Building Recommender Systems
course.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-54
Chapter Topics
An Introduction to Hive, Impala, and
Pig

▪ The Motivation for Hive, Impala, and Pig


▪ Hive Basics
▪ Hands-On Exercise: Manipulating Data with Hive
▪ Impala Overview
▪ Pig Overview
▪ Choosing Between Hive, Pig, and Impala
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-55
Key Points
▪ Hive and Pig provide an abstraction layer above MapReduce
─ Easier to work with
─ Generates MapReduce jobs automatically
▪ Hive treats HDFS directories as tables
─ Uses HiveQL – a SQL like language for working with table data
▪ Impala provides near real-time queries
─ Uses its own execution engine instead of MapReduce
─ 10-50x faster than Hive or Pig
─ Similar query language to HiveQL
▪ Pig Latin is a data flow language
─ Requires no schema
─ Well-suited to semi-structured data

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-56
An Introduction to Oozie
Chapter 17

Chapter Goal
This chapter introduces Oozie and explains how to define and run Oozie workflows.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-2
An Introduction to Oozie
In this chapter, you will learn
▪ What Oozie is
▪ How to create Oozie workflows

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-3
Chapter Topics
An Introduction to Oozie

▪ Introduction to Oozie
▪ Creating Oozie workflows
▪ Hands-On Exercise: Running an Oozie Workflow
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-4
The Motivation for Oozie (1)
▪ Many problems cannot be solved with a single MapReduce job
▪ Instead, a workflow of jobs must be created
▪ Simple workflow:
─ Run Job A
─ Use output of Job A as input to Job B
─ Use output of Job B as input to Job C
─ Output of Job C is the final required output
▪ Easy if the workflow is linear like this
─ Can be created as standard Driver code

We’ve discussed this “job chaining” workflow pattern several times, including in the TF-IDF example.
You could implement this by simply updating your Driver to run the first job, then the second, then the
third. That works OK when you have a simple sequence of jobs.
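As a reminder of what that linear Driver approach looks like, here is a hedged sketch of one way to chain two jobs (the class names, mapper/reducer classes, and paths are hypothetical); each job’s output directory becomes the next job’s input, and the driver waits for each job to finish before starting the next:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WorkflowDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Job A: reads the raw input and writes to an intermediate directory
    Job jobA = Job.getInstance(getConf(), "Job A");
    jobA.setJarByClass(WorkflowDriver.class);
    jobA.setMapperClass(AMapper.class);          // hypothetical mapper
    jobA.setReducerClass(AReducer.class);        // hypothetical reducer
    jobA.setOutputKeyClass(Text.class);
    jobA.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(jobA, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobA, new Path("intermediate"));
    if (!jobA.waitForCompletion(true)) {
      return 1;                                  // stop the workflow if Job A fails
    }

    // Job B: uses Job A's output as its input
    Job jobB = Job.getInstance(getConf(), "Job B");
    jobB.setJarByClass(WorkflowDriver.class);
    jobB.setMapperClass(BMapper.class);          // hypothetical mapper
    jobB.setReducerClass(BReducer.class);        // hypothetical reducer
    jobB.setOutputKeyClass(Text.class);
    jobB.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(jobB, new Path("intermediate"));
    FileOutputFormat.setOutputPath(jobB, new Path(args[1]));
    return jobB.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WorkflowDriver(), args));
  }
}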

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-5
The Motivation for Oozie (2)
▪ If the workflow is more complex, Driver code becomes much more difficult to
maintain
▪ Example: running multiple jobs in parallel, using the output from all of those
jobs as the input to the next job
▪ Example: including Hive or Pig jobs as part of the workflow

However, defining this workflow in your driver doesn’t work well for non-sequential jobs, such as where
you need to run several jobs in parallel and then combine this output to form the input for a subsequent
job. It also doesn’t lend itself to a situation where you’d produce some data to be joined using Pig or Hive
and use that output as the input to a subsequent job.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-6
What is Oozie?
▪ Oozie is a ‘workflow engine’
▪ Runs on a server
─ Typically outside the cluster
▪ Runs workflows of Hadoop jobs
─ Including Pig, Hive, Sqoop jobs
─ Submits those jobs to the cluster based on a workflow definition
▪ Workflow definitions are submitted via HTTP
▪ Jobs can be run at specific times
─ One-off or recurring jobs
▪ Jobs can be run when data is present in a directory

Oozie is the Burmese word which describes a person who drives an elephant (this is the same thing
“mahout” means, only in Burmese). It’s a play on the name “Hadoop” (whose mascot is an elephant, based
on Doug Cutting’s son’s stuffed toy).
The recurring job feature is similar in concept to the UNIX cron system.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-7
Chapter Topics
An Introduction to Oozie

▪ Introduction to Oozie
▪ Creating Oozie workflows
▪ Hands-On Exercise: Running an Oozie Workflow
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-8
Oozie Workflow Basics
▪ Oozie workflows are written in XML
▪ Workflow is a collection of actions
─ MapReduce jobs, Pig jobs, Hive jobs etc.
▪ A workflow consists of control flow nodes and action nodes
▪ Control flow nodes define the beginning and end of a workflow
─ They provide methods to determine the workflow execution path
─ Example: Run multiple jobs simultaneously
▪ Action nodes trigger the execution of a processing task, such as
─ A MapReduce job
─ A Hive query
─ A Sqoop data import job

Oozie workflow definitions look vaguely similar to Ant build files, which has both positive connotations (it’s
self-describing) and negative ones (it’s incredibly verbose).

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-9
Simple Oozie Example (1)
▪ Simple example workflow for WordCount:

Here we see three control nodes (start, kill, and end) and one action node (which runs a MapReduce job).
This is a graphical representation of (just a visual aid to help explain) the information in the workflow XML
file we’ll see on the next screen.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-10
Simple Oozie Example (2)

 1 <workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">


 2 <start to='wordcount'/>
 3 <action name='wordcount'>
 4 <map-reduce>
 5 <job-tracker>${jobTracker}</job-tracker>
 6 <name-node>${nameNode}</name-node>
 7 <configuration>
 8 <property>
 9 <name>mapred.mapper.class</name>
10 <value>org.myorg.WordCount.Map</value>
11 </property>
12 <property>
13 <name>mapred.reducer.class</name>
14 <value>org.myorg.WordCount.Reduce</value>
15 </property>
16
file continued on next slide

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-11
Simple Oozie Example (3)

16 <property>
17 <name>mapred.input.dir</name>
18 <value>${inputDir}</value>
19 </property>
20 <property>
21 <name>mapred.output.dir</name>
22 <value>${outputDir}</value>
23 </property>
24 </configuration>
25 </map-reduce>
26 <ok to='end'/>
27 <error to='kill'/>
28 </action>
29 <kill name='kill'>
30 <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
31 </kill>
32 <end name='end'/>
33 </workflow-app>

This shows the workflow.xml represented by the graphic on the previous slide in its entirety. We cover each
part over the next several slides.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-12
Simple Oozie Example (4)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1"> 1

<start to='wordcount'/>
<action name='wordcount'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
... code removed for space ...
</configuration>
</map-reduce>
<ok to='end'/>
<error to='kill'/>
</action>
<kill name='kill'>
<message>Something went wrong:
${wf:errorCode('wordcount')}</message>
</kill>
<end name='end'/>
</workflow-app> 1

1 A workflow is wrapped in the workflow-app entity

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-13
Simple Oozie Example (5)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">


<start to='wordcount'/> 1
<action name='wordcount'>
<map-reduce>
... code removed for space ...
</map-reduce>
<ok to='end'/>
<error to='kill'/>
</action>
<kill name='kill'>
<message>Something went wrong:
${wf:errorCode('wordcount')}</message>
</kill>
<end name='end'/>
</workflow-app>

1 The start node is the control node which tells Oozie which workflow node
should be run first. There must be one start node in an Oozie workflow. In
our example, we are telling Oozie to start by transitioning to the wordcount
workflow node.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-14
Simple Oozie Example (6)
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
<start to='wordcount'/>
<action name='wordcount'> 1
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
... code removed for space ...
</configuration>
</map-reduce>
<ok to='end'/>
<error to='kill'/>
</action>
<kill name='kill'>
<message>Something went wrong:
${wf:errorCode('wordcount')}</message>
</kill>
<end name='end'/>
</workflow-app>

1 The wordcount action node defines a map-reduce action – a standard Java


MapReduce job.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-15
Simple Oozie Example (7)
... <map-reduce> 1
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>org.myorg.WordCount.Map</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.myorg.WordCount.Reduce</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
</map-reduce> ...
code removed for space
1 Within the action, we define the job’s properties.

The name of the mapper and reducer are hardcoded in this workflow XML file, but the ${inputDir}
and ${outputDir} are references to properties. These properties may be specified either in an external
properties file or on the command line when submitting the job to Oozie.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-16
Simple Oozie Example (8)
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
<start to='wordcount'/>
<action name='wordcount'>
<map-reduce>
... code removed for space ...
</map-reduce>
<ok to='end'/> 1
<error to='kill'/>
</action>
<kill name='kill'>
<message>Something went wrong:
${wf:errorCode('wordcount')}</message>
</kill>
<end name='end'/>
</workflow-app>

1 We specify what to do if the action ends successfully, and what to do if it


fails. In this example, if the job is successful we go to the end node. If it fails
we go to the kill node.

In this example, success transitions to the end node and failure transitions to the kill node. In a more
realistic example, we might handle the error condition differently (for example, it might transition to a node
which invokes Oozie’s ‘email’ action, thus notifying us of failure).
Although it’s grayed out, the ‘kill’ node can be used to terminate the workflow at an arbitrary point. This is
discussed in more detail two slides from now, but students sometimes ask about it here.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-17
Simple Oozie Example (9)

<workflow-app name='wordcount-wf'
xmlns="uri:oozie:workflow:0.1">
<start to='wordcount'/>
<action name='wordcount'>
<map-reduce>
... code removed for space ...
</map-reduce>
<ok to='end'/>
<error to='kill'/>
</action>
<kill name='kill'> 1
<message>Something went wrong:
${wf:errorCode('wordcount')}</message>
</kill>
<end name='end'/>
</workflow-app>

1 If the workflow reaches a kill node, it will kill all running actions and then
terminate with an error. A workflow can have zero or more kill nodes.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-18
Simple Oozie Example (10)

<workflow-app name='wordcount-wf'
xmlns="uri:oozie:workflow:0.1">
<start to='wordcount'/>
<action name='wordcount'>
<map-reduce>
... code removed for space ...
</map-reduce>
<ok to='end'/>
<error to='kill'/>
</action>
<kill name='kill'>
<message>Something went wrong:
${wf:errorCode('wordcount')}</message>
</kill>
<end name='end'/> 1
</workflow-app>

1 Every workflow must have an end node. This indicates that the workflow has
completed successfully.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-19
Other Oozie Control Nodes
▪ A decision control node allows Oozie to determine the workflow execution
path based on some criteria
─ Similar to a switch-case statement
▪ fork and join control nodes split one execution path into multiple
execution paths which run concurrently
─ fork splits the execution path
─ join waits for all concurrent execution paths to complete before
proceeding
─ fork and join are used in pairs

The InfoQ “Introduction to Oozie” article (http://www.infoq.com/articles/


introductionOozie) shows an example of fork and join.
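A minimal hedged sketch of the fork/join structure (the node and action names are invented, and the action bodies are omitted); both actions start when the fork is reached, and the join waits for both to complete before transitioning onward:

<fork name='parallel-steps'>
  <path start='job-one'/>
  <path start='job-two'/>
</fork>
<action name='job-one'>
  ... action definition removed for space ...
  <ok to='joining'/>
  <error to='kill'/>
</action>
<action name='job-two'>
  ... action definition removed for space ...
  <ok to='joining'/>
  <error to='kill'/>
</action>
<join name='joining' to='end'/>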

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-20
Oozie Workflow Action Nodes

Node Name Description

map-reduce Runs either a Java MapReduce or Streaming job

fs Create directories, move or delete files or directories

java Runs the main() method in the specified Java class as a single-
Map, Map-only job on the cluster

pig Runs a Pig script

hive Runs a Hive query

sqoop Runs a Sqoop job

email Sends an e-mail message

There is no HBase action (yet). The Hive integration was contributed by Cloudera.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-21
Submitting an Oozie Workflow
▪ To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie \
      -config config_file -run

▪ Oozie can also be called from within a Java program


─ Via the Oozie client API

Oozie workflows reside in a file named workflow.xml.


The config file contains a mandatory entry for oozie.wf.application.path. This is the path to the
workflow.xml file.
The config file can also contain other variables that are used when processing the Oozie workflow.
Note that in CDH 4.1.1, the run.sh file in the Oozie lab specifies the -auth simple parameter with
the oozie command. This parameter is needed to avoid a null pointer exception. Refer to Oozie bug
https://issues.apache.org/jira/browse/OOZIE-1010 for more information.
This bug is fixed in CDH 4.3.1.
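A hedged sketch of what such a config file might contain for the WordCount workflow shown earlier (the host names, ports, and paths are placeholders):

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
inputDir=/user/training/input
outputDir=/user/training/output
oozie.wf.application.path=${nameNode}/user/training/wordcount-wf

The first four entries supply values for the ${jobTracker}, ${nameNode}, ${inputDir}, and ${outputDir} references in workflow.xml; the last entry tells Oozie where in HDFS to find the workflow definition.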

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-22
More on Oozie

Information                              Resource

Oozie installation and configuration     CDH Installation Guide: http://docs.cloudera.com

Oozie workflows and actions              https://oozie.apache.org

Running a MapReduce job using Oozie      https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

Oozie examples                           Included in the Oozie distribution; instructions for
                                         running them: http://oozie.apache.org/docs/3.2.0-
                                         incubating/DG_Examples.html

Oozie examples are bundled within the Oozie distribution in the oozie-examples.tar.gz file. On the course
VM, this is located here:
/usr/share/doc/oozie-3.2.0+123/oozie-examples.tar.gz

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-23
Chapter Topics
An Introduction to Oozie

▪ Introduction to Oozie
▪ Creating Oozie workflows
▪ Hands-On Exercise: Running an Oozie Workflow
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-24
Hands-On Exercise: Running an Oozie Workflow
▪ In this Hands-On Exercise you will run Oozie jobs
▪ Please refer to the Hands-On Exercise Manual

NOTE: a common problem in this lab is where a student will copy and paste the command from the
PDF, so the command may contain special typographic characters (such as a long dash ‘–’) instead of the
actual characters intended (such as the minus sign ‘-’). This distinction can be subtle, so your first step in
troubleshooting should be to type the command verbatim instead of copying and pasting it.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-25
Chapter Topics
An Introduction to Oozie

▪ Introduction to Oozie
▪ Creating Oozie workflows
▪ Hands-On Exercise: Running an Oozie Workflow
▪ Conclusion

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-26
Key Points
▪ Oozie is a workflow engine for Hadoop
▪ Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig
scripts, and HDFS file manipulation

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-27
Conclusion
Chapter 18

Chapter Goal
This chapter concludes the course.
Course Chapters
▪ Introduction
▪ The Motivation for Hadoop
▪ Hadoop Basic Concepts and HDFS
▪ Introduction to MapReduce
▪ Hadoop Clusters and the Hadoop Ecosystem
▪ Writing a MapReduce Program in Java
▪ Writing a MapReduce Program Using Streaming
▪ Unit Testing MapReduce Programs
▪ Delving Deeper into the Hadoop API
▪ Practical Development Tips and Techniques
▪ Partitioners and Reducers
▪ Data Input and Output
▪ Common MapReduce Algorithms
▪ Joining Data Sets in MapReduce Jobs
▪ Integrating Hadoop into the Enterprise Workflow
▪ An Introduction to Hive, Impala, and Pig
▪ An Introduction to Oozie
▪ Conclusion
▪ Appendix: Cloudera Enterprise

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-2
Course Objectives (1)
During this course, you have learned
▪ The core technologies of Hadoop
▪ How HDFS and MapReduce work
▪ How to develop and unit test MapReduce applications
▪ How to use MapReduce combiners, partitioners, and the distributed cache
▪ Best practices for developing and debugging MapReduce applications

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-3
Course Objectives (2)
▪ How to implement custom data input and output formats in MapReduce
applications
▪ Algorithms for common MapReduce tasks
▪ How to join datasets in MapReduce
▪ How Hadoop integrates into the data center
▪ How Hive, Impala, and Pig can be used for rapid application development
▪ How to create workflows using Oozie

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-4
Cloudera Enterprise
Appendix A

Chapter Goal
This appendix provides an overview of Cloudera Enterprise and Cloudera Manager.
Cloudera Enterprise
▪ Cloudera Enterprise
─ Subscription product
including CDH and
Cloudera Manager
▪ Extra Manager features
─ Rolling upgrades
─ SNMP support
─ LDAP integration
─ Etc.
▪ Includes support
─ Add-on support
modules: Impala,
HBase, Backup and
Disaster Recovery,
Cloudera Navigator

[Copied from Essentials 201309 ch 8]


Cloudera Enterprise is a subscription service we offer to make your Hadoop deployment successful. With it,
you’ll get an Enterprise Edition of Cloudera Manager, which offers everything you get in Cloudera Standard,
plus adds support for tracking configuration changes, enhanced user administration, extensive service
monitoring and integration with our support services.
And with Cloudera Enterprise, our experienced support staff is available when you need them – 24x7
support is available.

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. A-2
Cloudera Manager (1)
▪ Wizard-based installation of Hadoop
▪ Real-time monitoring of nodes and services
 
▪ Live configuration
management
▪ Validation and error checking
▪ Automated expansion of
Hadoop services when new
nodes are added
▪ Included in Cloudera Standard
(free) and Cloudera Enterprise

[Copied from Essentials 201309 6-13]


Instructors are strongly encouraged to watch the following short videos to become more familiar with the
features available in Cloudera Manager. These videos are available to the public, so you can reference them
in class (and even use a video to demonstrate a particular feature) as needed:
http://www.cloudera.com/blog/2012/02/cloudera-manager-service-and-
configuration-management-demo-videos/
http://www.cloudera.com/blog/2012/02/cloudera-manager-log-management-
event-management-and-alerting-demo-video/
http://www.cloudera.com/blog/2012/02/cloudera-manager-hadoop-service-
monitoring-demo-video/
http://www.cloudera.com/blog/2012/03/cloudera-manager-activity-
monitoring-operational-reports-demo-video/

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. A-3
Cloudera Manager (2)

[Copied from Essentials 201309 6-14]

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. A-4
Key Points
▪ Cloudera Enterprise makes it easy to run open source Hadoop in production
▪ Includes
─ CDH (Cloudera’s Distribution including Apache Hadoop)
─ Cloudera Manager
─ Production Support
▪ Cloudera Manager enables you to:
─ Simplify and accelerate Hadoop deployment
─ Reduce the costs and risks of adopting Hadoop in production
─ Reliably operate Hadoop in production with repeatable success
─ Apply SLAs to Hadoop
─ Increase control over Hadoop cluster provisioning and management

Copyright © 2010–2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. A-5
201403

Cloudera Developer Training for
Apache Hadoop: Hands-On Exercises
Instructor Guide
Table of Contents
General Notes ............................................................................................................. 1
Hands-On Exercise: Using HDFS .............................................................................. 3
Hands-On Exercise: Running a MapReduce Job ...................................................... 8
Hands-On Exercise: Writing a MapReduce Java Program .................................... 12
Hands-On Exercise: More Practice With MapReduce Java Programs ................. 19
Optional Hands-On Exercise: Writing a MapReduce Streaming Program .......... 20
Hands-On Exercise: Writing Unit Tests With the MRUnit Framework ............... 23
Hands-On Exercise: Using ToolRunner and Passing Parameters ........................ 24
Optional Hands-On Exercise: Using a Combiner .................................................. 26
Hands-On Exercise: Testing with LocalJobRunner ............................................... 27
Optional Hands-On Exercise: Logging ................................................................... 30
Hands-On Exercise: Using Counters and a Map-Only Job .................................... 33
Hands-On Exercise: Writing a Partitioner ............................................................. 34
Hands-On Exercise: Implementing a Custom WritableComparable ................... 37
Hands-On Exercise: Using SequenceFiles and File Compression ........................ 39
Hands-On Exercise: Creating an Inverted Index ................................................... 43
Hands-On Exercise: Calculating Word Co-Occurrence ......................................... 46
Hands-On Exercise: Importing Data With Sqoop ................................................. 48
Hands-On Exercise: Manipulating Data With Hive ............................................... 51
Hands-On Exercise: Running an Oozie Workflow ................................................ 56
Bonus Exercises ....................................................................................................... 58
Bonus Exercise: Exploring a Secondary Sort Example ......................................... 59

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. ii
1

General Notes
Cloudera’s training courses use a Virtual Machine running the CentOS 6.3 Linux
distribution. This VM has CDH (Cloudera’s Distribution, including Apache Hadoop)
installed in Pseudo-Distributed mode. Pseudo-Distributed mode is a method of running
Hadoop whereby all Hadoop daemons run on the same machine. It is, essentially, a
cluster consisting of a single machine. It works just like a larger Hadoop cluster, the
only key difference (apart from speed, of course!) being that the block replication factor
is set to 1, since there is only a single DataNode available.

Getting Started
1. The VM is set to automatically log in as the user training. Should you log out at
any time, you can log back in as the user training with the password training.

Working with the Virtual Machine


1. Should you need it, the root password is training. You may be prompted for this
if, for example, you want to change the keyboard layout. In general, you should not
need this password since the training user has unlimited sudo privileges.

2. In some command-line steps in the exercises, you will see lines like this:

$ hadoop fs -put shakespeare \
/user/training/shakespeare

The dollar sign ($) at the beginning of each line indicates the Linux shell
prompt. The actual prompt will include additional information (e.g.,
[training@localhost workspace]$ ) but this is omitted from these
instructions for brevity.
The backslash (\) at the end of the first line signifies that the command is not
completed, and continues on the next line. You can enter the code exactly as shown
(on two lines), or you can enter it on a single line. If you do the latter, you should
not type in the backslash.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 1
2

Points to note during the exercises


1. For most exercises, three folders are provided. Which you use will depend on how
you would like to work on the exercises:

• stubs: contains minimal skeleton code for the Java classes you’ll need to write.
These are best for those with Java experience.

• hints: contains Java class stubs that include additional hints about what’s
required to complete the exercise. These are best for developers with limited
Java experience.

• solution: Fully implemented Java code which may be run “as-is”, or you may
wish to compare your own solution to the examples provided.

2. As the exercises progress, and you gain more familiarity with Hadoop and
MapReduce, we provide fewer step-by-step instructions; as in the real world,
we merely give you a requirement and it’s up to you to solve the problem! You
should feel free to refer to the hints or solutions provided, ask your instructor for
assistance, or consult with your fellow students!

3. There are additional challenges for some of the Hands-On Exercises. If you finish
the main exercise, please attempt the additional steps.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 2
3

Hands-On Exercise: Using HDFS


Files Used in This Exercise:
Data files (local)
~/training_materials/developer/data/shakespeare.tar.gz
~/training_materials/developer/data/access_log.gz

In this exercise you will begin to get acquainted with the Hadoop tools. You will
manipulate files in HDFS, the Hadoop Distributed File System.

Set Up Your Environment


1. Before starting the exercises, run the course setup script in a terminal window:

$ ~/scripts/developer/training_setup_dev.sh

Hadoop
Hadoop is already installed, configured, and running on your virtual machine.
Most of your interaction with the system will be through a command-line wrapper
called hadoop. If you run this program with no arguments, it prints a help message. To
try this, run the following command in a terminal window:

$ hadoop

The hadoop command is subdivided into several subsystems. For example, there is
a subsystem for working with files in HDFS and another for launching and managing
MapReduce processing jobs.

Step 1: Exploring HDFS


The subsystem associated with HDFS in the Hadoop wrapper program is called
FsShell. This subsystem can be invoked with the command hadoop fs.

1. Open a terminal window (if one is not already open) by double-clicking the
Terminal icon on the desktop.

2. In the terminal window, enter:

$ hadoop fs


You see a help message describing all the commands associated with the FsShell
subsystem.

3. Enter:

$ hadoop fs -ls /

This shows you the contents of the root directory in HDFS. There will be multiple
entries, one of which is /user. Individual users have a “home” directory under this
directory, named after their username; your username in this course is training,
therefore your home directory is /user/training.

4. Try viewing the contents of the /user directory by running:

$ hadoop fs -ls /user

You will see your home directory in the directory listing.

5. List the contents of your home directory by running:

$ hadoop fs -ls /user/training

There are no files yet, so the command silently exits. This is different from running
hadoop fs -ls /foo, which refers to a directory that does not exist and would
display an error message.
Note that the directory structure in HDFS has nothing to do with the directory
structure of the local filesystem; they are completely separate namespaces.

Step 2: Uploading Files


Besides browsing the existing filesystem, another important thing you can do with
FsShell is to upload new data into HDFS.

1. Change directories to the local filesystem directory containing the sample data we
will be using in the course.

$ cd ~/training_materials/developer/data

If you perform a regular Linux ls command in this directory, you will see a few
files, including two named shakespeare.tar.gz and


shakespeare-stream.tar.gz. Both of these contain the complete works of


Shakespeare in text format, but with different formats and organizations. For now
we will work with shakespeare.tar.gz.

2. Unzip shakespeare.tar.gz by running:

$ tar zxvf shakespeare.tar.gz

This creates a directory named shakespeare/ containing several files on your


local filesystem.

3. Insert this directory into HDFS:

$ hadoop fs -put shakespeare /user/training/shakespeare

This copies the local shakespeare directory and its contents into a remote, HDFS
directory named /user/training/shakespeare.

4. List the contents of your HDFS home directory now:

$ hadoop fs -ls /user/training

You should see an entry for the shakespeare directory.

5. Now try the same fs -ls command but without a path argument:

$ hadoop fs -ls

You should see the same results. If you don’t pass a directory name to the -ls
command, it assumes you mean your home directory, i.e. /user/training.

Relative paths
If you pass any relative (non-absolute) paths to FsShell
commands (or use relative paths in MapReduce programs), they
are considered relative to your home directory.

6. We also have a Web server log file, which we will put into HDFS for use in future
exercises. This file is currently compressed using GZip. Rather than extract the file


to the local disk and then upload it, we will extract and upload in one step. First,
create a directory in HDFS in which to store it:

$ hadoop fs -mkdir weblog

7. Now, extract and upload the file in one step. The -c option to gunzip
uncompresses to standard output, and the dash (-) in the hadoop fs -put
command takes whatever is being sent to its standard input and places that data in
HDFS.

$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log

8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home
directory.

9. The access log file is quite large – around 500 MB. Create a smaller version of this
file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You
can use the smaller version for testing in subsequent exercises.

$ hadoop fs -mkdir testlog


$ gunzip -c access_log.gz | head -n 5000 \
| hadoop fs -put - testlog/test_access_log

Step 3: Viewing and Manipulating Files


Now let’s view some of the data you just copied into HDFS.

1. Enter:

$ hadoop fs -ls shakespeare

This lists the contents of the /user/training/shakespeare HDFS directory,


which consists of the files comedies, glossary, histories, poems, and
tragedies.

2. The glossary file included in the compressed file you began with is not strictly a
work of Shakespeare, so let’s remove it:

$ hadoop fs -rm shakespeare/glossary


Note that you could leave this file in place if you so wished. If you did, then it would
be included in subsequent computations across the works of Shakespeare, and
would skew your results slightly. As with many real-world big data problems, you
make trade-offs between the labor to purify your input data and the precision of
your results.

3. Enter:

$ hadoop fs -cat shakespeare/histories | tail -n 50

This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is
handy for viewing the output of MapReduce programs. Very often, an individual
output file of a MapReduce program is very large, making it inconvenient to view
the entire file in the terminal. For this reason, it’s often a good idea to pipe the
output of the fs -cat command into head, tail, more, or less.

4. To download a file to work with on the local filesystem use the fs -get command.
This command takes two arguments: an HDFS path and a local path. It copies the
HDFS contents into the local filesystem:

$ hadoop fs -get shakespeare/poems ~/shakepoems.txt


$ less ~/shakepoems.txt

Other Commands
There are several other operations available with the hadoop fs command to
perform most common filesystem manipulations: mv, cp, mkdir, etc.

1. Enter:

$ hadoop fs

This displays a brief usage report of the commands available within FsShell. Try
playing around with a few of these commands if you like.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 7
8

Hands-On Exercise: Running a MapReduce Job


Files and Directories Used in this Exercise
Source directory: ~/workspace/wordcount/src/solution
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program

In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to
launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop
loads the JAR into HDFS and distributes it to the worker nodes, where the individual
tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each
word in a file or set of files. In this lab you will compile and submit a MapReduce job to
count the number of occurrences of every word in the works of Shakespeare.

Compiling and Submitting a MapReduce Job


1. In a terminal window, change to the exercise source directory, and list the contents:

$ cd ~/workspace/wordcount/src
$ ls

This directory contains three “package” subdirectories: solution, stubs and


hints. In this example we will be using the solution code, so list the files in the
solution package directory:

$ ls solution

The package contains the following Java files:


WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
Examine these files if you wish, but do not change them. Remain in this directory
while you execute the following commands.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 8
9

2. Before compiling, examine the classpath Hadoop is configured to use:

$ hadoop classpath

This lists the locations where the Hadoop core API classes are installed.

3. Compile the three Java classes:

$ javac -classpath `hadoop classpath` solution/*.java

Note: in the command above, the quotes around hadoop classpath are
backquotes. This runs the hadoop classpath command and uses its output
as part of the javac command.
The compiled (.class) files are placed in the solution directory.

4. Collect your compiled Java files into a JAR file:

$ jar cvf wc.jar solution/*.class

5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of
each word in Shakespeare:

$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts

This hadoop jar command names the JAR file to use (wc.jar), the class whose
main method should be invoked (solution.WordCount), and the HDFS input
and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory, and places its
output in a new HDFS directory called wordcounts.

6. Try running this same command again without any change:

$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts

Your job halts right away with an exception, because Hadoop automatically fails if
your job tries to write its output into an existing directory. This is by design; since
the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you
from accidentally overwriting previously existing files.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 9
10

7. Review the result of your MapReduce job:

$ hadoop fs -ls wordcounts

This lists the output files for your job. (Your job ran with only one Reducer, so
there should be one file, named part-r-00000, along with a _SUCCESS file and a
_logs directory.)

8. View the contents of the output for your job:

$ hadoop fs -cat wordcounts/part-r-00000 | less

You can page through a few screens to see words and their frequencies in the works
of Shakespeare. (The spacebar will scroll the output by one screen; the letter ’q’ will
quit the less utility.) Note that you could have specified wordcounts/* just as
well in this command.

Wildcards in HDFS file paths


Take care when using wildcards (e.g. *) when specifying HDFS
filenames; because of how Linux works, the shell will attempt
to expand the wildcard before invoking hadoop, and then pass
incorrect references to local files instead of HDFS files. You can
prevent this by enclosing the wildcarded HDFS filenames in single
quotes, e.g. hadoop fs -cat 'wordcounts/*'

9. Try running the WordCount job against a single file:

$ hadoop jar wc.jar solution.WordCount \
shakespeare/poems pwords

When the job completes, inspect the contents of the pwords HDFS directory.

10. Clean up the output files produced by your job runs:

$ hadoop fs -rm -r wordcounts pwords

Stopping MapReduce Jobs


It is important to be able to stop jobs that are already running. This is useful if, for
example, you accidentally introduced an infinite loop into your Mapper. An important


point to remember is that pressing ^C to kill the current process (which is displaying
the MapReduce job’s progress) does not actually stop the job itself.
A MapReduce job, once submitted to Hadoop, runs independently of the initiating
process, so losing the connection to the initiating process does not kill the job. Instead,
you need to tell the Hadoop JobTracker to stop the job.

11. Start another word count job like you did in the previous section:

$ hadoop jar wc.jar solution.WordCount shakespeare \
count2

12. While this job is running, open another terminal window and enter:

$ mapred job -list

This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002

13. Copy the job id, and then kill the running job by entering:

$ mapred job -kill jobid

The JobTracker kills the job, and the program running in the original terminal
completes.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 11
12

Hands-On Exercise: Writing a MapReduce Java


Program
Projects and Directories Used in this Exercise
Eclipse project: averagewordlength
Java files:
AverageReducer.java (Reducer)
LetterMapper.java (Mapper)
AvgWordLength.java (driver)
Test data (HDFS):
shakespeare
Exercise directory: ~/workspace/averagewordlength

In this exercise you write a MapReduce job that reads any text input and
computes the average length of all words that start with each character.
For any text input, the job should report the average length of words that begin with ‘a’,
‘b’, and so forth. For example, for input:

No now is definitely not the time

The output would be:

N 2.0
d 10.0
i 2.0
n 3.0
t 3.5

(For the initial solution, your program should be case-sensitive as shown in this
example.)

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 12
13

The Algorithm
The algorithm for this program is a simple one-pass MapReduce program:
The Mapper
The Mapper receives a line of text for each input value. (Ignore the input key.) For each
word in the line, emit the first letter of the word as a key, and the length of the word as
a value. For example, for input value:

No now is definitely not the time

Your Mapper should emit:

N 2
n 3
i 2
d 10
n 3
t 3
t 4

The Reducer
Thanks to the shuffle and sort phase built in to MapReduce, the Reducer receives the
keys in sorted order, and all the values for one key are grouped together. So, for the
Mapper output above, the Reducer receives this:

N (2)
d (10)
i (2)
n (3,3)
t (3,4)

The Reducer output should be:

N 2.0
d 10.0
i 2.0
n 3.0
t 3.5

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 13
14

Step 1: Start Eclipse


We have created Eclipse projects for each of the Hands-On Exercises that use Java.
We encourage you to use Eclipse in this course. Using Eclipse will speed up your
development time.

1. Be sure you have run the course setup script as instructed in the General Notes
section at the beginning of this manual. This script sets up the exercise workspace
and copies in the Eclipse projects you will use for the remainder of the course.

2. Start Eclipse using the icon on your VM desktop. The projects for this course will
appear in the Project Explorer on the left.

Step 2: Write the Program in Java


We’ve provided stub files for each of the Java classes for this exercise:
LetterMapper.java (the Mapper), AverageReducer.java (the Reducer), and
AvgWordLength.java (the driver).
If you are using Eclipse, open the stub files (located in the src/stubs package) in
the averagewordlength project. If you prefer to work in the shell, the files are in
~/workspace/averagewordlength/src/stubs.
You may wish to refer back to the wordcount example (in the wordcount project in
Eclipse or in ~/workspace/wordcount) as a starting point for your Java code. Here
are a few details to help you begin your Java programming:

3. Define the driver


This class should configure and submit your basic job. Among the basic steps here,
configure the job with the Mapper class and the Reducer class you will write, and
the data types of the intermediate and final keys.

4. Define the Mapper


Note these simple string operations in Java:

str.substring(0, 1) // String : first letter of str


str.length() // int : length of str

5. Define the Reducer


In a single invocation the reduce() method receives a string containing one letter
(the key) along with an iterable collection of integers (the values), and should emit
a single key-value pair: the letter and the average of the integers. (A sketch of one
possible Mapper and Reducer follows step 6 below.)

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 14
15

6. Compile your classes and assemble the jar file


To compile and jar, you may either use the command line javac command as you
did earlier in the “Running a MapReduce Job” exercise, or follow the steps below
(“Using Eclipse to Compile Your Solution”) to use Eclipse.
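
If you get stuck on steps 4 and 5, the following minimal sketch shows one possible
shape for the Mapper and Reducer. The two classes are shown together here for
brevity (in the project they live in separate files), the word-splitting logic is
illustrative, and the sketch assumes the driver declares Text/IntWritable as the
intermediate types and Text/DoubleWritable as the final output types:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit (first letter, word length) for every word in the line.
    for (String word : value.toString().split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word.substring(0, 1)),
            new IntWritable(word.length()));
      }
    }
  }
}

class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Average the lengths of all words that begin with this letter.
    long sum = 0;
    long count = 0;
    for (IntWritable length : values) {
      sum += length.get();
      count++;
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}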

Step 3: Use Eclipse to Compile Your Solution


Follow these steps to use Eclipse to complete this exercise.
Note: These same steps will be used for all subsequent exercises. The instructions
will not be repeated each time, so take note of the steps.

1. Verify that your Java code does not have any compiler errors or warnings.
The Eclipse software in your VM is pre-configured to compile code automatically
without performing any explicit steps. Compile errors and warnings appear as red
and yellow icons to the left of the code.

2. In the Package Explorer, open the Eclipse project for the current exercise (i.e.
averagewordlength). Right-click the default package under the src entry and
select Export.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 15
16

3. Select Java > JAR file from the Export dialog box, then click Next.

4. Specify a location for the JAR file. You can place your JAR files wherever you like.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 16
17

Note: for more information about using Eclipse in this course, see the Eclipse Exercise
Guide.

Step 4: Test your program


1. In a terminal window, change to the directory where you placed your JAR file. Run
the hadoop jar command as you did previously in the “Running a MapReduce
Job” exercise. Make sure you use the correct package name depending on whether
you are working with the provided stubs, stubs with additional hints, or just
running the solution as is.
(Throughout the remainder of the exercises, the instructions will assume you are
working in the stubs package. Remember to replace this with the correct package
name if you are using hints or solution.)

$ hadoop jar avgwordlength.jar stubs.AvgWordLength \
shakespeare wordlengths

2. List the results:

$ hadoop fs -ls wordlengths

A single reducer output file should be listed.

3. Review the results:

$ hadoop fs -cat wordlengths/*


The file should list all the numbers and letters in the data set, and the average
length of the words starting with them, e.g.:

1 1.02
2 1.0588235294117647
3 1.0
4 1.5
5 1.5
6 1.5
7 1.0
8 1.5
9 1.0
A 3.891394576646375
B 5.139302507836991
C 6.629694233531706

This example uses the entire Shakespeare dataset for your input; you can also try it
with just one of the files in the dataset, or with your own test data.

Solution
You can view the code for the solution in Eclipse in the
averagewordlength/src/solution folder.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 18
19

Hands-On Exercise: More Practice With MapReduce


Java Programs
Files and Directories Used in this Exercise
Eclipse project: log_file_analysis
Java files: SumReducer.java – the Reducer
LogFileMapper.java – the Mapper
ProcessLogs.java – the driver class
Test data (HDFS):
weblog (full version)
testlog (test sample set)
Exercise directory: ~/workspace/log_file_analysis

In this exercise, you will analyze a log file from a web server to count the number
of hits made from each unique IP address.
Your task is to count the number of hits made from each IP address in the sample
(anonymized) web server log file that you uploaded to the /user/training/
weblog directory in HDFS when you completed the “Using HDFS” exercise.
In the log_file_analysis directory, you will find stubs for the Mapper and Driver.

1. Using the stub files in the log_file_analysis project directory, write Mapper
and Driver code to count the number of hits made from each IP address in the
access log file. Your final result should be a file in HDFS containing each IP address,
and the count of log hits from that address. Note: The Reducer for this exercise
performs the exact same function as the one in the WordCount program you
ran earlier. You can reuse that code or you can write your own if you prefer.

2. Build your application jar file following the steps in the previous exercise.

3. Test your code using the sample log data in the /user/training/weblog
directory. Note: You may wish to test your code against the smaller version of
the access log you created in a prior exercise (located in the /user/training/
testlog HDFS directory) before you run your code against the full log which can
be quite time consuming.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 19
20

Optional Hands-On Exercise: Writing a MapReduce


Streaming Program
Files and Directories Used in this Exercise
Project directory: ~/workspace/averagewordlength
Test data (HDFS):
shakespeare

In this exercise you will repeat the task from the “Writing a MapReduce Java
Program” exercise: writing a program to calculate average word lengths for
letters. However, you will write this as a Streaming program using a scripting
language of your choice rather than Java.
Your virtual machine has Perl, Python, PHP, and Ruby installed, so you can choose any
of these—or even shell scripting—to develop a Streaming solution.
For your Hadoop Streaming program you will not use Eclipse. Launch a text editor to
write your Mapper script and your Reducer script. Here are some notes about solving
the problem in Hadoop Streaming:

1. The Mapper Script


The Mapper will receive lines of text on stdin. Find the words in the lines to
produce the intermediate output, and emit intermediate (key, value) pairs by
writing strings of the form:

key <tab> value <newline>

These strings should be written to stdout.

2. The Reducer Script


For the reducer, multiple values with the same key are sent to your script on stdin
as successive lines of input. Each line contains a key, a tab, a value, and a newline.
All lines with the same key are sent one after another, possibly followed by lines
with a different key, until the reducing input is complete. For example, the reduce
script may receive the following:

t 3
t 4
w 4
w 6


For this input, emit the following to stdout:

t 3.5
w 5.0

Observe that the reducer receives a key with each input line, and must “notice”
when the key changes on a subsequent line (or when the input is finished) to know
when the values for a given key have been exhausted. This is different than the Java
version you worked on in the previous exercise.

3. Run the streaming program:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/\
contrib/streaming/hadoop-streaming*.jar \
-input inputDir -output outputDir \
-file pathToMapScript -file pathToReduceScript \
-mapper mapBasename -reducer reduceBasename

(Remember, you may need to delete any previous output before running your
program with hadoop fs -rm -r dataToDelete.)

4. Review the output in the HDFS directory you specified (outputDir).

Note: The Perl example that was covered in class is in


~/workspace/wordcount/perl_solution.

Solution in Python
You can find a working solution to this exercise written in Python in the directory
~/workspace/averagewordlength/python_sample_solution.
To run the solution, change directory to ~/workspace/averagewordlength and
run this command:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce\
/contrib/streaming/hadoop-streaming*.jar \
-input shakespeare -output avgwordstreaming \
-file python_sample_solution/mapper.py \
-file python_sample_solution/reducer.py \
-mapper mapper.py -reducer reducer.py

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 21
22

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 22
23

Hands-On Exercise: Writing Unit Tests With the


MRUnit Framework
Projects Used in this Exercise
Eclipse project: mrunit
Java files:
SumReducer.java (Reducer from WordCount)
WordMapper.java (Mapper from WordCount)
TestWordCount.java (Test Driver)

In this Exercise, you will write Unit Tests for the WordCount code.

1. Launch Eclipse (if necessary) and expand the mrunit folder.

2. Examine the TestWordCount.java file in the mrunit project stubs package.


Notice that three tests have been created, one each for the Mapper, Reducer, and
the entire MapReduce flow. Currently, all three tests simply fail.

3. Run the tests by right-clicking on TestWordCount.java in the Package Explorer


panel and choosing Run As > JUnit Test.

4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab)
should indicate that three tests ran with three failures.

5. Now implement the three tests. (If you need hints, refer to the code in the hints or
solution packages, or to the sketch following these steps.)

6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with
no failures.

7. When you are done, close the JUnit tab.
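
If you would like a starting point for step 5, the following sketch shows a Mapper
test and a Reducer test written with the fluent MRUnit API
(org.apache.hadoop.mrunit.mapreduce). It is written as a separate class so it does
not collide with the provided TestWordCount stub, and it assumes WordMapper emits
(word, 1) pairs in the order the words appear on the line:

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class TestWordCountSketch {

  @Test
  public void testMapper() {
    // Feed one line to the Mapper and assert the expected (word, 1) pairs, in order.
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordMapper())
        .withInput(new LongWritable(1), new Text("cat cat dog"))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("dog"), new IntWritable(1))
        .runTest();
  }

  @Test
  public void testReducer() {
    // Feed one key with two values and assert the summed output.
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
        .withReducer(new SumReducer())
        .withInput(new Text("cat"),
            Arrays.asList(new IntWritable(1), new IntWritable(1)))
        .withOutput(new Text("cat"), new IntWritable(2))
        .runTest();
  }
}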

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 23
24

Hands-On Exercise: Using ToolRunner and Passing


Parameters
Files and Directories Used in this Exercise
Eclipse project: toolrunner
Java files:
AverageReducer.java (Reducer from AverageWordLength)
LetterMapper.java (Mapper from AverageWordLength)
AvgWordLength.java (driver from AverageWordLength)
Exercise directory: ~/workspace/toolrunner

In this Exercise, you will implement a driver using ToolRunner.


Follow the steps below to start with the Average Word Length program you wrote in
an earlier exercise, and modify the driver to use ToolRunner. Then modify the Mapper
to reference a Boolean parameter called caseSensitive; if true, the mapper should
treat upper and lower case letters as different; if false or unset, all letters should be
converted to lower case.

Modify the Average Word Length Driver to use ToolRunner


1. Copy the Mapper, Reducer, and driver code you completed earlier in the “Writing a
MapReduce Java Program” exercise (in the averagewordlength project) into this
project. (If you did not complete that exercise, use the code from the solution
package.)

Copying Source Files


You can use Eclipse to copy a Java source file from one project
or package to another by right-clicking on the file and selecting
Copy, then right-clicking the new package and selecting Paste.
If the packages have different names (e.g. if you copy from
averagewordlength.solution to toolrunner.stubs),
Eclipse will automatically change the package directive at the top
of the file. If you copy the file using a file browser or the shell, you
will have to do that manually.

2. Modify the AvgWordLength driver to use ToolRunner. Refer to the slides for
details; a minimal driver sketch also follows step 3 below.

a. Implement the run method

b. Modify main to call run

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 24
25

3. Jar your solution and test it before continuing; it should continue to function exactly
as it did before. Refer to the Writing a Java MapReduce Program exercise for how to
assemble and test if you need a reminder.
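
For reference (see step 2 above), a minimal ToolRunner-based driver might look like
the sketch below. Class and type names match the Average Word Length exercise, but
your completed driver may differ in details; this sketch also uses the
new Job(Configuration) constructor, while newer Hadoop releases prefer
Job.getInstance():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvgWordLength extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options (such as -D properties) before calling run().
    int exitCode = ToolRunner.run(new Configuration(), new AvgWordLength(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());          // use the configuration ToolRunner prepared
    job.setJarByClass(AvgWordLength.class);
    job.setJobName("Average Word Length");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LetterMapper.class);
    job.setReducerClass(AverageReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }
}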

Modify the Mapper to use a configuration parameter


4. Modify the LetterMapper class to

a. Override the setup method to get the value of a configuration parameter called
caseSensitive, and use it to set a member variable indicating whether to do
case sensitive or case insensitive processing.

b. In the map method, choose whether to do case sensitive processing (leave the
letters as-is), or insensitive processing (convert all letters to lower-case) based
on that variable.
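
A minimal sketch of steps 4a and 4b is shown below. It assumes the configuration
parameter is named caseSensitive, as described above, and that your Mapper emits
(first letter, word length) pairs; your word-splitting logic may differ:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private boolean caseSensitive = false;

  @Override
  protected void setup(Context context) {
    // Read the parameter once per task; default to false (case-insensitive) if unset.
    caseSensitive = context.getConfiguration().getBoolean("caseSensitive", false);
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word.substring(0, 1)),
            new IntWritable(word.length()));
      }
    }
  }
}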

Pass a parameter programmatically


5. Modify the driver’s run method to set a Boolean configuration parameter called
caseSensitive. (Hint: use the Configuration.setBoolean method.)

6. Test your code twice, once passing false and once passing true. When set to
true, your final output should have both upper and lower case letters; when false, it
should have only lower case letters.
Hint: Remember to rebuild your Jar file to test changes to your code.

Pass a parameter as a runtime parameter


7. Comment out the code that sets the parameter programmatically. (Eclipse hint:
select the code to comment and then select Source > Toggle Comment). Test again,
this time passing the parameter value using -D on the Hadoop command line, e.g.:

$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-DcaseSensitive=true shakespeare toolrunnerout

8. Test passing both true and false to confirm the parameter works correctly.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 25
26

Optional Hands-On Exercise: Using a Combiner


Files and Directories Used in this Exercise
Eclipse project: combiner
Java files:
WordCountDriver.java (Driver from WordCount)
WordMapper.java (Mapper from WordCount)
SumReducer.java (Reducer from WordCount)
Exercise directory: ~/workspace/combiner

In this exercise, you will add a Combiner to the WordCount program to reduce the
amount of intermediate data sent from the Mapper to the Reducer.
Because summing is associative and commutative, the same class can be used for both
the Reducer and the Combiner.
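
For reference, the only driver change required (step 2 below) is a single extra call
on the Job object; this sketch assumes the Job variable in WordCountDriver.java is
named job, as in the WordCount example:

job.setMapperClass(WordMapper.class);
job.setCombinerClass(SumReducer.class);   // run SumReducer as a combiner on map output
job.setReducerClass(SumReducer.class);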

Implement a Combiner
1. Copy WordMapper.java and SumReducer.java from the wordcount project
to the combiner project.

2. Modify the WordCountDriver.java code to add a Combiner for the WordCount


program.

3. Assemble and test your solution. (The output should remain identical to the
WordCount application without a combiner.)

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 26
27

Hands-On Exercise: Testing with LocalJobRunner


Files and Directories Used in this Exercise
Eclipse project: toolrunner
Test data (local):
~/training_materials/developer/data/shakespeare
Exercise directory: ~/workspace/toolrunner

In this Hands-On Exercise, you will practice running a job locally for debugging
and testing purposes.
In the “Using ToolRunner and Passing Parameters” exercise, you modified the Average
Word Length program to use ToolRunner. This makes it simple to set job configuration
properties on the command line.

Run the Average Word Length program using LocalJobRunner on


the command line
1. Run the Average Word Length program again. Specify -jt=local to run the job
locally instead of submitting to the cluster, and -fs=file:/// to use the local file
system instead of HDFS. Your input and output files should refer to local files rather
than HDFS files.
Note: If you successfully completed the ToolRunner exercise, you may use your
version in the toolrunner stubs or hints package; otherwise use the version
in the solution package as shown below.

$ hadoop jar toolrunner.jar solution.AvgWordLength \
-fs=file:/// -jt=local \
~/training_materials/developer/data/shakespeare \
localout

2. Review the job output in the local output folder you specified.

Optional: Run the Average Word Length program using


LocalJobRunner in Eclipse
1. In Eclipse, locate the toolrunner project in the Package Explorer. Open the
solution package (or the stubs or hints package if you completed the
ToolRunner exercise).

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 27
28

2. Right click on the driver class (AvgWordLength) and select Run As > Run
Configurations…

3. Ensure that Java Application is selected in the run types listed in the left pane.

4. In the Run Configuration dialog, click the New launch configuration button:

5. On the Main tab, confirm that the Project and Main class are set correctly for your
project, e.g.:

6. Select the Arguments tab and enter the input and output folders. (These are
local, not HDFS, folders, and are relative to the run configuration’s working
folder, which by default is the project folder in the Eclipse workspace: e.g.
~/workspace/toolrunner.)

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 28
29

7. Click the Run button. The program will run locally with the output displayed in the
Eclipse console window.

8. Review the job output in the local output folder you specified.

Note: You can re-run any previous configurations using the Run or Debug history
buttons on the Eclipse tool bar.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 29
30

Optional Hands-On Exercise: Logging


Files and Directories Used in this Exercise
Eclipse project: logging
Java files:
AverageReducer.java (Reducer from ToolRunner)
LetterMapper.java (Mapper from ToolRunner)
AvgWordLength.java (driver from ToolRunner)
Test data (HDFS):
shakespeare
Exercise directory: ~/workspace/logging

In this Hands-On Exercise, you will practice using log4j with MapReduce.
Modify the Average Word Length program you built in the Using ToolRunner and
Passing Parameters exercise so that the Mapper logs a debug message indicating
whether it is comparing with or without case sensitivity.

Enable Mapper Logging for the Job


1. Before adding additional logging messages, try re-running the toolrunner exercise
solution with Mapper debug logging enabled by adding
-Dmapred.map.child.log.level=DEBUG
to the command line. E.g.

$ hadoop jar toolrunner.jar solution.AvgWordLength \
-Dmapred.map.child.log.level=DEBUG shakespeare outdir

2. Take note of the Job ID in the terminal window or by using the mapred job -list
command.

3. When the job is complete, view the logs. In a browser on your VM, visit the Job
Tracker UI: http://localhost:50030/jobtracker.jsp. Find the job you
just ran in the Completed Jobs list and click its Job ID. E.g.:

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 30
31

4. In the task summary, click map to view the map tasks.

5. In the list of tasks, click on the map task to view the details of that task.

6. Under Task Logs, click “All”. The logs should include both INFO and DEBUG
messages. E.g.:

Add Debug Logging Output to the Mapper


7. Copy the code from the toolrunner project to the logging project stubs
package. (You may either use your solution from the ToolRunner exercise, or the
code in the solution package.)

8. Use log4j to output a debug log message indicating whether the Mapper is doing
case sensitive or insensitive mapping.
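
A minimal sketch of steps 7 and 8 is shown below; it assumes the log4j Logger class
(org.apache.log4j.Logger) available in the course VM, and that your Mapper already
reads the caseSensitive parameter in setup():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Logger LOGGER = Logger.getLogger(LetterMapper.class);
  private boolean caseSensitive = false;

  @Override
  protected void setup(Context context) {
    caseSensitive = context.getConfiguration().getBoolean("caseSensitive", false);
    // Only written to the task log when the map task log level is DEBUG or lower.
    LOGGER.debug("LetterMapper initialized with caseSensitive=" + caseSensitive);
  }

  // The map() method is unchanged from your ToolRunner solution.
}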

Build and Test Your Code


9. Following the earlier steps, test your code with Mapper debug logging enabled.
View the map task logs in the Job Tracker UI to confirm that your message is
included in the log. (Hint: search for LetterMapper in the page to find your
message.)

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 31
32

10. Optional: Try running with the map logging level set to INFO (the default) or
WARN instead of DEBUG, and compare the log output.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 32
33

Hands-On Exercise: Using Counters and a Map-Only


Job
Files and Directories Used in this Exercise
Eclipse project: counters
Java files:
ImageCounter.java (driver)
ImageCounterMapper.java (Mapper)
Test data (HDFS):
weblog (full web server access log)
testlog (partial data set for testing)
Exercise directory: ~/workspace/counters

In this exercise you will create a Map-only MapReduce job.


Your application will process a web server’s access log to count the number of times
gifs, jpegs, and other resources have been retrieved. Your job will report three figures:
number of gif requests, number of jpeg requests, and number of other requests.

Hints
1. You should use a Map-only MapReduce job, by setting the number of Reducers to 0
in the driver code.

2. For input data, use the Web access log file that you uploaded to the HDFS /user/
training/weblog directory in the “Using HDFS” exercise.
Note: We suggest you test your code against the smaller version of the access log in
the /user/training/testlog directory before you run your code against the
full log in the /user/training/weblog directory.

3. Use a counter group such as ImageCounter, with names gif, jpeg and other.

4. In your driver code, retrieve the values of the counters after the job has completed
and report them using System.out.println. (A sketch follows these hints.)

5. The output folder on HDFS will contain Mapper output files which are empty,
because the Mappers did not write any data.
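
The sketch below illustrates the approach described in the hints above. The group
and counter names follow hint 3; the string matching on the request line is
deliberately simple, and your own parsing may be stricter:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageCounterMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().toLowerCase();
    if (line.contains(".gif")) {
      context.getCounter("ImageCounter", "gif").increment(1);
    } else if (line.contains(".jpg") || line.contains(".jpeg")) {
      context.getCounter("ImageCounter", "jpeg").increment(1);
    } else {
      context.getCounter("ImageCounter", "other").increment(1);
    }
    // No context.write() call: this job only updates counters.
  }
}

// In the driver: call job.setNumReduceTasks(0) for a Map-only job, and after
// job.waitForCompletion(true) returns, read each counter back, for example:
//   long gifs = job.getCounters().findCounter("ImageCounter", "gif").getValue();
//   System.out.println("gif requests: " + gifs);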

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 33
34

Hands-On Exercise: Writing a Partitioner


Files and Directories Used in this Exercise
Eclipse project: partitioner
Java files:
MonthPartitioner.java (Partitioner)
ProcessLogs.java (driver)
CountReducer.java (Reducer)
LogMonthMapper.java (Mapper)
Test data (HDFS):
weblog (full web server access log)
testlog (partial data set for testing)
Exercise directory: ~/workspace/partitioner

In this Exercise, you will write a MapReduce job with multiple Reducers, and
create a Partitioner to determine which Reducer each piece of Mapper output is
sent to.

The Problem
In the “More Practice With MapReduce Java Programs” exercise you did previously,
you built the code in the log_file_analysis project. That program counted
the number of hits for each different IP address in a web log file. The final output was a
file containing a list of IP addresses, and the number of hits from that address.
This time, we want to perform a similar task, but we want the final output to consist of
12 files, one for each month of the year: January, February, and so on. Each file will
contain a list of IP addresses, and the number of hits from each address in that month.
We will accomplish this by having 12 Reducers, each of which is responsible for
processing the data for a particular month. Reducer 0 processes January hits, Reducer 1
processes February hits, and so on.
Note: we are actually breaking the standard MapReduce paradigm here, which says that
all the values from a particular key will go to the same Reducer. In this example, which
is a very common pattern when analyzing log files, values from the same key (the IP
address) will go to multiple Reducers, based on the month portion of the line.

Write the Mapper


1. Starting with the LogMonthMapper.java stub file, write a Mapper that maps a
log file output line to an IP/month pair. The map method will be similar to that in
the LogFileMapper class in the log_file_analysis project, so you may wish
to start by copying that code.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 34
35

2. The Mapper should emit a Text key (the IP address) and Text value (the month).
E.g.:

Input: 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
Output key: 96.7.4.14
Output value: Apr

Hint: in the Mapper, you may use a regular expression to parse the log file data if you
are familiar with regex processing. Otherwise, we suggest following the tips in the
hints code, or just copying the code from the solution package.
Remember that the log file may contain unexpected data – that is, lines that do not
conform to the expected format. Be sure that your code copes with such lines.

Write the Partitioner


3. Modify the MonthPartitioner.java stub file to create a Partitioner that sends
the (key, value) pair to the correct Reducer based on the month. Remember that
the Partitioner receives both the key and value, so you can inspect the value to
determine which Reducer to choose.
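
A minimal sketch of such a Partitioner is shown below. It assumes the Mapper emits
the month in the three-letter form shown above (e.g. Apr), and it sends any
unexpected value to partition 0 as a simple fallback:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MonthPartitioner extends Partitioner<Text, Text> {

  private static final List<String> MONTHS = Arrays.asList(
      "Jan", "Feb", "Mar", "Apr", "May", "Jun",
      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    // The value is the month emitted by the Mapper; Reducer 0 handles January,
    // Reducer 1 handles February, and so on (the driver sets 12 Reducers).
    int month = MONTHS.indexOf(value.toString());
    return (month >= 0) ? month : 0;   // fall back to partition 0 for unexpected values
  }
}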

Modify the Driver


4. Modify your driver code to specify that you want 12 Reducers.

5. Configure your job to use your custom Partitioner.

Test your Solution


6. Build and test your code. Your output directory should contain 12 files named
part-r-000xx. Each file should contain IP address and number of hits for
month xx.

Hints

• Write unit tests for your Partitioner!

• You may wish to test your code against the smaller version of the access log in the
/user/training/testlog directory before you run your code against the full
log in the /user/training/weblog directory. However, note that the test data
may not include all months, so some result files will be empty.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 35
36

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 36
37

Hands-On Exercise: Implementing a Custom


WritableComparable
Files and Directories Used in this Exercise
Eclipse project: writables
Java files:
StringPairWritable – implements a WritableComparable type
StringPairMapper – Mapper for test job
StringPairTestDriver – Driver for test job
Data file:
~/training_materials/developer/data/nameyeartestdata (small
set of data for the test job)
Exercise directory: ~/workspace/writables

In this exercise, you will create a custom WritableComparable type that holds two
strings.
Test the new type by creating a simple program that reads a list of names (first and last)
and counts the number of occurrences of each name.
The mapper should accept lines in the form:
lastname firstname other data
The goal is to count the number of times a lastname/firstname pair occurs within the
dataset. For example, for input:
Smith Joe 1963-08-12 Poughkeepsie, NY
Smith Joe 1832-01-20 Sacramento, CA
Murphy Alice 2004-06-02 Berlin, MA
We want to output:

(Smith,Joe) 2
(Murphy,Alice) 1

Note: You will use your custom WritableComparable type in a


future exercise, so make sure it is working with the test job now.

StringPairWritable
You need to implement a WritableComparable object that holds the two strings. The
stub provides an empty constructor for serialization, a standard constructor that will
be given two strings, a toString method, and the generated hashCode and equals


methods. You will need to implement the readFields, write, and compareTo
methods required by WritableComparables.
Note that Eclipse automatically generated the hashCode and equals methods in the
stub file. You can generate these two methods in Eclipse by right-clicking in the source
code and choosing ‘Source’ > ‘Generate hashCode() and equals()’.
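
A minimal sketch of the three methods you need to write is shown below. The field
names are illustrative, and the generated hashCode and equals methods already
present in the stub are omitted:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StringPairWritable implements WritableComparable<StringPairWritable> {

  private String left;
  private String right;

  public StringPairWritable() {}                    // required for serialization

  public StringPairWritable(String left, String right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(left);                             // serialize both strings
    out.writeUTF(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    left = in.readUTF();                            // read them back in the same order
    right = in.readUTF();
  }

  @Override
  public int compareTo(StringPairWritable other) {
    int result = left.compareTo(other.left);        // compare on the first string,
    return (result != 0) ? result : right.compareTo(other.right);  // then the second
  }

  @Override
  public String toString() {
    return "(" + left + "," + right + ")";
  }

  // hashCode() and equals() as generated in the stub go here.
}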

Name Count Test Job


The test job requires a Reducer that sums the number of occurrences of each key.
This is the same function that SumReducer performed previously in WordCount,
except that SumReducer expects Text keys, whereas the reducer for this job will get
StringPairWritable keys. You may either re-write SumReducer to accommodate other
types of keys, or you can use the LongSumReducer Hadoop library class, which does
exactly the same thing.
You can use the simple test data in ~/training_materials/developer/data/
nameyeartestdata to make sure your new type works as expected.
You may test your code using local job runner or submitting a Hadoop job to the
(pseudo-)cluster as usual. If you submit the job to the cluster, note that you will need to
copy your test data to HDFS first.

This is the end of the exercise.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 38
39

Hands-On Exercise: Using SequenceFiles and File


Compression
Files and Directories Used in this Exercise
Eclipse project: createsequencefile
Java files:
CreateSequenceFile.java (a driver that converts a text file to a sequence
file)
ReadCompressedSequenceFile.java (a driver that converts a compressed
sequence file to text)
Test data (HDFS):
weblog (full web server access log)
Exercise directory: ~/workspace/createsequencefile

In this exercise you will practice reading and writing uncompressed and
compressed SequenceFiles.
First, you will develop a MapReduce application to convert text data to a SequenceFile.
Then you will modify the application to compress the SequenceFile using Snappy file
compression.
When creating the SequenceFile, use the full access log file for input data. (You
uploaded the access log file to the HDFS /user/training/weblog directory when
you performed the “Using HDFS” exercise.)
After you have created the compressed SequenceFile, you will write a second
MapReduce application to read the compressed SequenceFile and write a text file that
contains the original log file text.

Write a MapReduce program to create sequence files from text


files
1. Determine the number of HDFS blocks occupied by the access log file:

a. In a browser window, start the Name Node Web UI. The URL is http://
localhost:50070.

b. Click “Browse the filesystem.”

c. Navigate to the /user/training/weblog/access_log file.

d. Scroll down to the bottom of the page. The total number of blocks occupied by
the access log file appears in the browser window.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 39
40

2. Complete the stub file in the createsequencefile project to read the access log
file and create a SequenceFile. Records emitted to the SequenceFile can have any
key you like, but the values should match the text in the access log file. (Hint: you
can use a Map-only job with the default Mapper, which simply emits the data passed
to it.)
Note: If you specify an output key type other than LongWritable, you must
call job.setOutputKeyClass – not job.setMapOutputKeyClass.
If you specify an output value type other than Text, you must call
job.setOutputValueClass – not job.setMapOutputValueClass.

3. Build and test your solution so far. Use the access log as input data, and specify the
uncompressedsf directory for output.
Note: The CreateUncompressedSequenceFile.java file in the solution
package contains the solution for the preceding part of the exercise.

4. Examine the initial portion of the output SequenceFile using the following
command:

$ hadoop fs -cat uncompressedsf/part-m-00000 | less

Some of the data in the SequenceFile is unreadable, but parts of the SequenceFile
should be recognizable:

• The string SEQ, which appears at the beginning of a SequenceFile

• The Java classes for the keys and values

• Text from the access log file


5. Verify that the number of files created by the job matches the number of HDFS
blocks occupied by the access log file, which you determined in step 1 (each map
task reads one block and writes one output file).

Compress the Output


6. Modify your MapReduce job to compress the output SequenceFile. Add statements
to your driver to configure the output as follows (a sketch of these driver
statements appears at the end of this section):

• Compress the output file.

• Use block compression.

• Use the Snappy compression codec.

© Copyright 2010–2017 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera. 40
41

7. Compile the code and run your modified MapReduce job. For the MapReduce
output, specify the compressedsf directory.
Note: The CreateCompressedSequenceFile.java file in the solution
package contains the solution for the preceding part of the exercise.

8. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:

• The compressed SequenceFile specifies the org.apache.hadoop.io.compress.SnappyCodec compression codec in its header.

• You cannot read the log file text in the compressed file.

9. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressedsf directories. The compressed SequenceFiles should be smaller.

Write another MapReduce program to uncompress the files


10. Starting with the provided stub file, write a second MapReduce program to read the
compressed log file and write a text file. This text file should have the same text data
as the log file, plus keys. The keys can contain any values you like.
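
A minimal driver sketch for this direction, assuming the new-API SequenceFileInputFormat; the class name is illustrative and the provided stub may differ:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadSequenceFileSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Read SequenceFile");
    job.setJarByClass(ReadSequenceFileSketch.class);

    // Read SequenceFile records; decompression is automatic because the codec
    // is recorded in the file header.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Map-only job with the default Mapper and the default TextOutputFormat,
    // which writes key <TAB> value lines of plain text.
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}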

11. Compile the code and run your MapReduce job.


For the MapReduce input, specify the compressedsf directory in which you
created the compressed SequenceFile in the previous section.
For the MapReduce output, specify the compressedsftotext directory.
Note: The ReadCompressedSequenceFile.java file in the solution
package contains the solution for the preceding part of the exercise.

12. Examine the first portion of the output in the compressedsftotext directory.
You should be able to read the textual log file entries.

Optional: Use command line options to control compression


13. If you used ToolRunner for your driver, you can control compression using command line arguments. Try commenting out the code in your driver where you call setCompressOutput (or use the solution.CreateUncompressedSequenceFile program). Then test setting the mapred.output.compressed option on the command line, e.g.:

$ hadoop jar sequence.jar \
    solution.CreateUncompressedSequenceFile \
    -Dmapred.output.compressed=true \
    weblog outdir
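
If you want to see the shape of such a driver, here is a minimal ToolRunner sketch (the class name is illustrative); the key point is that the Job is built from getConf(), so -D options parsed by ToolRunner reach the job configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CreateSequenceFileTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D properties supplied on the command
    // line, such as -Dmapred.output.compressed=true.
    Job job = Job.getInstance(getConf(), "Create SequenceFile");
    job.setJarByClass(CreateSequenceFileTool.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setNumReduceTasks(0);
    // No setCompressOutput() call here: compression is left entirely to the
    // command-line properties.
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new CreateSequenceFileTool(), args));
  }
}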

14. Review the output to confirm the files are compressed.

This is the end of the exercise.


Hands-On Exercise: Creating an Inverted Index


Files and Directories Used in this Exercise
Eclipse project: inverted_index
Java files:
IndexMapper.java (Mapper)
IndexReducer.java (Reducer)
InvertedIndex.java (Driver)
Data files:
~/training_materials/developer/data/invertedIndexInput.tgz
Exercise directory: ~/workspace/inverted_index

In this exercise, you will write a MapReduce job that produces an inverted index.
For this lab you will use an alternate input, provided in the file
invertedIndexInput.tgz. When decompressed, this archive contains a directory
of files; each is a Shakespeare play formatted as follows:

0 HAMLET
1
2
3 DRAMATIS PERSONAE
4
5
6 CLAUDIUS king of Denmark. (KING CLAUDIUS:)
7
8 HAMLET son to the late, and nephew to the present
king.
9
10 POLONIUS lord chamberlain. (LORD POLONIUS:)
...

Each line contains:

key: the line number
separator: a tab character
value: the line of text
This format can be read directly using the KeyValueTextInputFormat class
provided in the Hadoop API. This input format presents each line as one record to your
Mapper, with the part before the tab character as the key, and the part after the tab as
the value.
Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should have a list of all the locations where the word appears. For example, for the word ‘honeysuckle’ your output should look like this:

honeysuckle 2kinghenryiv@1038,midsummernightsdream@2175,...

The index should contain such an entry for every word in the text.

Prepare the Input Data


1. Extract the invertedIndexInput directory and upload to HDFS:

$ cd ~/training_materials/developer/data
$ tar zxvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput

Define the MapReduce Solution


Remember that for this program you use a special input format to suit the form of your
data, so your driver class will include a line like:

job.setInputFormatClass(KeyValueTextInputFormat.class);

Don’t forget to import this class (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat).

Retrieving the File Name


Note that the exercise requires you to retrieve the file name, since that is the name of the play. The Context object can be used to retrieve the name of the file like this:

// FileSplit is org.apache.hadoop.mapreduce.lib.input.FileSplit in the new API;
// Path is org.apache.hadoop.fs.Path.
FileSplit fileSplit = (FileSplit) context.getInputSplit();
Path path = fileSplit.getPath();
String fileName = path.getName();

Build and Test Your Solution


Test against the invertedIndexInput data you loaded above.


Hints
You may like to complete this exercise without reading any further, or you may find the
following hints about the algorithm helpful.

The Mapper
Your Mapper should take as input a key (the line number) and a line of words, and should emit, for each word, an intermediate record with the word as the key and the location (file name and line number) as the value.
For example, the line of input from the file ‘hamlet’:

282 Have heaven and earth together

produces intermediate output:

Have hamlet@282
heaven hamlet@282
and hamlet@282
earth hamlet@282
together hamlet@282
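
A sketch of such a Mapper follows; it is illustrative only (the provided IndexMapper stub may differ), and the word-splitting shown is a simplification:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IndexMapperSketch extends Mapper<Text, Text, Text, Text> {
  @Override
  public void map(Text lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    // The play name is the input file name (see "Retrieving the File Name").
    FileSplit split = (FileSplit) context.getInputSplit();
    String location = split.getPath().getName() + "@" + lineNumber.toString();

    // Emit (word, play@lineNumber) for every word on the line.
    for (String word : line.toString().split("\\W+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new Text(location));
      }
    }
  }
}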

The Reducer
Your Reducer simply aggregates the values presented to it for the same key into one value. Use a separator such as ‘,’ between the listed values.
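
A sketch of such a Reducer, illustrative only (the provided IndexReducer stub may differ):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexReducerSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text word, Iterable<Text> locations, Context context)
      throws IOException, InterruptedException {
    // Join all locations for this word into one comma-separated index entry.
    StringBuilder indexEntry = new StringBuilder();
    String separator = "";
    for (Text location : locations) {
      indexEntry.append(separator).append(location.toString());
      separator = ",";
    }
    context.write(word, new Text(indexEntry.toString()));
  }
}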

This is the end of the exercise.


Hands-On Exercise: Calculating Word Co-Occurrence


Files and Directories Used in this Exercise
Eclipse project: word_co-occurrence
Java files:
WordCoMapper.java (Mapper)
SumReducer.java (Reducer from WordCount)
WordCo.java (Driver)
Test directory (HDFS):
shakespeare
Exercise directory: ~/workspace/word_co-occurence

In this exercise, you will write an application that counts the number of times
words appear next to each other.
Test your application using the files in the shakespeare folder you previously copied
into HDFS in the “Using HDFS” exercise.
Note that this implementation is a specialization of Word Co-Occurrence as we describe
it in the notes; in this case we are only interested in pairs of words which appear
directly next to each other.

1. Change directories to the word_co-occurrence directory within the exercises directory.

2. Complete the Driver and Mapper stub files; you can use the standard SumReducer
from the WordCount project as your Reducer. Your Mapper’s intermediate output
should be in the form of a Text object as the key, and an IntWritable as the value;
the key will be word1,word2, and the value will be 1.
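
For reference, a minimal sketch of such a Mapper; it is illustrative only (the provided WordCoMapper stub may differ), and the tokenization is a simplification:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCoMapperSketch
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line into words and emit each adjacent pair with a count of 1.
    String[] words = value.toString().toLowerCase().split("\\W+");
    for (int i = 0; i < words.length - 1; i++) {
      if (!words[i].isEmpty() && !words[i + 1].isEmpty()) {
        context.write(new Text(words[i] + "," + words[i + 1]), ONE);
      }
    }
  }
}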

Extra Credit
If you have extra time, please complete these additional challenges:
Challenge 1: Use the StringPairWritable key type from the “Implementing
a Custom WritableComparable” exercise. If you completed the exercise (in the
writables project) copy that code to the current project. Otherwise copy the class
from the writables solution package.
Challenge 2: Write a second MapReduce job to sort the output from the first job so
that the list of pairs of words appears in ascending frequency.
Challenge 3: Sort by descending frequency instead (so that the most frequently occurring word pairs are first in the output). Hint: you’ll need to extend org.apache.hadoop.io.LongWritable.Comparator. A sketch follows below.
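
If you attempt Challenge 3, one possible shape is sketched below (illustrative only); such a comparator would be registered with job.setSortComparatorClass in the second job’s driver:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;

public class DescendingLongComparator extends LongWritable.Comparator {
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Reverse the byte-level comparison to sort in descending order.
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }

  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    // Reverse the object-level comparison as well.
    return -super.compare(a, b);
  }
}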


This is the end of the exercise.


Hands-On Exercise: Importing Data With Sqoop


In this exercise you will import data from a relational database using Sqoop. The data you load here will be used in subsequent exercises.
Consider the MySQL database movielens, derived from the MovieLens project at the University of Minnesota. (See the note at the end of this exercise.) The database consists
of several related tables, but we will import only two of these: movie, which contains
about 3,900 movies; and movierating, which has about 1,000,000 ratings of those
movies.

Review the Database Tables


First, review the database tables to be loaded into Hadoop.

1. Log on to MySQL:

$ mysql --user=training --password=training movielens

2. Review the structure and contents of the movie table:

mysql> DESCRIBE movie;


. . .
mysql> SELECT * FROM movie LIMIT 5;

3. Note the column names for the table:


____________________________________________________________________________________________

4. Review the structure and contents of the movierating table:

mysql> DESCRIBE movierating;



mysql> SELECT * FROM movierating LIMIT 5;

5. Note these column names:


____________________________________________________________________________________________

6. Exit mysql:

mysql> quit


Import with Sqoop


You invoke Sqoop on the command line to perform several kinds of operations. With it you can connect to your database server to list the databases (schemas) to which you have access, and list the tables available for loading. For database access, you provide a connect string to identify the server and, if required, your username and password.

1. Show the commands available in Sqoop:

$ sqoop help

2. List the databases (schemas) in your database server:

$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training

(Note: Instead of entering --password training on your command line, you


may prefer to enter -P, and let Sqoop prompt you for the password, which is then
not visible when you type it.)

3. List the tables in the movielens database:

$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training

4. Import the movie table into Hadoop:

$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--username training --password training \
--fields-terminated-by '\t' --table movie

5. Verify that the command has worked.

$ hadoop fs -ls movie


$ hadoop fs -tail movie/part-m-00000

6. Import the movierating table into Hadoop.


Repeat the last two steps, but for the movierating table.


This is the end of the exercise.

Note:
This exercise uses the MovieLens data set, or subsets thereof. This
data is freely available for academic purposes, and is used and
distributed by Cloudera with the express permission of the UMN
GroupLens Research Group. If you would like to use this data for
your own research purposes, you are free to do so, as long as you
cite the GroupLens Research Group in any resulting publications. If
you would like to use this data for commercial purposes, you must
obtain explicit permission. You may find the full dataset, as well as
detailed license terms, at http://www.grouplens.org/node/73


Hands-On Exercise: Manipulating Data With Hive


Files and Directories Used in this Exercise
Test data (HDFS):
movie
movierating
Exercise directory: ~/workspace/hive

In this exercise, you will practice data processing in Hadoop using Hive.
The data sets for this exercise are the movie and movierating data imported from
MySQL into Hadoop in the “Importing Data with Sqoop” exercise.

Review the Data


1. Make sure you’ve completed the “Importing Data with Sqoop” exercise. Review the
data you already loaded into HDFS in that exercise:

$ hadoop fs -cat movie/part-m-00000 | head



$ hadoop fs -cat movierating/part-m-00000 | head

Prepare The Data For Hive


For Hive data sets, you create tables, which attach field names and data types to your
Hadoop data for subsequent queries. You can create external tables on the movie and
movierating data sets, without having to move the data at all.
Prepare the Hive tables for this exercise by performing the following steps:

2. Invoke the Hive shell:

$ hive

3. Create the movie table:

hive> CREATE EXTERNAL TABLE movie
      (id INT, name STRING, year INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/user/training/movie';


4. Create the movierating table:

hive> CREATE EXTERNAL TABLE movierating
      (userid INT, movieid INT, rating INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/user/training/movierating';

5. Quit the Hive shell:

hive> QUIT;

Practicing HiveQL
If you are familiar with SQL, most of what you already know is applicable to HiveQL. Skip ahead to the section called “The Questions” later in this exercise, and see if you can solve the problems based on your knowledge of SQL.
If you are unfamiliar with SQL, follow the steps below to learn how to use HiveQL to solve problems.

1. Start the Hive shell.

2. Show the list of tables in Hive:

hive> SHOW TABLES;

The list should include the tables you created in the previous steps.

Note: By convention, SQL (and similarly HiveQL) keywords are


shown in upper case. However, HiveQL is not case sensitive, and
you may type the commands in any case you wish.

3. View the metadata for the two tables you created previously:

hive> DESCRIBE movie;


hive> DESCRIBE movierating;

Hint: You can use the up and down arrow keys to see and edit your command
history in the hive shell, just as you can in the Linux command shell.


4. The SELECT * FROM TABLENAME command allows you to query data from a
table. Although it is very easy to select all the rows in a table, Hadoop generally
deals with very large tables, so it is best to limit how many you select. Use LIMIT to
view only the first N rows:

hive> SELECT * FROM movie LIMIT 10;

5. Use the WHERE clause to select only rows that match certain criteria. For example,
select movies released before 1930:

hive> SELECT * FROM movie WHERE year < 1930;

6. The results include movies whose year field is 0, meaning that the year is unknown
or unavailable. Exclude those movies from the results:

hive> SELECT * FROM movie WHERE year < 1930
      AND year != 0;

7. The results now correctly include movies before 1930, but the list is unordered.
Order them alphabetically by title:

hive> SELECT * FROM movie WHERE year < 1930
      AND year != 0 ORDER BY name;

8. Now let’s move on to the movierating table. List all the ratings by a particular user,
e.g.

hive> SELECT * FROM movierating WHERE userid=149;

9. SELECT * shows all the columns, but as we’ve already selected by userid,
display the other columns but not that one:

hive> SELECT movieid,rating FROM movierating
      WHERE userid=149;

10. Use a JOIN to display data from both tables. For example, include the name of the movie (from the movie table) in the list of a user’s ratings:

hive> SELECT movieid,rating,name FROM movierating JOIN
      movie ON movierating.movieid=movie.id WHERE userid=149;


11. How tough a rater is user 149? Find out by calculating the average rating she gave
to all movies using the AVG function:

hive> SELECT AVG(rating) FROM movierating WHERE userid=149;

12. List each user who rated movies, the number of movies they’ve rated, and their
average rating.

hive> SELECT userid, COUNT(userid), AVG(rating)
      FROM movierating GROUP BY userid;

13. Take that same data, and copy it into a new table called userrating.

hive> CREATE TABLE userrating (userid INT,
      numratings INT, avgrating FLOAT);
hive> INSERT OVERWRITE TABLE userrating
      SELECT userid, COUNT(userid), AVG(rating)
      FROM movierating GROUP BY userid;

Now that you’ve explored HiveQL, you should be able to answer the questions
below.

The Questions
Now that the data is imported and suitably prepared, write a HiveQL command to
implement each of the following queries.

Working Interactively or In Batch


Hive:
You can enter Hive commands interactively in the Hive shell:
$ hive
. . .
hive> Enter interactive commands here
Or you can execute text files containing Hive commands with:
$ hive -f file_to_execute

1. What is the oldest known movie in the database? Note that movies with unknown
years have a value of 0 in the year field; these do not belong in your answer.

2. List the name and year of all unrated movies (movies where the movie data has no
related movierating data).


3. Produce an updated copy of the movie data with two new fields:

numratings - the number of ratings for the movie
avgrating - the average rating for the movie

Unrated movies are not needed in this copy.

4. What are the 10 highest-rated movies? (Notice that your work in step 3 makes this
question easy to answer.)

Note: The solutions for this exercise are in ~/workspace/hive.

This is the end of the exercise.


Hands-On Exercise: Running an Oozie Workflow


Files and Directories Used in this Exercise
Exercise directory: ~/workspace/oozie_labs
Oozie job folders:
lab1-java-mapreduce
lab2-sort-wordcount

In this exercise, you will inspect and run Oozie workflows.

1. Start the Oozie server

$ sudo /etc/init.d/oozie start

2. Change directories to the exercise directory:

$ cd ~/workspace/oozie-labs

3. Inspect the contents of the job.properties and workflow.xml files in the lab1-java-mapreduce/job folder. You will see that this is the standard WordCount job.
In the job.properties file, take note of the job’s base directory (lab1-java-mapreduce), and the input and output directories relative to that. (These are HDFS directories.)

4. We have provided a simple shell script to submit the Oozie workflow. Inspect the
run.sh script and then run:

$ ./run.sh lab1-java-mapreduce

Notice that Oozie returns a job identification number.

5. Inspect the progress of the job:

$ oozie job -oozie http://localhost:11000/oozie \
    -info job_id

6. When the job has completed, review the job output directory in HDFS to confirm
that the output has been produced as expected.

7. Repeat the above procedure for lab2-sort-wordcount. Notice when you inspect workflow.xml that this workflow includes two MapReduce jobs which run one after the other, in which the output of the first is the input for the second. When you inspect the output in HDFS you will see that the second job sorts the output of the first job into descending numerical order.

This is the end of the exercise.


Bonus Exercises
The exercises in this section are provided as a way to explore topics in further depth than they were covered in class. You may work on these exercises at your convenience: during class if you have extra time, or after the course is over.


Bonus Exercise: Exploring a Secondary Sort Example


Files and Directories Used in this Exercise
Eclipse project: secondarysort
Data files:
~/training_materials/developer/data/nameyeartestdata
Exercise directory: ~/workspace/secondarysort

In this exercise, you will run a MapReduce job in different ways to see the effects
of various components in a secondary sort program.
The program accepts lines in the form
lastname firstname birthdate
The goal is to identify the youngest person with each last name. For example, for input:
Murphy Joanne 1963-08-12
Murphy Douglas 1832-01-20
Murphy Alice 2004-06-02
We want to write out:
Murphy Alice 2004-06-02
All the code is provided to do this. Following the steps below, you will progressively add each component to the job to accomplish the final goal.

Build the Program


1. In Eclipse, review but do not modify the code in the secondarysort project
example package.

2. In particular, note the NameYearDriver class, in which the code to set the
partitioner, sort comparator and group comparator for the job is commented out.
This allows us to set those values on the command line instead.

3. Export the jar file for the program as secsort.jar.

4. A small test data file called nameyeartestdata has been provided for you, located in the secondarysort project folder. Copy the data file to HDFS if you did not already do so in the Writables exercise.


Run as a Map-only Job


5. The Mapper for this job constructs a composite key using the
StringPairWritable type. See the output of just the mapper by running this
program as a Map-only job:

$ hadoop jar secsort.jar example.NameYearDriver \
    -Dmapred.reduce.tasks=0 nameyeartestdata secsortout

6. Review the output. Note the key is a string pair of last name and birth year.

Run using the default Partitioner and Comparators


7. Re-run the job, setting the number of reduce tasks to 2 instead of 0.

8. Note that the output now consists of two files; one each for the two reduce tasks.
Within each file, the output is sorted by last name (ascending) and year (ascending).
But it isn’t sorted between files, and records with the same last name may be in
different files (meaning they went to different reducers).

Run using the custom partitioner


9. Review the code of the custom partitioner class: NameYearPartitioner.
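
The idea behind such a partitioner is sketched below. This is illustrative only; the provided NameYearPartitioner may differ, and the sketch assumes that the map output value type is Text and that StringPairWritable exposes the name via a getter such as getLeft() - both assumptions:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class NamePartitionerSketch extends Partitioner<StringPairWritable, Text> {
  @Override
  public int getPartition(StringPairWritable key, Text value, int numPartitions) {
    // Partition on the last name only, ignoring the year, so that all records
    // for a given last name go to the same reducer.
    // Assumption: getLeft() returns the name part of the composite key.
    return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}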

10. Re-run the job, adding a second parameter to set the partitioner class to use:
-Dmapreduce.partitioner.class=example.NameYearPartitioner

11. Review the output again, this time noting that all records with the same last name
have been partitioned to the same reducer.
However, they are still being sorted into the default sort order (name, year
ascending). We want it sorted by name ascending/year descending.

Run using the custom sort comparator


12. The NameYearComparator class compares Name/Year pairs, first comparing the names and, if they are equal, comparing the years in descending order (i.e. later years are considered “less than” earlier years, and thus come earlier in the sort order). Re-run the job using NameYearComparator as the sort comparator by adding a third parameter:


-D mapred.output.key.comparator.class=example.NameYearComparator

13. Review the output and note that each reducer’s output is now correctly partitioned
and sorted.

Run with the NameYearReducer


14. So far we’ve been running with the default reducer, the identity Reducer, which simply writes out each key/value pair it receives. The actual goal of this job is to emit the record for the youngest person with each last name. We can do this easily if all records for a given last name are passed to a single reduce call, sorted in descending order by year; the reducer can then simply emit the first value passed in each call.

15. Review the NameYearReducer code and note that it emits only the first value it receives for each key.
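
The idea is sketched below; this is illustrative only, and the key/value types shown are assumptions (the provided NameYearReducer may differ):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class YoungestPerNameReducerSketch
    extends Reducer<StringPairWritable, Text, StringPairWritable, Text> {
  @Override
  public void reduce(StringPairWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // The sort comparator delivers the youngest person's record first, so
    // emitting only the first value in each group is sufficient.
    for (Text value : values) {
      context.write(key, value);
      break;
    }
  }
}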

16. Re-run the job, using the reducer by adding a fourth parameter:
-Dmapreduce.reduce.class=example.NameYearReducer
Alas, the job still isn’t correct, because the data being passed to the reduce method
is being grouped according to the full key (name and year), so multiple records with
the same last name (but different years) are being output. We want it to be grouped
by name only.

Run with the custom group comparator


17. The NameComparator class compares two string pairs by comparing only
the name field and disregarding the year field. Pairs with the same name will
be grouped into the same reduce call, regardless of the year. Add the group
comparator to the job by adding a final parameter:
-Dmapred.output.value.groupfn.class=example.NameComparator
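
The idea behind such a group comparator is sketched below. This is illustrative only; the provided NameComparator may differ, and the sketch assumes StringPairWritable is a WritableComparable that exposes the name via getLeft(), which is an assumption:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class NameGroupComparatorSketch extends WritableComparator {
  protected NameGroupComparatorSketch() {
    super(StringPairWritable.class, true);  // true: instantiate keys for comparison
  }

  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    // Compare on the name only, so that every year for a given name is
    // grouped into the same reduce() call.
    // Assumption: getLeft() returns the name part of the composite key.
    return ((StringPairWritable) a).getLeft()
        .compareTo(((StringPairWritable) b).getLeft());
  }
}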

18. Note that the final output now correctly includes only a single record for each different last name, and that record is the youngest person with that last name.

This is the end of the exercise.
