
#117 Get More Refcardz! Visit refcardz.


Getting Started with Apache Hadoop

CONTENTS INCLUDE:
• Apache Hadoop
• Hadoop Quick Reference
• Hadoop Quick How-To
• Staying Current
• Hot Tips and more...

By Eugene Ciurana and Masoud Kalali


This Refcard presents a basic blueprint for applying

MapReduce to solving large-scale, unstructured data
processing problems by showing how to deploy and use an
Apache Hadoop computational cluster. It complements DZone
Refcardz #43 and #103, which provide introductions to high-
performance computational scalability and high-volume data
handling techniques, including MapReduce.

What Is MapReduce?
MapReduce refers to a framework that runs on a computational
cluster to mine large datasets. The name derives from the
application of map() and reduce() functions repurposed from
functional programming languages.

• “Map” applies to all the members of the dataset and
  returns a list of results
• “Reduce” collates and resolves the results from one or
  more mapping operations executed in parallel
• Very large datasets are split into large subsets called splits
• A parallelized operation performed on all splits yields
  the same results as if it were executed against the larger
  dataset before turning it into splits
• Implementations separate business logic from
  multi-processing logic
• MapReduce framework developers focus on process
  dispatching, locking, and logic flow
• App developers focus on implementing the business logic
  without worrying about infrastructure or scalability issues

Apache Hadoop
Apache Hadoop is an open source, Java framework for
implementing reliable and scalable computational networks.
Hadoop includes several subprojects:

• MapReduce
• Pig
• ZooKeeper
• HBase
• Hive
• Chukwa

This Refcard presents how to deploy and use the common
tools, MapReduce, and HDFS for application development
after a brief overview of all of Hadoop’s components.

Implementation patterns
The Map(k1, v1) -> list(k2, v2) function is applied to every
item in the split. It produces a list of (k2, v2) pairs for each call.
The framework groups all the results with the same key
together in a new split.
The Reduce(k2, list(v2)) -> list(v3) function is applied
to each intermediate results split to produce a collection
of values v3 in the same domain. This collection may have
zero or more values. The desired result consists of all the v3
collections, often aggregated into one result file.

Hot Tip: MapReduce frameworks produce lists of values.
Users familiar with functional programming mistakenly
expect a single result from the mapping operations.
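The patterns above can be sketched without any framework at all. The following plain-Java sketch (class and method names are illustrative, not Hadoop API) maps each line of a split to (word, 1) pairs, groups them by key as the framework's shuffle would, and reduces each group to a count:

```java
import java.util.*;

public class MiniMapReduce {
    // map(k1, v1) -> list(k2, v2): emit a (word, 1) pair for every word in one line.
    // The input key (the line number) is ignored, as in many real mappers.
    static List<Map.Entry<String, Integer>> map(int lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // reduce(k2, list(v2)) -> v3: collapse one key's values into a single count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Drives map over every item in the split, groups results by key
    // (the "shuffle"), then applies reduce to each intermediate group.
    static Map<String, Integer> run(List<String> split) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < split.size(); i++)
            for (Map.Entry<String, Integer> kv : map(i, split.get(i)))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

Note that reduce receives a list of values per key and the overall result is itself a list of (key, value) pairs, which is exactly the point of the Hot Tip above.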


Hot Tip: The Apache Hadoop site is the authoritative
reference for all things Hadoop.

Hadoop comprises tools and utilities for data serialization, file
system access, and interprocess communication pertaining
to MapReduce implementations. Single and clustered
configurations are possible. This configuration almost
always includes HDFS because it’s better optimized for high
throughput MapReduce I/O than general-purpose file systems.

Components
Figure 2 shows how the various Hadoop components relate to
one another:

• HDFS - a scalable, high-performance distributed file
  system. It stores its data blocks on top of the native file
  system. HDFS is designed for consistency; commits aren’t
  considered “complete” until data is written to at least two
  different configurable volumes. HDFS presents a single
  view of multiple physical disks or file systems.

• MapReduce - a Java-based job tracking, node
  management, and application container for mappers and
  reducers written in Java or in any scripting language that
  supports STDIN and STDOUT for job interaction.

Hot Tip: Hadoop also supports other file systems like
Amazon Simple Storage Service (S3), Kosmix’s CloudStore,
and IBM’s General Parallel File System. These may be
cheaper alternatives to hosting data in the local data center.

• Chukwa - a data collection system for monitoring, displaying,
  and analyzing logs from large distributed systems.

• Hive - structured data warehousing infrastructure that
  provides a mechanism for storage, data extraction,
  transformation, and loading (ETL), and a SQL-like
  language for querying and analysis.

• HBase - a column-oriented (NoSQL) database designed for
  real-time storage, retrieval, and search of very large tables
  (billions of rows/millions of columns) running atop HDFS.

• ZooKeeper - a distributed application management tool
  for configuration, event synchronization, naming, and
  group services used for managing the nodes in a Hadoop
  computational network.

Utilities
• Pig - a set of tools for programmatic flat-file data
  analysis that provides a programming language, data
  transformation, and parallelized processing.

• Sqoop - a tool for importing and exporting data stored in
  relational databases into Hadoop or Hive, and vice versa,
  using MapReduce tools and standard JDBC drivers.

Hot Tip: Sqoop is a product released by Cloudera, the most
influential Hadoop commercial vendor, under the
Apache 2.0 license. The source code and binary
packages are available at:

Hadoop Cluster Building Blocks
Hadoop clusters may be deployed in three basic configurations:

  Mode                Description                              Usage
  Local (default)     Multi-threading components, single JVM   Development, test, debug
  Pseudo-distributed  Multiple JVMs, single node               Development, test, debug
  Distributed         All components run in separate nodes     Staging, production

Figure 3 shows how the components are deployed for any of
these configurations.

Each node in a Hadoop installation runs one or more daemons
executing MapReduce code or HDFS commands. Each
daemon’s responsibilities in the cluster are:

• NameNode: manages HDFS and communicates with every
  DataNode daemon in the cluster
• JobTracker: dispatches jobs and assigns splits to
  mappers or reducers as each stage completes
• TaskTracker: executes tasks sent by the JobTracker and
  reports status
• DataNode: manages HDFS content in the node and
  updates status to the NameNode

These daemons execute in the three distinct processing
layers of a Hadoop cluster: master (Name Node), slaves (Data
Nodes), and user applications.

Name Node (Master)
• Manages the file system name space
• Keeps track of job execution
• Manages the cluster


• Replicates data blocks and keeps them evenly distributed
• Manages lists of files, list of blocks in each file, list of
  blocks per node, and file attributes and other meta-data
• Tracks HDFS file creation and deletion operations in an
  activity log

Depending on system load, the NameNode and JobTracker
daemons may run on separate computers.

Hot Tip: Although there can be two or more Name Nodes in
a cluster, Hadoop supports only one Name Node.
Secondary nodes, at the time of writing, only log
what happened in the primary. The Name Node is a
single point of failure that requires manual fail-over!

Data Nodes (Slaves)
• Store blocks of data in their local file system
• Store meta-data for each block
• Serve data and meta-data to the job they execute
• Send periodic status reports to the Name Node
• Send data blocks to other nodes required by the
  Name Node

Data nodes execute the DataNode and TaskTracker daemons
described earlier in this section.

User Applications
• Dispatch mappers and reducers to the Name Node for
  execution in the Hadoop cluster
• Execute implementation contracts for Java and for
  scripting-language mappers and reducers
• Provide application-specific execution parameters
• Set Hadoop runtime configuration parameters with
  semantics that apply to the Name or the Data nodes

A user application may be a stand-alone executable, a script, a
web application, or any combination of these. The application
is required to implement either the Java or the streaming APIs.

Hadoop Installation

Hot Tip: Cygwin is a requirement for any Windows systems
running Hadoop — install it before continuing if
you’re using this OS.

Required detailed instructions for this section are available at:

• Ensure that Java 6 and both ssh and sshd are running in
  all nodes
• Get the most recent, stable release from
• Decide on local, pseudo-distributed or distributed mode
• Install the Hadoop distribution on each server
• Set the HADOOP_HOME environment variable to the directory
  where the distribution is installed
• Add $HADOOP_HOME/bin to PATH

Follow the instructions for local, pseudo-clustered, or clustered
configuration from the Hadoop site. All the configuration
files are located in the directory $HADOOP_HOME/conf; the
minimum configuration requirements for each file are:

• hadoop-env.sh — environmental configuration,
  JVM configuration, logging, master and slave
  configuration files
• core-site.xml — site wide configuration, such as users,
  groups, sockets
• hdfs-site.xml — HDFS block size, Name and Data
  node directories
• mapred-site.xml — total MapReduce tasks,
  JobTracker address
• masters, slaves files — NameNode, JobTracker,
  DataNodes, and TaskTrackers addresses, as appropriate

Test the Installation
Log on to each server without a passphrase:

  ssh servername or ssh localhost

Format a new distributed file system:

  hadoop namenode -format

Start the Hadoop daemons:

  start-all.sh

Check the logs for errors at $HADOOP_HOME/logs!

Browse the NameNode and JobTracker web interfaces
(localhost is a valid name for local configurations).

HADOOP QUICK REFERENCE

The official commands guide is available from: manual.html

Usage
  hadoop [--config confdir] [COMMAND]

Hadoop can parse generic options and run classes from the
command line. confdir can override the default $HADOOP_HOME/
conf directory.

Generic Options
  -conf <config file>                  App configuration file
  -D <property=value>                  Set a property
  -fs <local|namenode:port>            Specify a namenode
  -jt <local|jobtracker:port>          Specify a job tracker; applies
                                       only to a job
  -files <file1, file2, .., fileN>     Files to copy to the cluster
                                       (job only)
  -libjars <file1, file2, .., fileN>   .jar files to include in the
                                       classpath (job only)
  -archives <file1, file2, .., fileN>  Archives to unbundle on the
                                       computational nodes (job only)

$HADOOP_HOME/bin/hadoop precedes all commands.
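For a single-node, pseudo-distributed setup, the minimum site files listed above might look like the following sketch. The property values are illustrative assumptions (host names, ports, and the replication factor must match your own cluster), not shipped defaults:

```xml
<!-- conf/core-site.xml: where the default file system lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: block replication; one copy on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
```

In a distributed deployment the same files are pushed to every node, with the localhost values replaced by the master's addresses from the masters and slaves files.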


User Commands
  archive -archiveName file.har /var/data1 /var/data2
      Create an archive
  distcp hdfs://node1:8020/dir_a
      Distributed copy from one or more node/dirs to a target
  fsck -locations /var/data1
  fsck -move /var/data1
  fsck /var/data
      File system checks: list block/location, move corrupted
      files to /lost+found, and general check
  job -list [all]
  job -submit job_file
  job -status 42
  job -kill 42
      Job list, dispatching, status check, and kill;
      submitting a job returns its ID
  pipes -conf file
  pipes -map File.class
  pipes -map M.class -reduce R.class -files
      Use HDFS and MapReduce from a C++ program
  queue -list
      List job queues

Administrator Commands
  balancer -threshold 50
      Cluster balancing at percent of disk capacity
  daemonlog -getlevel host name
      Fetch http://host/
  datanode
      Run a new datanode
  jobtracker
      Run a new job tracker
  namenode -format
  namenode -regular
  namenode -upgrade
  namenode -finalize
      Format, start a new instance, upgrade from a previous
      version of Hadoop, or remove the previous version's
      files and complete the upgrade

HDFS shell commands apply to local or HDFS file systems and
take the form:

  hadoop dfs -command dfs_command_options

HDFS Shell
  du /var/data1 hdfs://node/data2
      Display cumulative size of files and directories
  lsr
      Recursive directory list
  cat hdfs://node/file
      Types a file to stdout
  count hdfs://node/data
      Count the directories, files, and bytes in a path
  chmod, chgrp, chown
      Permissions
  expunge
      Empty file system trash
  get hdfs://node/data2 /var/data2
      Recursive copy files to the local system
  put /var/data2 hdfs://node/data2
      Recursive copy files to the target file system
  cp, mv, rm
      Copy, move, or delete files in HDFS only
  mkdir hdfs://node/path
      Recursively create a new directory in the target
  setrep -R -w 3
      Recursively set a file or directory replication factor
      (number of copies of the file)

Hot Tip: Wildcard expansion happens in the host’s shell, not
in the HDFS shell! A command issued to a directory
will affect the directory and all the files in it,
inclusive. Remember this to prevent surprises.

To leverage this quick reference, review and understand all the
Hadoop configuration, deployment, and HDFS management
concepts. The complete documentation is available from:

HADOOP APPS QUICK HOW-TO

A Hadoop application is made up of one or more jobs. A job
consists of a configuration file and one or more Java classes or
a set of scripts. Data must already exist in HDFS.

Figure 4 shows the basic building blocks of a Hadoop
application written in Java.

An application has one or more mappers and reducers and a
configuration container that describes the job, its stages, and
intermediate results. Classes are submitted and monitored
using the tools described in the previous section.

Input Formats and Types
• KeyValueTextInputFormat — Each line represents a key
  and value delimited by a separator; if the separator is
  missing the key and value are empty
• TextInputFormat — The key is the line number, the value
  is the text itself for each line
• NLineInputFormat — N sequential lines represent the
  value, the offset is the key
• MultiFileInputFormat — An abstraction that the user
  overrides to define the keys and values in terms of
  multiple files
• SequenceFileInputFormat — Raw format serialized
  key/value pairs
• DBInputFormat — JDBC driver fed data input

Output Formats
The output formats have a 1:1 correspondence with the
input formats and types. The complete list is available from:

Word Indexer Job Example
Applications are often required to index massive amounts
of text. This sample application shows how to build a simple
indexer for text files. The input is free-form text such as:


  hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily
  farewell.

The map function output should be something like:

  <KING, hamlet@11141>
  <CLAUDIUS, hamlet@11141>
  <We, hamlet@11141>
  <doubt, hamlet@11141>

The number represents the line in which the text occurred. The
mapper and reducer/combiner implementations in this section
require the documentation from:

The Mapper
The basic Java code implementation for the mapper has
this signature:

  public class LineIndexMapper
      extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable k, Text v,
        OutputCollector<Text, Text> o,
        Reporter r) throws IOException {
      /* implementation here */
    }
  }

The implementation itself uses standard Java text manipulation
tools; you can use regular expressions, scanners, whatever is
necessary.

Hot Tip: There were significant changes to the method
signatures in Hadoop 0.18, 0.20, and 0.21 - check
the documentation to get the exact signature for the
version you use.

The Reducer/Combiner
The combiner is an output handler for the mapper to reduce
the total data transferred over the network. It can be thought
of as a reducer on the local node.

  public class LineIndexReducer
      extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text k, Iterator<Text> v,
        OutputCollector<Text, Text> o,
        Reporter r) throws IOException {
      /* implementation */
    }
  }

The reducer iterates over keys and values generated in the
previous step, adding a line number to each word’s occurrence
index. The reduction results have the form:

  <KING, hamlet@11141; hamlet@42691; lear@31337>

A complete index shows the line where each word occurs, and
the file/work where it occurred.

Job Driver

  public class Driver {
    public static void main(String... argV) {
      Job job = new Job(new Configuration(), "test");
    }
  } // Driver

This driver is submitted to the Hadoop cluster for processing,
along with the rest of the code in a .jar file. One or more
files must be available in a reachable hdfs://node/path before
submitting the job using the command:

  hadoop jar shakespeare_indexer.jar

Using the Streaming API
The streaming API is intended for users with very limited Java
knowledge and interacts with any code that supports STDIN
and STDOUT streaming. Java is considered the best choice for
“heavy duty” jobs. Development speed could be a reason for
using the streaming API instead. Some scripted languages may
work as well or better than Java in specific problem domains.
This section shows how to implement the same mapper and
reducer using awk and compares its performance against
Java’s.

The Mapper

  #!/usr/bin/gawk -f
  {
    for (n = 2; n <= NF; n++) {
      if (length($n) > 0) printf("%s\t%s\n", $n, $1);
    }
  }

The output is mapped with the key, a tab separator, then the
index occurrence.

The Reducer

  #!/usr/bin/gawk -f
  { wordsList[$1] = ($1 in wordsList) ?
      sprintf("%s,%s", wordsList[$1], $2) : $2; }

  END {
    for (key in wordsList)
      printf("%s\t%s\n", key, wordsList[key]);
  }

The output is a list of all entries for a given word, like in the
previous section:

Awk’s main advantage is conciseness and raw text processing
power over other scripting languages and Java. Other
languages, like Python and Perl, are supported if they are
installed in the Data Nodes. It’s all about balancing speed of
development and deployment vs. speed of execution.

Job Driver

  hadoop jar hadoop-streaming.jar -mapper shakemapper.awk
    -reducer shakereducer.awk -input hdfs://node/shakespeare-works
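The indexer's logic can be exercised end to end without a cluster. The sketch below (LineIndexSketch is a hypothetical name; no Hadoop classes are used) mirrors the mapper and reducer described above: the mapper emits a (word, work@line) pair for every word after the tag field, and the reducer joins all tags seen for one word:

```java
import java.util.*;

public class LineIndexSketch {
    // Mapper: field 1 is the "work@line" tag, the remaining fields are words;
    // emits (word, tag) pairs, like LineIndexMapper above.
    static List<String[]> map(String line) {
        String[] f = line.split("[\\t ]+");
        List<String[]> out = new ArrayList<>();
        for (int i = 1; i < f.length; i++)
            if (!f[i].isEmpty()) out.add(new String[]{f[i], f[0]});
        return out;
    }

    // Reducer: joins every tag seen for one word with "; ",
    // producing the <word, tag; tag; ...> index entries shown above.
    static Map<String, String> reduce(List<String[]> pairs) {
        Map<String, String> index = new TreeMap<>();
        for (String[] kv : pairs)
            index.merge(kv[0], kv[1], (a, b) -> a + "; " + b);
        return index;
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        mapped.addAll(map("hamlet@11141\tKING CLAUDIUS"));
        mapped.addAll(map("lear@31337\tKING LEAR"));
        System.out.println(reduce(mapped));
        // prints {CLAUDIUS=hamlet@11141, KING=hamlet@11141; lear@31337, LEAR=lear@31337}
    }
}
```

In the real job the grouping between map and reduce is done by the framework's shuffle; here a single in-memory list stands in for it.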


Performance Tradeoff

Hot Tip: The streamed awk invocation and the Java version
are functionally equivalent, and the awk version is only
about 5% slower. This may be a good tradeoff if the
scripted version is significantly faster to develop and
is continuously maintained.
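Because the streaming contract is just lines on STDIN and STDOUT, the equivalence is easy to check locally without Hadoop: chain the awk mapper and reducer in an ordinary shell, with sort standing in for the framework's shuffle phase (the sample input is illustrative):

```shell
# Pipe sample input through the awk mapper, a sort (the "shuffle"),
# and the awk reducer -- the same data flow as the streaming job.
printf 'hamlet@11141\tKING CLAUDIUS\nlear@31337\tKING LEAR\n' \
| awk '{ for (n = 2; n <= NF; n++)
           if (length($n) > 0) printf("%s\t%s\n", $n, $1) }' \
| sort \
| awk '{ w[$1] = ($1 in w) ? sprintf("%s,%s", w[$1], $2) : $2 }
       END { for (k in w) printf("%s\t%s\n", k, w[k]) }'
```

The output includes a line such as `KING<TAB>hamlet@11141,lear@31337`; the order of the output lines depends on awk's array iteration order, just as reducer output order is unspecified in the real job.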


STAYING CURRENT

Do you want to know about specific projects and use cases
where NoSQL and data scalability are the hot topics? Join the
scalability newsletter:


ABOUT THE AUTHORS

Eugene Ciurana is an open-source evangelist who specializes
in the design and implementation of mission-critical,
high-availability large scale systems. Over the last two years,
Eugene designed and built hybrid cloud scalable systems and
computational networks for leading financial, software,
insurance, and healthcare companies in the US, Japan,
Mexico, and Europe. His publications include:

• Developing with Google App Engine, Apress
• DZone Refcard #105: NoSQL and Data Scalability
• DZone Refcard #43: Scalability and High Availability
• DZone Refcard #38: SOA Patterns
• The Tesla Testament: A Thriller, CIMEntertainment

Masoud Kalali is a software engineer and author. He has been
working on software development projects since 1998. He is
experienced in a variety of technologies and platforms.

Masoud is the author of several DZone Refcardz, including:
Using XML in Java, Berkeley DB Java Edition, Java EE Security,
and GlassFish v3. Masoud is also the author of a book on
GlassFish Security published by Packt. He is one of the founding
members of the NetBeans Dream Team and is a GlassFish
community spotlighted developer.

RECOMMENDED BOOK

Hadoop: The Definitive Guide helps you harness the power
of your data. Ideal for processing large datasets, the Apache
Hadoop framework is an open source implementation of the
MapReduce algorithm on which Google built its empire. This
comprehensive resource demonstrates how to use Hadoop
to build reliable, scalable, distributed systems: programmers
will find details for analyzing large datasets, and administrators
will learn how to set up and run Hadoop clusters.




About Cloud Computing
Usage Scenarios Getting Started with

Aldon Cloud#64Computing

Underlying Concepts

Upcoming Refcardz
youTechnologies ®

t toTier
brough Comply.
Platform Management and more...

ge. Colla By Daniel Rubio

dz. com

also minimizes the need to make design changes to support


tegra ternvasll

us Ind Anti-PPaat
Automated growthHTM
ref car

Web applications have always been deployed on servers & scalable

L vs XHT technologies


ation one time events, cloud ML
connected to what is now deemed the ‘cloud’. Having the capability to support

Network Security

ul M.
computing platforms alsoulfacilitate
Open the gradual growth curves

n an
Page Source


Vis it

However, the demands and technology used on such servers Structure

faced by web applications. Tools

By Key ■

E: has changed substantially in recent years, especially with ents Structur
LUD al Elem
TS INC gration
P the entrance of service providers like Amazon, Google and Large scale growth scenarios involvingents
and mor equipment
rdz !

ON TEN tinuous Inte Change

ry Microsoft. e chang
es e... away by
(e.g. load balancers and clusters) are all but abstracted
Con at Eve
About ns to isolat
relying on a cloud computing platform’s technology.
Software i-patter

Re fca

e Work
and Ant
These companies have a Privat
are in long deployedtrol repos
webmana applications
ge HTM
Patterns Control lop softw n-con to

that adapt and scale
les toto large user
a versio ing and making them
bases, In addition, several cloud computing ICSplatforms support data
ize merg
Version e... Patte it all fi minim le HTM
Manage s and mor e Work knowledgeable in amany
ine to
mainl aspects related tos multip
cloud computing. tier technologies that Lexceed the precedent set by Relational
space Comm and XHT


Privat lop on that utilize HTML MLReduce,

Practice Database Systems (RDBMS): is usedMap are web service APIs,

Deve code lines a system
d of work
as thescalethe By An
prog foundati
Ge t Mo

Buil Repo
This Refcard active
will introduce are to you to clouddcomputing,within units with an
RATION etc. Some platforms ram support large grapRDBMS deployments.

The src
dy Ha
and Java s written in on of
e ine loping hical
task-o it attribute
Mainl emphasis onDeve es by all
S these
ines providers, so youComm can better understand
also rece JavaScri user interfac web develop and the rris

codel desc
INUOU Task Level
ding e code as the

NT of buil trol what it is a cloudnize

line Policy sourc es as aplatform
computing can offer your ut web output ive data pt. Server-s e in clien ment. the ima alt attribute ribes whe
T CO cess ion con Code Orga it chang e witho likew ide lang t-side ge is describe re the ima
ise use mec hanism. fromAND
e name
and subm sourc
ABOU (CI) is evel Comm
the build
with uniqu softw
are from
was HTM
The eme pages and uages like Nested
ble. s alte ge file
a pro um
build L and
ed to ies to use HTM PHP tags
Task-L Label minim e text that
blem activit the bare standard a very loos XHTML rging Tags found,
ous Inte committ to a pro USAGE SCENARIOS ate all
Autom configurat
ion cies to t
ely-defi as thei Ajax
tech L can is disp
Continu ry change ization, cannot be (and freq
nden ymen
tion layed
need ned lang r visual eng nologies
Build al depe
a solu ineffective
deplo t
Label manu d tool the same for stan but as if
(i.e., ) nmen overlap
eve , Build stalle
t, use target enviro Amazon EC2: Industry standard it has
software and uag
virtualization ine. HTM uen
with erns ns (i.e. problem ated ce pre-in ymen whether dards e with b></ , so <a>< tly are) nest
ory. via patt icular
Autom Redu d deploEAR) in each has bec become very little L
a> is

reposit s that Pay only cieswhat you consume
tagge or Amazon’s cloud you cho platform
the curr isome
heavily basedmoron fine. b></ ed insid
lained ) and anti the part solution duce nden For each (e.g. WAR es t
ent stan ose to writ more e impo a></ e
not lega each othe
exp x” al Depe ge nmen b> is
be text to “fi ns are Web application deployment until
nden a few years
t librari ago was similar
t enviro industry standard
that will softwaredard and virtualization app
e HTM technology. rtant,
to pro Minim packa
Mo re

CI can ticular con used i-patter they tend but can

rity all depe all targe
s will L or XHT arent. Reg the HTM l, but r. Tags
etimes s. Ant tices,
to most phone services:
y Integ alizeplans with le that
late fialloted resources, ts with an and XHT simplify all help ML, und ardless
XHTM <a><
in a par hes som , Binar Centr
temp you prov b></
proces in the end bad prac lementing
nt nmen your
incurred costgeme whether e a single on ML of L
Creatsuchare resources were consumed or
t enviro Virtualization allows a physical
are actu pieceothe of hardware to be erst HTML
based much web cod ide a solid
approac ed with the cial, but,
cy Mana nt targe es to anding
essarily ed to imp nden rties
into differe chang of the ally simp r This has
efi Depe prope utilized by multiple operating
function systems. ing.resourcesfoundati
ler thanallows job adm been arou
not nec compar
associat to be ben
Ge t

late Verifi te builds e comm commo Fortuna

y are n Cloud computing asRun it’sremo
known etoday has changed this.
befor etc. alitytohas they on irably, nd for
Temp Build (e.g. bandwidth, nmemory, CPU) be allocated exclusively totely exp som
appear effects. The results whe that
rm a
Privat y, contin Every elem mov used
to be, HTML
ected. job has e time
ed The various s resourcesPerfo consumed by webperio applications
dicall (e.g. nt team
individual operating entsinstances. ed to CSS
system Brow Early . Whi
gration expand
adverse unintend
d Build
common e (HTML because HTM
ed far le it has don
sitory . ser
Build r to devel
web dev manufacture L had very
ous Inte
e bandwidth, memory, CPU) areIntegtallied
ration on a per-unit CI serve basis or XHT
produc Continu Refcard
e Build rm an ack from extensio .) All are limited more than e its

ML shar
tern. term
(starting from zero) by Perfo all majorated cloud
feedb computing platforms. on As a user of Amazon’s
n HT EC2 essecloud
nti l computing es c platform, you are result elopers rs add
ed layo anyb
he pat tion f he le this h s d d utom they occur ld based i c d

DZone, Inc.
140 Preston Executive Dr., Suite 100
Cary, NC 27513
888.678.0399

ISBN-13: 978-1-934238-75-2
ISBN-10: 1-934238-75-9

DZone communities deliver over 6 million pages each month to
more than 3.3 million software developers, architects and decision
makers. DZone offers something for everyone, including news,
tutorials, cheatsheets, blogs, feature articles, source code and more.
“DZone is a developer’s dream,” says PC Magazine.

Copyright © 2010 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise,
without prior written permission of the publisher. Version 1.0