#133 Apache Hadoop Deployment
CONTENTS INCLUDE:
• Introduction
• Which Hadoop Distribution?
• Apache Hadoop Installation
• Hadoop Monitoring Ports
• Apache Hadoop Production Deployment
• Hot Tips and more...
Apache Hadoop Deployment:
A Blueprint for Reliable Distributed Computing
By Eugene Ciurana
INTRODUCTION
This Refcard presents a basic blueprint for deploying
Apache Hadoop HDFS and MapReduce in development and
production environments. Check out Refcard #117, Getting
Started with Apache Hadoop, for basic terminology and for an
overview of the tools available in the Hadoop Project.
WHICH HADOOP DISTRIBUTION?
Apache Hadoop is a scalable framework for implementing
reliable and scalable computational networks. This Refcard
presents how to deploy and use development and production
computational networks. HDFS, MapReduce, and Pig are the
foundational tools for developing Hadoop applications.
There are two basic Hadoop distributions:
• Apache Hadoop is the main open-source, bleeding-edge
distribution from the Apache Software Foundation.
• The Cloudera Distribution for Apache Hadoop (CDH) is an
open-source, enterprise-class distribution for production-
ready environments.
The decision to use one distribution or the other depends
on the organization's objectives.
• The Apache distribution is fine for experimental learning
exercises and for becoming familiar with how Hadoop is
put together.
• CDH removes the guesswork and offers an almost turnkey
product for robustness and stability; it also offers some
tools not available in the Apache distribution.
Hot Tip:
Cloudera offers professional services and puts
out an enterprise distribution of Apache
Hadoop. Their toolset complements Apache’s.
Documentation about Cloudera’s CDH is available
from http://docs.cloudera.com.
The Apache Hadoop distribution assumes that the person
installing it is comfortable with configuring a system manually.
CDH, on the other hand, is designed as a drop-in component for
all major Linux distributions.
Hot Tip: Linux is the supported platform for production
systems. Windows is adequate as a development platform
but is not supported for production.
Minimum Prerequisites
• Java 1.6 from Oracle, version 1.6 update 8 or later; identify
your current JAVA_HOME
• sshd and ssh for managing Hadoop daemons across
multiple systems
• rsync for file and directory synchronization across the nodes
in the cluster
• Create a service account for user hadoop where
$HOME=/home/hadoop (see the sketch below)
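A minimal sketch for creating the service account and confirming the Java prerequisite; the useradd flags assume a Debian/Ubuntu-style system, so adjust them for your distribution:
sudo useradd -m -d /home/hadoop -s /bin/bash hadoop   # service account with $HOME=/home/hadoop
java -version                                         # expect 1.6.0_08 or later
echo "$JAVA_HOME"                                     # note this path; hadoop-env.sh needs it later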
SSH Access
Every system in a Hadoop deployment must provide SSH
access for data exchange between nodes. Log in to the node
as the Hadoop user and run the commands in Listing 1 to
validate or create the required SSH configuration.
Listing 1 - Hadoop SSH Prerequisites
keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh localhost -C true ; then \
if [ ! -e "$keyFile" ]; then \
ssh-keygen -t rsa -b 2048 -P '' \
-f "$pKeyFile"; \
fi; \
cat "$keyFile" >> "$authKeys"; \
chmod 0640 "$authKeys"; \
echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi
The key passphrase in this example is left blank (-P ''). If this
were to run on a public network it could be a security hole.
Distribute the public key from the master node to all other
nodes for data exchange. All nodes are assumed to run in a
secure network behind the firewall.
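One way to distribute the master's public key is ssh-copy-id, assuming it is installed; the node names below are placeholders:
# Substitute the real worker hostnames for node01..node03.
for node in node01 node02 node03; do
ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" hadoop@"$node"
done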
Hot Tip:
All the bash shell commands in this Refcard are
available for cutting and pasting from:
http://ciurana.eu/DeployingHadoopDZone
Enterprise: CDH Prerequisites
Cloudera simplified the installation process by offering packages
for Ubuntu Server and Red Hat Linux distributions.
Hot Tip:
CDH packages have names like CDH2, CDH3, and
so on, corresponding to the CDH version. The
examples here use CDH3. Use the appropriate
version for your installation.
CDH on Ubuntu Pre-Install Setup
Execute these commands as root or via sudo to add the
Cloudera repositories:
Listing 2 - Ubuntu Pre-Install Setup
DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update
CDH on Red Hat Pre-Install Setup
Run these commands as root or through sudo to add the yum
Cloudera repository:
Listing 3 - Red Hat Pre-Install Setup
curl -sL http://is.gd/3ynKY7 | tee \
/etc/yum.repos.d/cloudera-cdh3.repo | \
awk '/^name/'
yum update yum
Ensure that all the prerequisite software and configuration
are installed on every machine intended to be a Hadoop
node. Don’t mix and match operating systems, distributions,
Hadoop, or Java versions!
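A quick way to spot version drift across nodes, assuming the SSH setup from Listing 1 and placeholder hostnames:
for node in node01 node02 node03; do
echo "== $node =="
ssh hadoop@"$node" 'lsb_release -ds; java -version 2>&1 | head -1; hadoop version 2>/dev/null | head -1'
done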
Hadoop for Development
• Hadoop runs as a single Java process, in non-distributed
mode, by default. This configuration is optimal for
development and debugging.
• Hadoop also offers a pseudo-distributed mode, in which
every Hadoop daemon runs in a separate Java process.
This configuration is optimal for development and will be
used for the examples in this guide.
Hot Tip:
If you have an OS X or a Windows development
workstation, consider using a Linux distribution
hosted on VirtualBox for running Hadoop. It will
help prevent support or compatibility headaches.
Hadoop for Production
• Production environments are deployed across a group
of machines that make the computational network.
Hadoop must be configured to run in fully distributed,
clustered mode.
APACHE HADOOP INSTALLATION
This Refcard is a reference for development and production
deployment of the components shown in Figure 1. It includes
the components available in the basic Hadoop distribution and
the enhancements that Cloudera released.
Figure 1 - Hadoop Components: HDFS (backed by local disks),
MapReduce, HBase, Hive, Pig, Chukwa, and ZooKeeper are
Hadoop components; Sqoop (database imports) and Flume
(log collection) are CDH components.
Hot Tip:
Whether the user intends to run Hadoop in
non-distributed or distributed modes, it’s best
to install every required component in every
machine in the computational network. Any
computer may assume any role thereafter.
A non-trivial, basic Hadoop installation includes at least
these components:
• Hadoop Common: the basic infrastructure necessary for
running all components and applications
• HDFS: the Hadoop Distributed File System
• MapReduce: the framework for large data set
distributed processing
• Pig: an optional, high-level language for parallel
computation and data flow
Enterprise users often choose CDH because of:
• Flume: a distributed service for efficient large data
transfers in real-time
• Sqoop: a tool for importing relational databases into
Hadoop clusters
Apache Hadoop Development Deployment
The steps in this section must be repeated for every node in
a Hadoop cluster. Downloads, installation, and configuration
could be automated with shell scripts. All these steps
are performed as the service user hadoop, defined in the
prerequisites section.
http://hadoop.apache.org/common/releases.html has the latest
version of the common tools. This guide used version 0.20.2.
1. Download Hadoop from a mirror and unpack it in the
/home/hadoop work directory (a download sketch appears after Listing 4).
2. Set the JAVA_HOME environment variable.
3. Set the run-time environment:
Listing 4 - Set the Hadoop Runtime Environment
version=0.20.2 # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export \
HADOOP_SLAVES=$HADOOP_HOME/slaves" \
>> "$runtimeEnv"
mkdir "$HADOOP_HOME"/slaves
echo \
"export HADOOP_IDENT_STRING=$identity" >> \
"$runtimeEnv"
echo \
"export JAVA_HOME=$JAVA_HOME" \
>>"$runtimeEnv"
export \
PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv
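Steps 1 and 2 above can be scripted along these lines; the Apache archive URL is an example and should be verified (or replaced with a current mirror) before use:
version=0.20.2
cd /home/hadoop
curl -O "http://archive.apache.org/dist/hadoop/core/hadoop-$version/hadoop-$version.tar.gz"
tar xzf "hadoop-$version.tar.gz"
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # adjust to the JDK path identified in the prerequisites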
Configuration
Pseudo-distributed operation (each daemon runs in a separate
Java process) requires updates to core-site.xml, hdfs-site.xml,
and mapred-site.xml. These files configure the master,
the file system, and the MapReduce framework and live in the
runtime/conf directory.
Listing 5 - Pseudo-Distributed Operation Config
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
These files are documented in the Apache Hadoop Clustering
reference, http://is.gd/E32L4s — some parameters are discussed
in this Refcard’s production deployment section.
Test the Hadoop Installation
Hadoop requires a formatted HDFS cluster to do its work:
hadoop namenode -format
The HDFS volume lives on top of the standard file system. The
format command will show this upon successful completion:
/tmp/dfs/name has been successfully formatted.
Start the Hadoop processes and perform these operations to
validate the installation:
• Use the contents of runtime/conf as known input
• Use Hadoop for finding all text matches in the input
• Check the output directory to ensure it works
Listing 6 - Testing the Hadoop Installation
start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
Hot Tip:
You may ignore any warnings or errors about a
missing slaves file.
• View the output files in the HDFS volume and stop the
Hadoop daemons to complete testing the install
Listing 7 - Job Completion and Daemon Termination
hadoop fs -cat output/*
stop-all.sh
That’s it! Apache Hadoop is installed in your system and ready
for development.
CDH Development Deployment
CDH removes a lot of grueling work from the Hadoop
installation process by offering ready-to-go packages
for mainstream Linux server distributions. Compare the
instructions in Listing 8 against the previous section. CDH
simplifies installation and configuration for huge time savings.
Listing 8 - Installing CDH
ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver
Leveraging some or all of CDH's extra components is another
good reason for choosing it over the Apache version. Install
Pig, Flume, or Sqoop with the instructions in Listing 9.
Listing 9 - Adding Optional Components
apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop
Test the CDH Installation
The CDH daemons are ready to be executed as services.
There is no need to create a service account for executing
them. They can be started or stopped as any other Linux
service, as shown in Listing 10.
Listing 10 - Starting the CDH Daemons
for s in /etc/init.d/hadoop* ; do \
"$s" start; done
CDH will create an HDFS partition when its daemons start. It’s
another convenience it offers over regular Hadoop. Listing 11
shows how to validate the installation by:
• Listing the HDFS module
• Moving files to the HDFS volume
• Running an example job
• Validating the output
Listing 11 - Testing the CDH Installation
hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*
The daemons will continue to run until the server stops. All the
Hadoop services are available.
Monitoring the Local Installation
Use a browser to check the NameNode or the JobTracker state
through their web UI and web services interfaces. All daemons
expose their data over HTTP. Users can choose to monitor a
node or daemon interactively using the web UI, as in Figure
2. Developers, monitoring tools, and system administrators can
use the same ports for tracking the system performance and
state using web service calls.
Figure 2 - NameNode status web UI
The web interface can be used for monitoring the JobTracker,
which dispatches tasks to specific nodes in a cluster; the
DataNodes; or the NameNode, which manages directory
namespaces and file nodes in the file system.
HADOOP MONITORING PORTS
Use the information in Table 1 for configuring a development
workstation or production server firewall.
Port Service
50030 JobTracker
50060 TaskTrackers
50070 NameNode
50075 DataNodes
50090 Secondary NameNode
50105 Backup Node
Table 1 - Hadoop ports
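If the firewall is iptables-based, rules along these lines open only the monitoring ports from Table 1; the admin subnet is a placeholder:
# Run as root; replace 192.168.1.0/24 with your admin network.
for port in 50030 50060 50070 50075 50090 50105; do
iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport "$port" -j ACCEPT
done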
Plugging In a Monitoring Agent
The Hadoop daemons also expose internal data over a RESTful
interface. Automated monitoring tools like Nagios, Splunk, or
SOBA can use them. Listing 12 shows how to fetch a daemon’s
metrics as a JSON document:
Listing 12 - Fetching Daemon Metrics
http://localhost:50070/metrics?format=json
All the daemons expose these useful resource paths:
• /metrics - various data about the system state
• /stacks - stack traces for all threads
• /logs - enables fetching logs from the file system
• /logLevel - interface for setting log4j logging levels
Each daemon type also exposes one or more resource paths
specific to its operation. A comprehensive list is available from:
http://is.gd/MBN4qz
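A few examples of probing these paths from the shell; they assume a NameNode on its default port 50070 (see Table 1):
curl -s 'http://localhost:50070/metrics?format=json'   # same data as Listing 12
curl -s 'http://localhost:50070/stacks'                # thread dump for troubleshooting
hadoop daemonlog -getlevel localhost:50070 \
org.apache.hadoop.hdfs.server.namenode.NameNode        # query a log4j level via /logLevel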
APACHE HADOOP PRODUCTION DEPLOYMENT
The fastest way to deploy a Hadoop cluster is by using the
prepackaged tools in CDH. They include all the same software
as the Apache Hadoop distribution but are optimized to run in
production servers and with tools familiar to system administrators.
Hot Tip:
Detailed guides that complement this Refcard are
available from Cloudera at http://is.gd/RBWuxm
and from Apache at http://is.gd/ckUpu1.
Figure 3 - Hadoop Computational Network: the NameNode
tracks jobs and names and assigns work; DataNodes run Map
and Reduce tasks over HDFS blocks, writing intermediate data
locally and reporting status; the user application supplies the
Mapper and Reducer (Java classes or scripts) and collects the
output.
The deployment diagram in Figure 3 describes all the
participating nodes in a computational network. The basic
procedure for deploying a Hadoop cluster is:
• Pick a Hadoop distribution
• Prepare a basic configuration on one node
• Deploy the same pre-configured package across all
machines in the cluster
• Configure each machine in the network according to its role
The Apache Hadoop documentation shows this as a rather
involved process. The value-added in CDH is that most of that
work is already in place. Role-based configuration is very easy
to accomplish. The rest of this Refcard will be based on CDH.
Handling Multiple Configurations: Alternatives
Each server role will be determined by its configuration,
since all servers will have the same software installed. CDH
supports the Ubuntu and Red Hat mechanisms for handling
alternative configurations.
Hot Tip: Check the man page to learn more about
alternatives.
Ubuntu: man update-alternatives
Red Hat: man alternatives
The Linux alternatives mechanism ensures that all files
associated with a specific package are selected as a system
default. This customization is where all the extra work went
into CDH. The CDH installation uses alternatives to set the
effective CDH configuration.
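To see which configuration directory is currently active, query the alternatives system; hadoop-0.20-conf matches the alternative name registered in Listing 13:
update-alternatives --display hadoop-0.20-conf   # Ubuntu
alternatives --display hadoop-0.20-conf          # Red Hat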
Setting Up the Production Configuration
Listing 13 takes a basic Hadoop configuration and sets it up
for production.
Listing 13 - Set the Production Configuration
ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
"$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" restart; done
The server will restart all the Hadoop daemons using the new
production configuration.
Figure 4 - Hadoop Conceptual Topology: a single NameNode
in front of HDFS, serving multiple DataNodes.
Readying the NameNode for Hadoop
Pick a node from the cluster to act as the NameNode
(see Figure 3). All Hadoop activity depends on having a valid
R/W file system. Format the distributed file system from the
NameNode, using user hdfs:
Listing 14 - Create a New File System
sudo -u hdfs hadoop namenode -format
Stop all the nodes to complete the file system, permissions, and
ownership confguration. Optionally, set daemons for automatic
startup using rc.d.
Listing 15 - Stop All Daemons
# Run this in every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" stop ;\
# Optional command for auto-start:
update-rc.d "$h" defaults; \
done
File System Setup
Every node in the cluster must be configured with appropriate
directory ownership and permissions. Execute the commands in
Listing 16 in every node:
Listing 16 - File System Setup
mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local
Starting the Cluster
• Start the NameNode to make HDFS available to all nodes
• Set the MapReduce owner and permissions in the
HDFS volume
• Start the JobTracker
• Start all other nodes (a consolidated startup sketch follows Listing 18)
CDH daemons are defined in /etc/init.d; they can be
configured to start along with the operating system or they can
be started manually. Execute the command appropriate for each
node type using this example:
Listing 17 - Starting a Node Example
# Run this in every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" start ; done
Use jobtracker, datanode, tasktracker, etc. corresponding to the
node you want to start or stop.
Hot Tip:
Refer to the Linux distribution’s documentation
for information on how to start the /etc/init.d
daemons with the chkconfig tool.
Listing 18 - Set the MapReduce Directory Up
sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system
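Putting the startup order above together as a sketch; the init script names follow the CDH3 hadoop-0.20 packaging and should be checked against the contents of /etc/init.d on your nodes:
/etc/init.d/hadoop-0.20-namenode start         # on the NameNode host
sudo -u hdfs hadoop fs -mkdir /mapred/system   # then Listing 18: MapReduce owner and permissions
sudo -u hdfs hadoop fs -chown mapred /mapred/system
/etc/init.d/hadoop-0.20-jobtracker start       # on the JobTracker host
/etc/init.d/hadoop-0.20-datanode start         # on every worker node
/etc/init.d/hadoop-0.20-tasktracker start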
Update the Hadoop Configuration Files
Listing 19 - Minimal HDFS Config Update
<!-- hdfs-site.xml -->
<property>
<name>dfs.name.dir</name>
<value>/data/1/dfs/nn,/data/2/dfs/nn
</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>
/data/1/dfs/dn,/data/2/dfs/dn,
/data/3/dfs/dn,/data/4/dfs/dn
</value>
<final>true</final>
</property>
The last step consists of configuring the MapReduce nodes to
find their local working and system directories:
Listing 20 - Minimal MapReduce Config Update
<!-- mapred-site.xml -->
<property>
<name>mapred.local.dir</name>
<value>
/data/1/mapred/local,
/data/2/mapred/local
</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>
/mapred/system
</value>
<final>true</final>
</property>
Start the JobTracker and all other nodes. You now have a working
Hadoop cluster. Use the commands in Listing 11 to validate that
it’s operational.
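An optional extra check, run as the hdfs user, confirms that every DataNode has registered with the NameNode:
sudo -u hdfs hadoop dfsadmin -report   # lists cluster capacity and each live DataNode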
WHAT’S NEXT?
The instructions in this Refcard result in a working development
or production Hadoop cluster. Hadoop is a complex framework
and requires attention to configure and maintain. Review
the Apache Hadoop and Cloudera CDH documentation. Pay
particular attention to the sections on:
• How to write MapReduce, Pig, or Hive applications
• Multi-node cluster management with ZooKeeper
• Hadoop ETL with Sqoop and Flume
Happy Hadoop computing!
STAYING CURRENT
Do you want to know about specific projects and use cases
where Hadoop and data scalability are the hot topics? Join the
scalability newsletter:
http://ciurana.eu/scalablesystems


ABOUT THE AUTHOR
Eugene Ciurana (http://eugeneciurana.eu) is
the VP of Technology at Badoo.com, the largest
dating site worldwide, and cofounder of SOBA
Labs, the most sophisticated public and private
clouds management software. Eugene is also an
open-source evangelist who specializes in the
design and implementation of mission-critical, high-availability
systems. He recently built scalable computational networks for
leading financial, software, insurance, SaaS, government, and
healthcare companies in the US, Japan, Mexico, and Europe.
Publications
• Developing with Google App Engine, Apress
• DZone Refcard #117: Getting Started with Apache Hadoop
• DZone Refcard #105: NoSQL and Data Scalability
• DZone Refcard #43: Scalability and High Availability
• The Tesla Testament: A Thriller, CIMEntertainment
Thank You!
Thanks to all the technical reviewers, especially to Pavel Dovbush
at http://dpp.su
RECOMMENDED BOOK
Hadoop: The Definitive Guide helps you harness
the power of your data. Ideal for processing large
datasets, the Apache Hadoop framework is an
open-source implementation of the MapReduce
algorithm on which Google built its empire. This
comprehensive resource demonstrates how to
use Hadoop to build reliable, scalable, distributed systems;
programmers will find details for analyzing large datasets, and
administrators will learn how to set up and run Hadoop clusters.
BUY NOW
http://oreilly.com/catalog/0636920010388/
