#133 Apache Hadoop Deployment
CONTENTS INCLUDE:
• Introduction
• Which Hadoop Distribution?
• Apache Hadoop Installation
• Hadoop Monitoring Ports
• Apache Hadoop Production Deployment
• Hot Tips and more...
By Eugene Ciurana
Apache Hadoop Deployment:
A Blueprint for Reliable Distributed Computing
INTRODUCTION
This Refcard presents a basic blueprint for deploying
Apache Hadoop HDFS and MapReduce in development and
production environments. Check out Refcard #117, Getting
Started with Apache Hadoop, for basic terminology and for an
overview of the tools available in the Hadoop Project.
WHICH HADOOP DISTRIBUTION?
Apache Hadoop is a scalable framework for implementing reliable, distributed computational networks. This Refcard shows how to deploy and use Hadoop in development and production computational networks. HDFS, MapReduce, and Pig are the foundational tools for developing Hadoop applications.
There are two basic Hadoop distributions:
• Apache Hadoop is the main open-source, bleeding-edge distribution from the Apache Software Foundation.
• The Cloudera Distribution for Apache Hadoop (CDH) is an
open-source, enterprise-class distribution for production-
ready environments.
The decision to use one distribution or the other depends on the organization's objectives.
• The Apache distribution is fine for experimental learning exercises and for becoming familiar with how Hadoop is put together.
• CDH removes the guesswork and offers an almost turnkey product for robustness and stability; it also offers some tools not available in the Apache distribution.
Hot Tip
Cloudera offers professional services and puts out an enterprise distribution of Apache Hadoop. Their toolset complements Apache's. Documentation about Cloudera's CDH is available from http://docs.cloudera.com.
The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.
Hot Tip
Linux is the supported platform for production systems. Windows is adequate as a development platform but is not supported for production.
Minimum Prerequisites
• Java 1.6 from Oracle, version 1.6 update 8 or later; identify your current JAVA_HOME
• sshd and ssh for managing Hadoop daemons across multiple systems
• rsync for file and directory synchronization across the nodes in the cluster
• A service account for user hadoop with $HOME=/home/hadoop
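A quick way to confirm these prerequisites on a node is a short shell check along the lines of the sketch below; it is only a suggestion and assumes nothing beyond the tools listed above.

# Sketch: verify the minimum prerequisites on this node
java -version 2>&1 | head -n 1              # expect 1.6.0_08 or later
echo "JAVA_HOME=${JAVA_HOME:-not set}"
command -v ssh rsync || echo "ssh or rsync missing"
[ -x /usr/sbin/sshd ] || echo "sshd not found in /usr/sbin"
getent passwd hadoop || echo "service account hadoop not created yet"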
SSH Access
Every system in a Hadoop deployment must provide SSH access for data exchange between nodes. Log in to the node as the Hadoop user and run the commands in Listing 1 to validate or create the required SSH configuration.
Listing 1 - Hadoop SSH Prerequisites
keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh localhost -C true ; then \
  if [ ! -e "$keyFile" ]; then \
    ssh-keygen -t rsa -b 2048 -P '' \
      -f "$pKeyFile"; \
  fi; \
  cat "$keyFile" >> "$authKeys"; \
  chmod 0640 "$authKeys"; \
  echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi
The key passphrase in this example is left blank. If this were to run on a public network it could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind the firewall.
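One way to distribute the key, assuming the nodes already accept password logins for the hadoop user, is ssh-copy-id; the host names below (node01, node02, node03) are placeholders for your own machines.

# Sketch: push the master's public key to each worker (host names are placeholders)
for node in node01 node02 node03; do
  ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" hadoop@"$node"
done
# spot-check that password-less login now works
ssh hadoop@node01 true && echo "SSH to node01 OK"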
Hot Tip
All the bash shell commands in this Refcard are available for cutting and pasting from: http://ciurana.eu/DeployingHadoopDZone
Enterprise: CDH Prerequisites
Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.
Hot Tip
CDH packages have names like CDH2, CDH3, and so on, corresponding to the CDH version. The examples here use CDH3. Use the appropriate version for your installation.
CDH on Ubuntu Pre-Install Setup
Execute these commands as root or via sudo to add the
Cloudera repositories:
Listing 2 - Ubuntu Pre-Install Setup
DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update
CDH on Red Hat Pre-Install Setup
Run these commands as root or through sudo to add the yum
Cloudera repository:
Listing 3 - Red Hat Pre-Install Setup
curl -sL http://is.gd/3ynKY7 | tee \
  /etc/yum.repos.d/cloudera-cdh3.repo | \
  awk '/^name/'
yum update yum
Ensure that all the prerequisite software and configuration are installed on every machine intended to be a Hadoop node. Don't mix and match operating systems, distributions, Hadoop, or Java versions!
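A simple loop such as the hedged sketch below can report the Java and OS versions on every node so mismatches are caught early; the host list is again a placeholder.

# Sketch: compare Java and OS versions across placeholder hosts
for node in node01 node02 node03; do
  echo "== $node =="
  ssh hadoop@"$node" \
    'java -version 2>&1 | head -n 1; lsb_release -d 2>/dev/null || cat /etc/redhat-release'
done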
Hadoop for Development
• Hadoop runs as a single Java process, in non-distributed mode, by default. This configuration is optimal for development and debugging.
• Hadoop also offers a pseudo-distributed mode, in which every Hadoop daemon runs in a separate Java process. This configuration is optimal for development and will be used for the examples in this guide.
Hot Tip
If you have an OS X or a Windows development workstation, consider using a Linux distribution hosted on VirtualBox for running Hadoop. It will help prevent support or compatibility headaches.
Hadoop for Production
• Production environments are deployed across a group of machines that make up the computational network. Hadoop must be configured to run in fully distributed, clustered mode.
APACHE HADOOP INSTALLATION
This Refcard is a reference for development and production
deployment of the components shown in Figure 1. It includes
the components available in the basic Hadoop distribution and
the enhancements that Cloudera released.
[Figure 1 - Hadoop Components: HDFS over local disks, with MapReduce, HBase, Hive, Pig, Chukwa, and ZooKeeper as Hadoop components, and Sqoop (database import) and Flume (log collection) as CDH components.]
Hot Tip
Whether the user intends to run Hadoop in non-distributed or distributed mode, it's best to install every required component on every machine in the computational network. Any computer may assume any role thereafter.
A non-trivial, basic Hadoop installation includes at least these components:
• Hadoop Common: the basic infrastructure necessary for running all components and applications
• HDFS: the Hadoop Distributed File System
• MapReduce: the framework for large data set distributed processing
• Pig: an optional, high-level language for parallel computation and data flow
Enterprise users often choose CDH because of:
• Flume: a distributed service for efficient large data transfers in real time
• Sqoop: a tool for importing data from relational databases into Hadoop clusters
Apache Hadoop Development Deployment
The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration could be automated with shell scripts. All these steps are performed as the service user hadoop, defined in the prerequisites section.
http://hadoop.apache.org/common/releases.html has the latest version of the common tools. This guide used version 0.20.2.
1. Download Hadoop from a mirror and unpack it in the /home/hadoop work directory.
2. Set the JAVA_HOME environment variable.
3. Set the run-time environment:
Listing 4 - Set the Hadoop Runtime Environment
version=0.20.2 # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export \
HADOOP_SLAVES=$HADOOP_HOME/slaves" \
  >> "$runtimeEnv"
mkdir "$HADOOP_HOME"/slaves
echo \
  "export HADOOP_IDENT_STRING=$identity" >> \
  "$runtimeEnv"
echo \
  "export JAVA_HOME=$JAVA_HOME" \
  >> "$runtimeEnv"
export \
  PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv
Configuration
Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and mapred-site.xml. These files configure the master, the file system, and the MapReduce framework, and live in the runtime/conf directory.
Listing 5 - Pseudo-Distributed Operation Config
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
These files are documented in the Apache Hadoop Clustering reference, http://is.gd/E32L4s; some parameters are discussed in this Refcard's production deployment section.
Test the Hadoop Installation
Hadoop requires a formatted HDFS cluster to do its work:
hadoop namenode -format
The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:
/tmp/dfs/name has been successfully formatted.
Start the Hadoop processes and perform these operations to validate the installation:
• Use the contents of runtime/conf as known input
• Use Hadoop for finding all text matches in the input
• Check the output directory to ensure it works
Listing 6 - Testing the Hadoop Installation
start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
  grep input output 'dfs[a-z.]+'
Hot Tip
You may ignore any warnings or errors about a missing slaves file.
• View the output files in the HDFS volume and stop the Hadoop daemons to complete testing the install
Listing 7 - Job Completion and Daemon Termination
hadoop fs -cat output/*
stop-all.sh
That’s it! Apache Hadoop is installed in your system and ready
for development.
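As an optional extra smoke test, the same examples jar also bundles a wordcount job; the sketch below reuses the input directory uploaded in Listing 6 and writes to a new wc-output directory (both names are just examples).

# Sketch: run the bundled wordcount example against the same input
start-all.sh ; sleep 5
hadoop jar runtime/hadoop-*-examples.jar \
  wordcount input wc-output
hadoop fs -cat wc-output/* | head
stop-all.sh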
CDH Development Deployment
CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.
Listing 8 - Installing CDH
ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
  then command="/usr/bin/yum"; fi
"$command" install \
  hadoop-"$ver"-conf-pseudo
unset command ; unset ver
Leveraging some or all of the extra components in Hadoop
or CDH is another good reason for using it over the Apache
version. Install Flume or Pig with the instructions in Listing 9.
Listing 9 - Adding Optional Components
apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop
Test the CDH Installation
The CDH daemons are ready to be executed as services.
There is no need to create a service account for executing
them. They can be started or stopped as any other Linux
service, as shown in Listing 10.
Listing 10 - Starting the CDH Daemons
for s in /etc/init.d/hadoop* ; do \
  "$s" start; done
CDH will create an HDFS partition when its daemons start. It's another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation by:
• Listing the HDFS root directory
• Moving files to the HDFS volume
• Running an example job
• Validating the output
Listing 11 - Testing the CDH Installation
hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
  grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*
The daemons will continue to run until the server stops. All the
Hadoop services are available.
Monitoring the Local Installation
Use a browser to check the NameNode or the JobTracker state through their web UI and web services interfaces. All daemons expose their data over HTTP. Users can choose to monitor a node or daemon interactively using the web UI, as in Figure 2. Developers, monitoring tools, and system administrators can use the same ports for tracking system performance and state using web service calls.
Figure 2 - NameNode status web UI
The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster; the DataNodes; or the NameNode, which manages directory namespaces and file nodes in the file system.
HADOOP MONITORING PORTS
Use the information in Table 1 for configuring a development workstation or production server firewall.
Port Service
50030 JobTracker
50060 TaskTrackers
50070 NameNode
50075 DataNodes
50090 Secondary NameNode
50105 Backup Node
Table 1 - Hadoop ports
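As an illustration only, iptables rules like the following would open those ports to a hypothetical cluster subnet (10.0.0.0/24 is a placeholder); adapt the idea to whatever firewall tooling and address range you actually use.

# Sketch: allow the Hadoop monitoring ports from a placeholder subnet
for port in 50030 50060 50070 50075 50090 50105; do
  iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport "$port" -j ACCEPT
done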
Plugging In a Monitoring Agent
The Hadoop daemons also expose internal data over a RESTful interface. Automated monitoring tools like Nagios, Splunk, or SOBA can use them. Listing 12 shows how to fetch a daemon's metrics as a JSON document:
Listing 12 - Fetching Daemon Metrics
http://localhost:50070/metrics?format=json
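From the command line, the same data can be pulled with curl; piping it through Python's json.tool is optional and only pretty-prints the response.

# Sketch: fetch the NameNode metrics and pretty-print them
curl -s 'http://localhost:50070/metrics?format=json' | python -m json.tool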
All the daemons expose these useful resource paths:
• /metrics - various data about the system state
• /stacks - stack traces for all threads
• /logs - enables fetching logs from the file system
• /logLevel - interface for setting log4j logging levels
Each daemon type also exposes one or more resource paths specific to its operation. A comprehensive list is available from: http://is.gd/MBN4qz
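For example, the /logLevel path accepts log and level query parameters; the sketch below reads and then raises the level of one logger on the NameNode, and the logger name is only illustrative.

# Sketch: read, then raise, a logger's level through the NameNode web interface
curl -s 'http://localhost:50070/logLevel?log=org.apache.hadoop.hdfs.StateChange'
curl -s 'http://localhost:50070/logLevel?log=org.apache.hadoop.hdfs.StateChange&level=DEBUG'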
APACHE HADOOP PRODUCTION DEPLOYMENT
The fastest way to deploy a Hadoop cluster is by using the
prepackaged tools in CDH. They include all the same software
as the Apache Hadoop distribution but are optimized to run in
production servers and with tools familiar to system administrators.
Hot Tip
Detailed guides that complement this Refcard are available from Cloudera at http://is.gd/RBWuxm and from Apache at http://is.gd/ckUpu1.
[Figure 3 - Hadoop Computational Network: the NameNode tracks jobs and names, assigns work, and collects status; DataNodes run Map and Reduce tasks against HDFS blocks with local writes and produce the output; the user application supplies the Mapper and Reducer as Java classes or scripts.]
The deployment diagram in Figure 3 describes all the participating nodes in a computational network. The basic procedure for deploying a Hadoop cluster is:
• Pick a Hadoop distribution
• Prepare a basic configuration on one node
• Deploy the same pre-configured package across all machines in the cluster
• Configure each machine in the network according to its role
The Apache Hadoop documentation shows this as a rather involved process. The value added in CDH is that most of that work is already in place. Role-based configuration is very easy to accomplish. The rest of this Refcard will be based on CDH.
Handling Multiple Configurations: Alternatives
Each server's role will be determined by its configuration, since they will all have the same software installed. CDH supports the Ubuntu and Red Hat mechanisms for handling alternative configurations.
Hot Tip
Check the man page to learn more about alternatives.
Ubuntu: man update-alternatives
Red Hat: man alternatives
The Linux alternatives mechanism ensures that all files associated with a specific package are selected as a system default. This customization is where all the extra work went into CDH. The CDH installation uses alternatives to set the effective CDH configuration.
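To see which configuration directory is currently active, a query like the one below should work; it assumes the CDH3 alternative name hadoop-0.20-conf used in Listing 13.

# Sketch: inspect the active Hadoop configuration alternative (Ubuntu)
update-alternatives --display hadoop-0.20-conf
# Red Hat equivalent:
# alternatives --display hadoop-0.20-conf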
Setting Up the Production Configuration
Listing 13 takes a basic Hadoop configuration and sets it up for production.
Listing 13 - Set the Production Configuration
ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
  "$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
  hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
  "$h" restart; done
The server will restart all the Hadoop daemons using the new production configuration.
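A quick, hedged way to confirm the switch is to list the Hadoop Java processes and check where the configuration symlink now points; the paths assume the CDH 0.20 packages used above.

# Sketch: confirm the daemons are up and using the production configuration
ps -ef | grep java | grep 'hadoop'          # the Hadoop daemons should be running
ls -l /etc/hadoop-0.20/conf                 # should resolve to conf.prod via alternatives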
[Figure 4 - Hadoop Conceptual Topology: a single NameNode fronting HDFS, with multiple DataNodes behind it.]
Readying the NameNode for Hadoop
Pick a node from the cluster to act as the NameNode (see Figure 3). All Hadoop activity depends on having a valid R/W file system. Format the distributed file system from the NameNode, using user hdfs:
Listing 14 - Create a New File System
sudo -u hdfs hadoop namenode -format
Stop all the nodes to complete the file system, permissions, and ownership configuration. Optionally, set daemons for automatic startup using rc.d.
Listing 15 - Stop All Daemons
# Run this in every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do \
  "$h" stop ;\
  # Optional command for auto-start:
  update-rc.d "$(basename "$h")" defaults; \
done
File System Setup
Every node in the cluster must be configured with appropriate directory ownership and permissions. Execute the commands in Listing 16 in every node:
Listing 16 - File System Setup
mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local
Starting the Cluster
• Start the NameNode to make HDFS available to all nodes
• Set the MapReduce owner and permissions in the HDFS volume
• Start the JobTracker
• Start all other nodes
CDH daemons are defined in /etc/init.d; they can be configured to start along with the operating system or they can be started manually. Execute the command appropriate for each node type using this example:
Listing 17 - Starting a Node Example
# Run this in every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do \
  "$h" start ; done
Use jobtracker, datanode, tasktracker, etc. corresponding to the
node you want to start or stop.
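For example, the JobTracker node only needs its own daemon started, and worker nodes only need theirs; the service names below follow the CDH 0.20 packaging convention used in this section.

# Sketch: start only the daemons that match this node's role
/etc/init.d/hadoop-0.20-jobtracker start     # on the JobTracker node
/etc/init.d/hadoop-0.20-tasktracker start    # on worker nodes
/etc/init.d/hadoop-0.20-datanode start       # on nodes that also store HDFS blocks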
Hot Tip
Refer to the Linux distribution's documentation for information on how to start the /etc/init.d daemons with the chkconfig tool.
Listing 18 - Set the MapReduce Directory Up
sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system
Update the Hadoop Configuration Files
Listing 19 - Minimal HDFS Config Update
<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  <final>true</final>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
  <final>true</final>
</property>
The last step consists of configuring the MapReduce nodes to find their local working and system directories:
Listing 20 - Minimal MapReduce Config Update
<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local</value>
  <final>true</final>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
  <final>true</final>
</property>
Start the JobTracker and all other nodes. You now have a working
Hadoop cluster. Use the commands in Listing 11 to validate that
it’s operational.
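A short health check like the hedged sketch below summarizes HDFS capacity and live DataNodes; run it from any node with the client configuration, as the hdfs user.

# Sketch: quick cluster health summary
sudo -u hdfs hadoop dfsadmin -report | head -n 20   # capacity and live DataNodes
sudo -u hdfs hadoop fs -ls /                        # root of the distributed file system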
WHAT'S NEXT?
The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework and requires attention to configure and maintain. Review the Apache Hadoop and Cloudera CDH documentation. Pay particular attention to the sections on:
• How to write MapReduce, Pig, or Hive applications
• Multi-node cluster management with ZooKeeper
• Hadoop ETL with Sqoop and Flume
Happy Hadoop computing!
STAYING CURRENT
Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter:
http://ciurana.eu/scalablesystems


ABOUT THE AUTHOR
Eugene Ciurana (http://eugeneciurana.eu) is the VP of Technology at Badoo.com, the largest dating site worldwide, and cofounder of SOBA Labs, the most sophisticated public and private clouds management software. Eugene is also an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability systems. He recently built scalable computational networks for leading financial, software, insurance, SaaS, government, and healthcare companies in the US, Japan, Mexico, and Europe.
Publications
• Developing with Google App Engine, Apress
• DZone Refcard #117: Getting Started with Apache Hadoop
• DZone Refcard #105: NoSQL and Data Scalability
• DZone Refcard #43: Scalability and High Availability
• The Tesla Testament: A Thriller, CIMEntertainment
Thank You!
Thanks to all the technical reviewers, especially to Pavel Dovbush at http://dpp.su
RECOMMENDED BOOK
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open-source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems; programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
BUY NOW
http://oreilly.com/catalog/0636920010388/
