This post will help users learn important and useful Hadoop shell commands, which are very
close to Unix shell commands. Using these commands, users can perform different operations on
HDFS. People who are familiar with Unix shell commands can pick them up easily; people who
are new to Unix or Hadoop need not worry. Just follow this article to learn the commands used
on a day-to-day basis and practice them.
FS Shell
The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All the FS shell commands take
path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme
is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional; if
not specified, the default scheme from the configuration is used. An HDFS file or
directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply
as /parent/child.
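For example, assuming a NameNode host named namenodehost, the following two commands list the same HDFS directory (a small illustration, not from the original text):
hadoop fs -ls hdfs://namenodehost/parent/child
hadoop fs -ls /parent/child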
Administrator Commands:
fsck /
Runs the HDFS filesystem checking utility.
version
Prints the Hadoop version configured on the machine.
ls
For a file, returns stat on the file in the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children, as in Unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
lsr
Recursive version of ls. Similar to Unix ls -R.
mv
Moves files from source to destination. This command allows multiple sources as well in which
case the destination needs to be a directory. Moving files across file systems is not permitted.
put
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads
input from stdin and writes to destination filesystem.
rm
Deletes files specified as args. It cannot delete non-empty directories; refer to rmr for
recursive deletes.
rmr
Recursive version of delete.
cat
The cat command concatenates and displays files; it works like the Unix cat command.
chmod
Change the permissions of files. With -R, make the change recursively through the directory
structure. The user must be the owner of the file, or else a super-user.
chown
Change the owner of files. With -R, make the change recursively through the directory structure.
The user must be a super-user.
copyFromLocal
Copies a file from the local machine into the given Hadoop (HDFS) directory.
copyToLocal
Copies a file from a Hadoop (HDFS) directory into the given local directory.
cp
Copy files from source to destination. This command allows multiple sources as well in which
case the destination must be a directory.
dus
Displays a summary of file lengths.
count
Count the number of directories, files and bytes under the paths that match the specified file
pattern
setrep
Changes the replication factor of a file. -R option is for recursively increasing the replication
factor of files within a directory.
stat
Returns the stat information on the path.
tail
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.
text
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
touchz
Create a file of zero length.
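As a quick illustration, here is a short hedged session exercising a few of these commands; the paths and file names are placeholders, not from the original post:
hadoop fs -put localfile.txt /user/hduser/input
hadoop fs -ls /user/hduser/input
hadoop fs -cat /user/hduser/input/localfile.txt
hadoop fs -setrep 2 /user/hduser/input/localfile.txt
hadoop fs -rm /user/hduser/input/localfile.txt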
Please find below the complete step-by-step process for installing the Hadoop 2.2.0 stable version on
Ubuntu, as requested by many of this blog's visitors, friends, and subscribers.
The Apache Hadoop 2.2.0 release has significant changes compared to its previous stable release,
which is Apache Hadoop 1.2.1 (setting up Hadoop 1.2.1 can be found here).
In short, this release has a number of changes compared to version 1.2.1:
The JobTracker has been replaced with the ResourceManager and NodeManager (YARN)
Before starting to set up Apache Hadoop 2.2.0, please understand the concepts of Big Data
and Hadoop from my previous blog posts:
In this tutorial you will learn the step-by-step process for setting up a Hadoop single-node cluster,
so that you can play around with the framework and learn more about it.
In this tutorial we are using the following software versions; you can download them by clicking
the hyperlinks:
If you are using PuTTY to access your Linux box remotely, please install OpenSSH by running the
command below; this also helps in configuring SSH access easily in the later part of the installation:
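The command itself appears to have been dropped from this copy of the post; presumably it is the standard package install:
sudo apt-get install openssh-server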
Before installing any applications or software, please make sure your list of
packages from all repositories and PPAs is up to date; if it is not, update it by using this
command:
sudo apt-get update
a. Download the latest Oracle Java (Linux x64) version from the Oracle website by using this command:
wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz
If it fails to download, please retry with the given command, which helps avoid passing a
username and password.
c. Create a Java directory using mkdir under /usr/local/ and change into it by using these
commands:
mkdir -p /usr/local/Java
cd /usr/local/Java
d. Copy the Oracle Java binaries into the /usr/local/Java directory.
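The copy and extraction commands are not shown in the post; a minimal sketch, assuming the tarball was downloaded to your home directory:
sudo cp -r ~/jdk-7u45-linux-x64.tar.gz /usr/local/Java
cd /usr/local/Java
sudo tar -xvzf jdk-7u45-linux-x64.tar.gz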
e. Edit the system PATH file /etc/profile and add the following system variables to your system
path
f. Scroll down to the end of the file using your arrow keys and add the following lines to
the end of your /etc/profile file:
JAVA_HOME=/usr/local/Java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
g. Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This tells
the system that the new Oracle Java version is available for use.
The following commands notify the system that the Oracle Java JDK is available for use:
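The commands themselves did not survive in this copy; a sketch of the usual update-alternatives sequence for this JDK path:
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/Java/jdk1.7.0_45/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/Java/jdk1.7.0_45/bin/javac" 1
sudo update-alternatives --set java /usr/local/Java/jdk1.7.0_45/bin/java
sudo update-alternatives --set javac /usr/local/Java/jdk1.7.0_45/bin/javac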
h. Reload your system wide PATH /etc/profile by typing the following command:
. /etc/profile
Test to see if Oracle Java was installed correctly on your system:
java -version
a. Adding group:
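The group and user creation commands are missing from this copy; a minimal sketch that matches the hduser user and hadoop group referenced later in this guide:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser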
SSH key-based authentication is required so that the master node can log in to the
slave nodes (and the secondary node) to start and stop them, and also to the local machine if you
want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure
SSH access to localhost for the hduser user we created in the previous section.
Before this step you have to make sure that SSH is up and running on your machine and
configured to allow SSH public key authentication.
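b. Generate an RSA key pair with an empty passphrase for the hduser user; the command is presumably the standard one:
su - hduser
ssh-keygen -t rsa -P ""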
c. It will ask you to provide the file name in which to save the key; just press Enter so that it
will generate the key at /home/hduser/.ssh
d. Enable SSH access to your local machine with this newly created key.
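The command for this step is missing from this copy; presumably it appends the new public key to the authorized keys list:
cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
Then connect once so the host is recorded: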
ssh hduser@localhost
This will add localhost permanently to the list of known hosts
4. Disabling IPv6.
We need to disable IPv6 because Ubuntu uses 0.0.0.0 for various Hadoop-related configurations.
You will need to add the following settings using a root account:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
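These lines belong in /etc/sysctl.conf (an assumption based on the standard procedure); after saving the file, reload the settings and verify that IPv6 is off:
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A value of 1 means IPv6 is disabled.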
Hadoop Installation:
Go to Apache Downloads and download Hadoop version 2.2.0 (prefer to download a stable
version):
wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-2.2.0.tar.gz
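The extraction step is implied by the mv command that follows; presumably:
tar -xzf hadoop-2.2.0.tar.gz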
mv hadoop-2.2.0 hadoop
iv. Move the hadoop package to a location of your choice; I picked /usr/local for convenience.
v. Make sure to change the owner of all the files to the hduser user and hadoop group by using
this command:
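The move and ownership commands are not shown in this copy; a minimal sketch consistent with the /usr/local location and the hduser:hadoop ownership described above:
sudo mv hadoop /usr/local/
sudo chown -R hduser:hadoop /usr/local/hadoop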
The following are the required files we will use for the perfect configuration of the single node
Hadoop cluster.
a. yarn-site.xml:
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
e. Update $HOME/.bashrc
cd /usr/local/hadoop/etc/hadoop
a. yarn-site.xml:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
b. core-site.xml:
i. Change the user to hduser. Change the directory to /usr/local/hadoop/etc/hadoop and edit the
core-site.xml file.
vi core-site.xml
ii. Add the following entry to the file and save and quit the file:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
c. mapred-site.xml:
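Note: a stock Hadoop 2.2.0 install ships only mapred-site.xml.template in etc/hadoop; if mapred-site.xml does not exist yet, create it from the template first (an assumption based on the standard 2.2.0 layout):
cp mapred-site.xml.template mapred-site.xml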
i. Edit the mapred-site.xml file:
vi mapred-site.xml
ii. Add the following entry to the file and save and quit the file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
d. hdfs-site.xml:
i. Edit the hdfs-site.xml file:
vi hdfs-site.xml
ii. Create two directories to be used by the namenode and datanode:
mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode
mkdir -p $HADOOP_HOME/yarn_data/hdfs/datanode
iii. Add the following entry to the file and save and quit the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>
</configuration>
e. Update $HOME/.bashrc
i. Go back to the root and edit the .bashrc file.
vi .bashrc
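The entries to add are not shown in this copy; a minimal sketch, assuming the Java and Hadoop locations used earlier in this guide:
export JAVA_HOME=/usr/local/Java/jdk1.7.0_45
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
. .bashrc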
i. The first step in starting up your Hadoop installation is formatting the Hadoop filesystem,
which is implemented on top of the local filesystem of your cluster. You need to do this the first
time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose
all the data currently in the cluster (in HDFS). To format the filesystem (which simply
initializes the directory specified by the dfs.namenode.name.dir variable), run the following
command on the name node:
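A sketch of the format and startup commands, assuming $HADOOP_HOME/bin and $HADOOP_HOME/sbin are on your PATH:
hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps
jps should list the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager daemons. To stop the daemons later: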
stop-dfs.sh
stop-yarn.sh
Hadoop Web Interfaces:
Hadoop comes with several web interfaces which are by default available at these locations:
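The locations themselves did not survive in this copy; for Hadoop 2.2.0 the usual defaults are (assumed, not from the original):
NameNode web UI: http://localhost:50070/
ResourceManager web UI: http://localhost:8088/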
With this, we are done setting up a single-node Hadoop 2.2.0 cluster; hope this step-by-step
guide helps you set up the same environment at your end.
Please leave a comment or suggestion in the comment section; I will try to answer as soon as
possible. Don't forget to subscribe to the newsletter and like the Facebook page.
Setting up Hive
Posted on October 28, 2013 by aravindu012
As I said earlier, Apache Hive is an open-source data warehouse infrastructure built on top of
Hadoop for providing data summarization, querying, and analysis of large datasets stored in
Hadoop files. It was developed at Facebook, and it provides:
Access to files stored either directly in Apache HDFS or in other data storage systems
such as Apache HBase
Query execution via MapReduce
In this post we will get to know how to set up Hive on top of a Hadoop cluster.
Objective
The objective of this tutorial is to set up Hive and run HiveQL scripts.
Prerequisites
You should have the latest stable build of Hadoop up and running; to install Hadoop, please check
my previous blog article on Hadoop Setup.
Setting up Hive:
Procedure
1. Download a stable version of Hive from the Apache download mirrors. For this tutorial we
are using Hive 0.12.0; this release works with Hadoop 0.20.X, 1.X, 0.23.X, and 2.X.
wget http://apache.osuosl.org/hive/hive-0.12.0/hive-0.12.0.tar.gz
2. Unpack the compressed Hive archive in the home directory:
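Presumably the standard tar extraction, run in the directory where the archive was downloaded:
tar -xzf hive-0.12.0.tar.gz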
3. Create a hive directory under the /usr/local directory as the root user and change the ownership
to hduser as shown; this is for our convenience, to differentiate each framework, software
package, and application with a different user.
cd /usr/local
mkdir hive
sudo chown -R hduser:hadoop /usr/local/hive
4. Log in as hduser and move the uncompressed hive-0.12.0 to the /usr/local/hive folder:
mv hive-0.12.0/ /usr/local/hive
vi .bashrc
export HIVE_HOME='/usr/local/hive/hive-0.12.0'
export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
. .bashrc
hive
9. Create a table in Hive with the following command. Also, after creating it, check that the table exists.
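A minimal hedged example of such a session (the table name and columns are hypothetical):
hive> CREATE TABLE test (id INT, name STRING);
hive> SHOW TABLES;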
From this output we know that Hive was set up correctly on top of the Hadoop cluster; it's time
to learn HiveQL.
It supports queries expressed in a language called HiveQL, which automatically translates SQL-
like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL allows custom
MapReduce scripts to be plugged into queries. Hive also enables data
serialization/deserialization and increases flexibility in schema design by including a system
catalog called the Hive Metastore.
According to the Apache Hive wiki, Hive is not designed for OLTP workloads and does not
offer real-time queries or row-level updates. It is best used for batch jobs over large sets of
append-only data (like web logs).
Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary
key/value pairs), and RCFiles (Record Columnar Files, which store the columns of a table in a
columnar fashion).
There is a SQL Developer/Toad-like tool named HiveDeveloper, developed by Stratapps Inc.,
which gives users the power to visualize their data stored in Hadoop as table views and perform
many more operations.
In my next blog post I will explain how to set up Hive on top of a Hadoop cluster; before
that, please check how to set up Hadoop in my previous blog post so that you will be ready to
configure Hive on top of it.
Setting up Pig
Posted on October 28, 2013 by aravindu012
Apache Pig is a high-level procedural language platform developed to simplify querying large
data sets in Apache Hadoop and MapReduce. Pig is popular for performing query operations in
Hadoop using the Pig Latin language, a layer that enables SQL-like queries to be performed on
distributed datasets within Hadoop applications. Thanks to its simple interface and its support
for complex operations such as joins and filters, it has the following key properties:
Ease of programming: Pig programs are easy to write, yet they accomplish huge tasks
just as hand-written MapReduce programs do.
Optimization: The system optimizes the execution of Pig jobs automatically, allowing the
user to focus on semantics rather than efficiency.
Extensibility: Pig users can write their own user-defined functions (UDFs) for special-
purpose processing, as required, using Java, Python, or JavaScript.
Objective
The objective of this tutorial is to set up Pig and run Pig scripts.
Prerequisites
The following are the prerequisites for setting up Pig and running Pig scripts:
You should have the latest stable build of Hadoop up and running; to install Hadoop,
please check my previous blog article on Hadoop Setup.
Setting up Pig
Procedure
1. Download a stable version of Pig from the Apache download mirrors. For this tutorial we
are using pig-0.11.1; this release works with Hadoop 0.20.X, 1.X, 0.23.X, and 2.X.
wget http://apache.mirrors.hoobly.com/pig/pig-0.11.1/pig-0.11.1.tar.gz
cp -r pig-0.11.1.tar.gz /usr/local/pig
cd /usr/local/pig
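The extraction step is implied by the pig-0.11.1 directory used in PIG_HOME below; presumably:
tar -xzf pig-0.11.1.tar.gz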
export PIG_HOME=<path_to_pig_home_directory>
e.g.
export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH
6. Set the environment variable JAVA_HOME to point to the Java installation directory, which
Pig uses internally.
export JAVA_HOME=<<Java_installation_directory>>
Execution Modes
Pig has two modes of execution: local mode and MapReduce mode.
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets
that a single machine can handle. It runs in a single JVM and accesses the local filesystem.
$ pig -x local
grunt>
MapReduce Mode
This is the default mode; Pig translates the queries into MapReduce jobs, which requires access to
a Hadoop cluster.
$ pig
grunt>
You can see log reports from Pig stating the filesystem and JobTracker it connected to. Grunt
is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script,
via Grunt, or by embedding the script into Java code. Running in the interactive shell is shown in
the Problem section. To run a batch of Pig scripts, it is recommended to place them in a single file
with the .pig extension and execute them in batch mode, as sketched below; I will explain them in
depth in coming posts.
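A minimal sketch of batch mode (the script contents and file name are hypothetical):
echo "lines = LOAD 'input.txt' AS (line:chararray); DUMP lines;" > sample.pig
pig -x local sample.pig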
How it works:
Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System
(HDFS). The language for the platform is called Pig Latin, which abstracts from the Java
MapReduce idiom into a form similar to SQL. Pig Latin is a flow language which allows you to
write a data flow that describes how your data will be transformed. Since Pig Latin scripts can
be graphs it is possible to build complex data flows involving multiple inputs, transforms, and
outputs. Users can extend Pig Latin by writing their own User Defined functions, using Java,
Python, Ruby, or other scripting languages.
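As a small illustration of such a data flow, here is a classic word-count sketch in Grunt (the input file name is hypothetical):
grunt> raw = LOAD 'input.txt' AS (line:chararray);
grunt> words = FOREACH raw GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> counts = FOREACH grouped GENERATE group, COUNT(words);
grunt> DUMP counts;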
We will discuss more about Pig, setting up Pig with Hadoop, and running Pig Latin scripts in
local and MapReduce modes in my next posts.
This document helps you configure a Hadoop cluster with the help of the Cloudera VM in
pseudo-distributed mode, using VMware Player on a user machine, for practice.
Step 1: Download the VMware player from the link shown and install it as shown in the
images.
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0|PLAYER-600-A|product_downloads
Step 2: Download the Cloudera Setup File from the given url and extract that zipped file onto
your hard drive.
URL to download the Cloudera VM:
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo&version=1
There are two reasons why you might be getting this error:
1. Your CPU doesn't support hardware virtualization.
2. Your CPU does support it, but you have it disabled in the BIOS.
1. Reboot the computer and open the system's BIOS menu. This can usually be done by pressing
the Delete key, the F1 key, or the Alt and F4 keys, depending on the system.
2. Enabling the Virtualization extensions in the BIOS
Many of the steps below may vary depending on your motherboard, processor type, chipset, and
OEM. Refer to your system's accompanying documentation for the correct information on
configuring your system.
a. Open the Processor submenu. The processor settings menu may be hidden in the Chipset,
Advanced CPU Configuration, or Northbridge menus.
b. Enable Intel Virtualization Technology (also known as Intel VT-x). AMD-V extensions cannot
be disabled in the BIOS and should already be enabled. The Virtualization extensions may be
labeled Virtualization Extensions, Vanderpool or various other names depending on the OEM
and system BIOS.
c. Enable Intel VT-d or AMD IOMMU, if the options are available. Intel VT-d and AMD
IOMMU are used for PCI device assignment.
Login credentials:
a. Username: admin
b. Password: admin
Click on the black box shown below in the image to start a terminal.
Step 4: Checking your Hadoop cluster
Type: sudo jps to see if all nodes are running (if you see an error, wait for some time and then
try again; your threads have not started yet).
Type: sudo su hdfs
Execute your command, e.g. hadoop fs -ls /
Step 5: Download the list of Hadoop commands for reference from the given URL.
The following document describes the required steps for setting up a distributed multi-node
Apache Hadoop cluster on two Ubuntu machines. The best way to install and set up a multi-node
cluster is to start by installing two individual single-node Hadoop clusters, following my previous
tutorial on setting up a Hadoop single-node cluster on Ubuntu, and then merge them together with
minimal configuration changes, in which one Ubuntu box becomes the designated master and
the other box becomes a slave; we can add any number of slaves as per our future requirements.
Please follow my previous blog post for setting up a Hadoop single-node cluster on Ubuntu.
1. Prerequisites
i. Networking
Networking plays an important role here. Before merging both single-node servers into a multi-
node cluster, we need to make sure that both nodes can ping each other (they need to be connected
on the same network/hub, so that both machines can speak to each other). Once we are done with
this process, we move to the next step of selecting the master node and slave node; here
we are selecting 172.16.17.68 as the master machine (Hadoopmaster) and 172.16.17.61 as a slave
(hadoopnode). Then we need to add them to the /etc/hosts file on each machine as follows.
sudo vi /etc/hosts
172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode
Note: When more slaves are added, they should be listed here on each machine, using a unique
name per slave (e.g. 172.16.17.xx hadoopnode01, 172.16.17.xy hadoopnode02, and so on).
If you can see the below output when you run the given commands on both master and slave, then
it is configured correctly.
ssh Hadoopmaster
ssh hadoopnode
2. Configurations:
The following are the required files we will use for the perfect configuration of the multi node
Hadoop cluster.
a. masters
b. slaves
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
a. masters:
vi masters
Hadoopmaster
b. slaves:
Lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers)
will be running, as shown:
Hadoopmaster
hadoopnode
If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.
We need to use the same configuration on all the nodes of the Hadoop cluster, i.e., we need to
edit all the *-site.xml files on each and every server accordingly.
c. core-site.xml:
We are changing the host name from localhost to Hadoopmaster, which specifies the
NameNode (the HDFS master) host and port.
vi core-site.xml
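The resulting entry would look like the single-node one with the host swapped in (port 9000 is carried over from the single-node setup as an assumption):
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://Hadoopmaster:9000</value>
</property>
</configuration>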
d. hdfs-site.xml:
We are changing the replication factor to 2. The default value of dfs.replication is 3; however,
we have only two nodes available, so we set dfs.replication to 2.
vi hdfs-site.xml
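The resulting entry (a sketch mirroring the single-node file with the new value):
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>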
e. mapred-site.xml:
We are changing the host name from localhost to Hadoopmaster, which specifies the
JobTracker (MapReduce master) host and port.
vi mapred-site.xml
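A sketch of the resulting entry; the property name mapred.job.tracker and port 54311 are assumptions based on typical Hadoop 1.x multi-node setups, since the original value is not shown:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>Hadoopmaster:54311</value>
</property>
</configuration>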
The first step in starting up your multi-node Hadoop cluster is formatting the Hadoop filesystem,
which is implemented on top of the local filesystem of your cluster. To format the filesystem
(which simply initializes the directory specified by the dfs.name.dir variable), run the given
command.
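The command itself is missing from this copy; presumably, run on the Hadoopmaster machine:
hadoop namenode -format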
We begin by starting the HDFS daemons first: the NameNode daemon is started on
Hadoopmaster, and DataNode daemons are started on all nodes (slaves).
Then we start the MapReduce daemons: the JobTracker is started on Hadoopmaster, and
TaskTracker daemons are started on all nodes (slaves).
start-dfs.sh
start-mapred.sh
This will bring up the MapReduce cluster with the JobTracker running on the machine you ran
the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
By running the jps command, we will see the list of Java processes, including the JobTracker and
TaskTracker, running on the master and slaves:
c. To stop MapReduce daemons:
stop-mapred.sh
d. To stop HDFS daemons:
stop-dfs.sh
With this, we are done setting up a multi-node Hadoop cluster; hope this step-by-step guide helps
you set up the same environment at your place.
Please leave a comment in the comment section with your doubts, questions, and suggestions; I
will try to answer as soon as possible.
Flexible: Hadoop is schema-less and can absorb any type of data, structured or not,
from any number of sources. Data from multiple sources can be joined and aggregated in
arbitrary ways, enabling deeper analyses than any one system can provide.
Reliable: When you lose a node, the system redirects work to another location of the
data and continues processing without missing a beat. (Credit: Cloudera Blog)
It is also worth examining the applications for which using HDFS does not work so well. While
this may change in the future, these are areas where HDFS is not a good fit today:
References:
http://blog.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/
Parallel Processing:
Data resides on N servers, harnessing the power of those N servers, and can be
processed in parallel for analysis, which helps the user reduce the wait time to generate the
final report or analyzed data.
Fault Tolerance:
One of the primary reasons to use Big Data frameworks (e.g. Hadoop) to run
your jobs is their high degree of fault tolerance. Even when running jobs on a large
cluster where individual nodes or network components may experience high rates of
failure, Big Data frameworks can guide jobs toward successful completion because the data is
replicated onto multiple nodes/slaves.