
Hadoop Overview

What is Hadoop?
• Open source software platform for scalable, distributed computing, providing fast and reliable analysis of both structured and unstructured data.
• Can be thought of as a very big data warehouse, which is able to store and process data, with the ability to do analytics and reporting.

• Not just a single Open Source Project - Is a whole ecosystem of projects that work together to provide a common set of services.
• Transforms commodity hardware into a platform that allows
– storage of large amounts (petabytes) of data reliably.
– huge distributed computations.
• Some of the largest Hadoop clusters have
– 25 Petabytes of data
– 4500 machines
• Easy to program distributed applications
• A cluster is made up of many machines running Hadoop
• Reliability
– Data is typically held on multiple data nodes
– Tasks that fail are automatically reassigned.
• Scalability
– The same program can run on clusters of up to 1000 machines or more
– Scales linearly
Interesting facts about Hadoop....
• It is used by a number of popular organisations that handle big data, including:
– Yahoo (Yahoo’s Search runs on a 10,000-core Linux cluster)
– Facebook (their Hadoop cluster hosted 100+ PB of data by July 2012, growing at about 1/2 PB per day)
– Amazon
– Netflix
Main applications of Hadoop
• Advertisement (Mining user behaviour to generate recommendations).
• Searches (Group related documents)
• Security (Search for uncommon patterns)
• Streaming data (Velocity)
• Batch processing.
Terminology
• Job - A full program / An execution of a Mapper and Reducer across a dataset.
• Task - An execution of a Mapper or a Reducer on a slice of data.
According to modern trends...
• Queries may face the following complications:
– The data has to be searched over the internet (Volume).
– The data may come in varying formats from social networking sites and other unique scenarios (Variety).
– An example query that a company may need to answer is:
What is the biggest country of origin of likes that a company could be getting on its Facebook page? (Find the number of likes by country).
Setting up
Storage data
Hadoop storage

• Storage master only contains the metadata - Keeping a re


cord of where the data is located in the cluster.
• Also keeps track of which task is being processed where.
Hadoop processing

• The Java program that is to be executed is copied to every machine.
• The data being processed is local to the processing machine.
• Hadoop 1 has two components:
– HDFS (Hadoop Distributed File System) and
– MapReduce.
• Hadoop 2 also has two components:
– HDFS and
– YARN
Storage and processing architecture

In Hadoop 1, MapReduce was responsible for resource management as well as data processing.
Mapreduce Engine
• The Job tracker splits up data into smaller tasks (“Map”) and sends them to the Tasktracker process on each node.
• The Tasktracker reports back to the Job tracker node on progress, sends data (“Reduce”) or requests new jobs.
• In MapReduce 1, when the master stops working, all of its slave nodes automatically stop working as well and job execution is interrupted; this is a single point of failure.
• YARN overcomes this issue because of its architecture.
• In Hadoop 2 there is the concept of an active name node as well as a standby name node.
• When the active node stops working, the standby node takes over as the active node and continues the execution.
YARN

• MapReduce was replaced with YARN (MapReduce 2).
• YARN does not do the processing itself; it allocates resources to a chosen node in the cluster, recognised as the App Master for that specific job.
• In MapReduce 1, scalability was a bottleneck when the cluster size grew beyond 4000 nodes, because of the amount of load on the job tracker.
• YARN has the concept of a resource manager as well as node managers, which take care of resource management.
• The main idea is to split the job tracker responsibilities:
– Resource manager - Job scheduling
– Application master - Task monitoring
• Advantages:
– Increased scalability
– Ability to run more than one application framework on the same cluster
– Better memory utilisation through the use of containers
Processing in YARN

• Responsible for making resources available.
• MapReduce is divided into two important tasks, Map and Reduce.
• The Mapper takes a set of data and converts it into another set of data, in such a way that individual elements are stored as key/value pairs.
• In the reduce task, the output from a map is taken as input and the key/value pairs are combined into a smaller set of key/value pairs.
• Apache Hadoop is an open source software framework used to develop data processing applications which are executed in a distributed computing environment.
• Applications built using HADOOP are run on large data sets distributed across clusters of commodity computers.
• Commodity computers are cheap and widely available. They are mainly useful for achieving greater computational power at low cost.
• Similar to data residing in a local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System.
• The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data.
• This computational logic is nothing but a compiled version of a program written in a high-level language such as Java.
• Such a program processes data stored in Hadoop HDFS.
• A computer cluster consists of a set of multiple processing units (storage disk + processor) which are connected to each other and act as a single system.
Hadoop EcoSystem and Components

• Below diagram shows various components in the Hadoop ecosystem -
Apache Hadoop consists of two sub-projects –
• Hadoop MapReduce:
- MapReduce is a computational model and software framework for writing applications which are run on Hadoop.
- These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes.
• HDFS (Hadoop Distributed File System):
- HDFS takes care of the storage part of Hadoop applications.
- MapReduce applications consume data from HDFS.
- HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster.
- This distribution enables reliable and extremely rapid computations.
Hadoop Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed data processing using MapReduce and HDFS methods.
• NameNode:
• Represents all files and directories which are used in the namespace
• DataNode:
• DataNode helps you to manage the state of an HDFS node and allows you to interact with the blocks
• MasterNode:
• The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
• Slave node:
• The slave nodes are the additional machines in the Hadoop cluster which allow you to store data and conduct complex calculations.
• Moreover, every slave node comes with a Task Tracker and a DataNode.
• These allow it to synchronize its processes with the Job Tracker and the NameNode, respectively.
Features Of 'Hadoop'
Suitable for Big Data Analysis
• As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for analysis of Big Data.
• Since it is processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed.
• This concept is called the 'data locality' concept, which helps increase the efficiency of Hadoop-based applications.
Scalability
• HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes and thus allow for the growth of Big Data.
• Also, scaling does not require modifications to application logic.
Fault Tolerance
• The HADOOP ecosystem has a provision to replicate the input data on to other cluster nodes.
• That way, in the event of a cluster node failure, data processing can still proceed by using data stored on another cluster node.
Network Topology In Hadoop

• The topology (arrangement) of the network affects the performance of the Hadoop cluster when the size of the Hadoop cluster grows.
• In addition to the performance, one also needs to care about high availability and handling of failures.
• In order to achieve this cluster formation, Hadoop makes use of network topology.
Typical enterprise scenario
• In Hadoop, a network is represented as a tree, and the distance between nodes of this tree (number of hops) is considered an important factor in the formation of a Hadoop cluster.
• Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.
• A Hadoop cluster consists of a data center, the rack, and the node which actually executes jobs.
• Here, a data center consists of racks and a rack consists of nodes.
• Network bandwidth available to processes varies depending upon the location of the processes.
• That is, the available bandwidth becomes progressively smaller as we move from:
– Processes on the same node
– Different nodes on the same rack
– Nodes on different racks of the same data center
– Nodes in different data centers
What is MapReduce?
• MapReduce is a programming model suitable for processing huge volumes of data.
• Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
• MapReduce programs are parallel in nature, and thus are very useful for performing large-scale data analysis using multiple machines in the cluster.
• MapReduce programs work in two phases:
1. Map phase
2. Reduce phase.
• The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function (sketched in Java below).
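• As a hedged illustration only (the class names below are placeholders, not part of the slides), this is roughly what specifying those two functions looks like with Hadoop's Java API (org.apache.hadoop.mapreduce): each function receives and emits key-value pairs.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<input key, input value, output key, output value>:
// here each input record is a line of text keyed by its byte offset in the file.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit zero or more intermediate <key, value> pairs per input record.
    }
}

// The Reducer receives every intermediate key together with all of its values.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Combine the grouped values and emit the final <key, value> pair(s).
    }
}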
How MapReduce Works? Complete Process

• The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing.
• As an example, consider that you have the following input data for your MapReduce program:
– Welcome to Hadoop Class
– Hadoop is good
– Hadoop is bad
MapReduce Architecture
The final output of the MapReduce task is:

bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
The data goes through the following phases
Input Splits:
• An input to a MapReduce job is divided into fixed-size pieces called input splits.
• An input split is a chunk of the input that is consumed by a single map.
Mapping
• This is the very first phase in the execution of a map-reduce program. In this phase, data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (more details about input splits are given below) and prepare a list in the form of <word, frequency>.
Shuffling
• This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.
Reducing
• In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
• In our example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word.
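• As a hedged sketch only (class and variable names are my own, not taken from the slides), the Mapper and Reducer below implement the word-count logic described above: the map function emits <word, 1> pairs, and the reduce function sums the values that the shuffle phase has grouped under each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Mapping phase: one <word, 1> pair per word in the input split.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Reducing phase: the shuffle has already grouped all counts for this word.
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(word, new IntWritable(total));   // e.g. <Hadoop, 3>
    }
}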
MapReduce Architecture explained in detail
• One map task is created for each split, which then executes the map function for each record in the split.
• It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken for processing the whole input. When the splits are smaller, the processing is better load-balanced, since we are processing the splits in parallel.
• However, it is also not desirable to have splits too small in size. When splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an HDFS block (which is 64 MB, by default).
• Execution of map tasks results in writing output to a local disk on the respective node and not to HDFS.
• The reason for choosing local disk over HDFS is to avoid the replication which takes place in case of an HDFS store operation.
• Map output is intermediate output which is processed by reduce tasks to produce the final output.
• Once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
• The reduce task doesn't work on the concept of data locality. The output of every map task is fed to the reduce task. Map output is transferred to the machine where the reduce task is running.
• On this machine, the outputs are merged and then passed to the user-defined reduce function.
• Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth (a typical job set-up tying these pieces together is sketched below).
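• As an illustrative example only (assuming the WordCountMapper and WordCountReducer sketched earlier), a driver program configures and submits the job: the input path in HDFS is divided into splits, one map task runs per split, and the reduce output is written back to HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // map output goes to local disk
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation of map output
        job.setReducerClass(WordCountReducer.class);    // reduce output is written to HDFS

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input is divided into splits (by default, one per HDFS block).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}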
How MapReduce Organizes Work?
• Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)

• The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Jobtracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing part of the job

• For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple Tasktrackers which reside on Datanodes.
• A job is divided into multiple tasks which are then run on multiple data nodes in a cluster.
• It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
• The task tracker's responsibility is to send the progress report to the job tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.
• Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
What is HDFS?
• HDFS is a distributed file system for storing very large data files, running on clusters of commodity hardware.
• It is fault tolerant, scalable, and extremely simple to expand.
• Hadoop comes bundled with HDFS (Hadoop Distributed File System).
• When data exceeds the capacity of storage on a single physical machine, it becomes essential to divide it across a number of separate machines.
• A file system that manages storage-specific operations across a network of machines is called a distributed file system.
• HDFS is one such software.
HDFS Architecture
• An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
NameNode:
• The NameNode can be considered the master of the system.
• It maintains the file system tree and the metadata for all the files and directories present in the system.
• Two files, the 'Namespace image' and the 'edit log', are used to store metadata information.
• The Namenode has knowledge of all the datanodes containing data blocks for a given file; however, it does not store block locations persistently.
• This information is reconstructed every time from the datanodes when the system starts.
DataNode:
• DataNodes are slaves which reside on each machine in a cluster and provide the actual storage.
• They are responsible for serving read and write requests for the clients.
• Read/write operations in HDFS operate at a block level.
• Data files in HDFS are broken into block-sized chunks, which are stored as independent units.
• The default block size is 64 MB.
• HDFS operates on a concept of data replication wherein multiple replicas of data blocks are created and distributed on nodes throughout a cluster to enable high availability of data in the event of node failure.
• A file in HDFS which is smaller than a single block does not occupy a block's full storage.
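• As a brief, hedged illustration of blocks and replication from the client side (the file path is a placeholder of my own), the Java FileSystem API can report where the replicas of each block of a file are stored:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                  // e.g. hdfs://localhost:9000/input_dir/input_file.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path path = new Path(uri);
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size: " + status.getBlockSize()
                + ", replication: " + status.getReplication());

        // One BlockLocation per block; getHosts() lists the DataNodes holding replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}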
Read Operation In HDFS
• A data read request is served by HDFS, the NameNode, and the DataNodes. Let's call the reader a 'client'.
• The diagram below depicts the file read operation in Hadoop.
1. A client initiates a read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams wherein the client invokes the 'read()' method repeatedly. This process of read() operations continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method.
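• The read path described above can be exercised directly with Hadoop's FileSystem API; the snippet below is a minimal, hedged sketch (the class name and example path are my own), in which open() returns the FSDataInputStream that internally talks to the NameNode and DataNodes:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                                  // e.g. hdfs://localhost:9000/temp.txt
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());

        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));                       // steps 1-4: metadata lookup, stream returned
            IOUtils.copyBytes(in, System.out, 4096, false);    // steps 5-6: read block after block
        } finally {
            IOUtils.closeStream(in);                           // step 7: close
        }
    }
}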
Write Operation In HDFS
• A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file - step no. 1 in the above diagram.
1. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not already exist and that the client has correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
2. Once the new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).
3. FSDataOutputStream contains a DFSOutputStream object which looks after communication with the DataNodes and NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue which is called the DataQueue.
4. There is one more component called DataStreamer which consumes this DataQueue. DataStreamer also asks the NameNode for allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
5. Now, the process of replication starts by creating a pipeline using DataNodes. In our case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
6. The DataStreamer pours packets into the first DataNode in the pipeline.
7. Every DataNode in the pipeline stores the packet received by it and forwards the same to the second DataNode in the pipeline.
8. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from the DataNodes.
9. Once acknowledgment for a packet in the queue is received from all DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
10. After the client is done writing data, it calls a close() method (step 9 in the diagram). The call to close() results in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgment.
11. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
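• A corresponding hedged sketch of the client side of the write path (class name and file names are placeholders of my own): create() asks the NameNode to create the file, and the returned FSDataOutputStream packages the written bytes into packets that are pushed down the DataNode pipeline.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];                              // e.g. /tmp/temp.txt
        String dst = args[1];                                   // e.g. hdfs://localhost:9000/temp.txt

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        FileSystem fs = FileSystem.get(URI.create(dst), new Configuration());

        // Steps 1-2: the NameNode records the new file, a stream is returned to the client.
        FSDataOutputStream out = fs.create(new Path(dst));

        // Steps 3-11: bytes are packetised, replicated down the pipeline, and acknowledged.
        IOUtils.copyBytes(in, out, 4096, true);                 // true = close both streams when done
    }
}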
Access HDFS using JAVA API

• In order to interact with Hadoop's filesystem programmatically, Hadoop provides multiple JAVA classes.
• The package named org.apache.hadoop.fs contains classes useful in the manipulation of a file in Hadoop's filesystem.
• These operations include open, read, write, and close.
• Actually, the file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.
Reading a file from HDFS, programmatically
• The object java.net.URL is used for reading the contents of a file.
• To begin with, we need to make Java recognize Hadoop's hdfs URL scheme.
• This is done by calling the setURLStreamHandlerFactory method on the URL class, passing it an instance of FsUrlStreamHandlerFactory.
• This method needs to be executed only once per JVM, hence it is enclosed in a static block.
public class URLCat {
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
public static void main(String[] args) throws Exception {
InputStream in = null;
try {
in = new URL(args[0]).openStream();
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}
• This code opens and reads the contents of a file. The path of this file on HDFS is passed to the program as a command line argument.
Access HDFS Using COMMAND-LINE INTERFACE

• This is one of the simplest ways to interact with HDFS.
• The command-line interface has support for filesystem operations like reading a file, creating directories, moving files, deleting data, and listing directories.
• We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every command.
• Here, 'dfs' is a shell command of HDFS which supports multiple subcommands.
• Some of the widely used commands are listed below along with some details of each one.
1. Copy a file from the local filesystem to HDFS
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /
This command copies the file temp.txt from the local filesystem to HDFS.

2. We can list files present in a directory using -ls
$HADOOP_HOME/bin/hdfs dfs -ls /
• We can see the file 'temp.txt' (copied earlier) being listed under the ' / ' directory.

3. Command to copy a file to the local filesystem from HDFS
$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt
• We can see temp.txt copied to the local filesystem.

4. Command to create a new directory
$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

5. Check whether the directory was created or not.
• Get directory listing of user home directory in hdfs
- hadoop fs -ls
• Get directory listing of root directory
- hadoop fs -ls /
• Copy a local file to hdfs (user home directory)
- hadoop fs -copyFromLocal foo.txt foo.txt
• Move a file from hdfs to local disk
- hadoop fs -copyToLocal /user/geert/foo.txt foo.txt
• Display the contents of a file
- hadoop fs -cat /data/as400/customers.txt
• Make a new directory (user home directory)
- hadoop fs -mkdir output
YARN
• Resource Manager (RM): It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and memory) among all the applications. It arbitrates system resources between competing applications.
• Scheduler: The scheduler is responsible for allocating resources to the running applications.
• Application Manager: It manages running Application Masters in the cluster, i.e., it is responsible for starting application masters and for monitoring and restarting them on different nodes in case of failures.
• Node Manager (NM): It is the slave daemon of YARN. The NM is responsible for containers, monitoring their resource usage and reporting the same to the ResourceManager.
• Application Master (AM): One application master runs per application. It negotiates resources from the resource manager and works with the node manager. It manages the application life cycle.
• Container: Containers execute tasks as specified by the Application Master.
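• To make this division of labour concrete, the hedged sketch below (illustrative only, not part of the original slides) uses the YARN client API to ask the Resource Manager for the applications it currently knows about; the cluster address comes from the usual yarn-site.xml configuration.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());      // picks up yarn-site.xml from the classpath
        yarn.start();

        // The Resource Manager tracks every application and its Application Master.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}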
Hadoop is not....
• Hadoop is not a database.
• Hadoop is an open source framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware.
• It does have a storage component called HDFS (Hadoop Distributed File System) which stores files used for processing, but HDFS does not qualify as a relational database; it is just a storage model.
• There are components like Hive which can work on top of HDFS and which allow users to query the HDFS storage using SQL-like queries (HiveQL), but those are just SQL-like queries and do not make HDFS or Apache Hadoop a database or relational database.
• Apache Hadoop uses the key-value pair as its fundamental unit for data storage.
Some of Hadoop’s Projects
• PIG
– A high-level language; jobs written using its high-level description are translated into MapReduce.
• HIVE
– Takes a computation expressed in an SQL-like language, translates it into MapReduce and executes the instructions. Used extensively at Facebook.
• HBASE
– Provides a simple interface to your distributed data. Can be accessed by HIVE, PIG and MapReduce.
– Used for applications such as Facebook messages.
• ZOOKEEPER
– Provides coordination services for different servers.
• HCatalog
– Is a metadata store that stores information about the tables in HIVE that HIVE can process. Can be used by PIG and MapReduce.
Installing Hadoop on Windows
Step 1: Installation Software Download
Download

1. Hadoop 2.8.0 (Link:
http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
OR
http://archive.apache.org/dist/hadoop/core//hadoop-2.8.0/hadoop-2.8.0.tar.gz )

2. Java JDK 1.8.0.zip (Link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)

• These instructions are also available online at:
https://github.com/MuhammadBilalYar/Hadoop-On-Window/wiki/Step-by-step-Hadoop-2.8.0-installation-on-Window-10
Hadoop installation
• Extract the file hadoop-2.8.0.tar.gz or hadoop-2.8.0.zip and place it under "C:\Hadoop-2.8.0".
• NB: You may need to extract the file as an administrator; otherwise the extraction might fail with an error.

Step 2: Installing JAVA to root drive
• Install JAVA under C:\Java, if it is not already installed in that specific location.
• When the installation completes, check it by typing:
javac -version
inside the command prompt.
Step 3: Setting HADOOP_HOME
• Set the HADOOP_HOME environment variable on Windows 10 (see steps 1, 2, 3 and 4 below).

Step 4: Setting up JAVA_HOME
• Set the JAVA_HOME environment variable on Windows 10 (see steps 1, 2, 3 and 4 below).

Step 5: Hadoop and Java directory paths
• Next we set the Hadoop bin directory path and the JAVA bin directory path.
Step 6: Configuration: core-site.xml

• Edit the file C:/Hadoop-2.8.0/etc/hadoop/core-site.xml, paste the xml paragraph below and save the file.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Step 7: Rename "mapred-site.xml.template" to "mapred-site.xml" and edit

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 8: Configuration - hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>
Step 9: Configuration - yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 10: Edit file C:/Hadoop-2.8.0/etc/hadoop/hadoop-env.cmd

• Replace the line set JAVA_HOME=%JAVA_HOME% with set JAVA_HOME=C:\Java (where C:\Java is the path to the JDK 1.8.0 installation).
Step 11: Create folder "data" under "C:\Hadoop-2.8.0"

• Create folder "datanode" under "C:\Hadoop-2.8.0\data"


• Create folder "namenode" under "C:\Hadoop-2.8.0\dat
a"
Step 12: Replace the bin folder of hadoop-2.8.0

1. Download the file Hadoop Configuration.zip (Link:
https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/Hadoop%20Configuration.zip)

2. Delete the bin folder at C:\Hadoop-2.8.0\bin and replace it with the bin folder from the Hadoop Configuration.zip just downloaded.
Step 13: Testing
• Open cmd (in privileged mode) and type the command "hdfs namenode –format".
• Open cmd, change directory to "C:\Hadoop-2.8.0\sbin" and type "start-all.cmd" to start Hadoop. You will see:
• Open in browser: http://localhost:8088
• Open http://localhost:50070
Our First Hadoop Program: Basic Wordcount
1. Download files

• Download MapReduceClient.jar
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar)

• Download input_file.txt
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/input_file.txt)

• Place both files in "C:/"
2. Start Hadoop

Open cmd in Administrative mode, move to "C:/Hadoop-2.8.0/sbin" and start the cluster:

start-all.cmd
3. Create an input directory in HDFS.
hadoop fs -mkdir /input_dir

4. Copy the input text file named input_file.txt into the input directory (input_dir) of HDFS.
hadoop fs -put C:/input_file.txt /input_dir

5. Verify that input_file.txt is available in the HDFS input directory (input_dir).
hadoop fs -ls /input_dir/

6. Verify the content of the copied file.
hadoop dfs -cat /input_dir/input_file.txt

7. Run MapReduceClient.jar and also provide the input and output directories.
hadoop jar C:/MapReduceClient.jar wordcount /input_dir /output_dir

8. Verify the content of the generated output file.
hadoop dfs -cat /output_dir/*
9. Some other useful commands
• To leave Safe mode:
hadoop dfsadmin –safemode leave

• To delete a file from an HDFS directory:
hadoop fs -rm -r /input_dir/input_file.txt

• To delete a directory from HDFS:
hadoop fs -rm -r /input_dir
Installing Hadoop on Linux

Step 1: Add a Hadoop system user using the command below:
sudo addgroup hadoop_

Step 2: sudo adduser --ingroup hadoop_ hduser_

Step 3
• Execute the command:
sudo adduser hduser_ sudo

Step 4
• Re-login as hduser_:
su hduser_

Step 5
• Configure SSH
• In order to manage nodes in a cluster, Hadoop requires SSH access.
• First, switch user; enter the following command:
su - hduser_

Step 6
• Type the following command:
ssh-keygen -t rsa -P ""

Step 7
• Enable SSH access to the local machine using this key:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 8
• Now test the SSH setup by connecting to localhost as the 'hduser' user:
ssh localhost

• Note: If you see an error in response to 'ssh localhost', there is a possibility that SSH is not available on this system. To resolve this, purge SSH using:
1. sudo apt-get purge openssh-server
2. Install SSH using the command:
sudo apt-get install openssh-server


Step 9
• Navigate to the directory containing the hadoop tar file.
• Enter:
sudo tar xzf hadoop-2.x.0.tar.gz
(Depending on the version you have)

Step 10
• Now, rename hadoop-2.x.0 as hadoop and change its ownership:

sudo mv hadoop-2.x.0 hadoop
sudo chown -R hduser_:hadoop_ hadoop
Step 11: Configure Hadoop
• Step 1) Modify the ~/.bashrc file:
vim ~/.bashrc

• Add the following lines to the end of file ~/.bashrc

#Set HADOOP_HOME
export HADOOP_HOME=<Installation Directory of Hadoop>
#Set JAVA_HOME
export JAVA_HOME=<Installation Directory of Java>
# Add bin/ directory of Hadoop to PATH
export PATH=$PATH:$HADOOP_HOME/bin
Step 12

• Now, source this environment configuration using the command below:

. ~/.bashrc
Step 13: Configurations related to HDFS
• Set JAVA_HOME inside the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh

• Note: You may need to log in as administrator and edit the file using the vim editor.
Step 14: core-site.xml

1. 'hadoop.tmp.dir' - Used to specify a directory which will be used by Hadoop to store its data files.
2. 'fs.default.name' - This specifies the default file system.

• To set these parameters, open core-site.xml:
sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml

• Copy the lines below in between the tags <configuration></configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>Parent directory for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. </description>
</property>
Step 15

• Navigate to the directory $HADOOP_HOME/etc/Hadoop
• Now, create the directory mentioned in core-site.xml:
sudo mkdir -p /app/hadoop/tmp

• Grant permissions to the directory:
sudo chown -R hduser_:hadoop_ /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Step 16: Map Reduce Configuration

• Let's set the HADOOP_HOME path:
sudo gedit /etc/profile.d/hadoop.sh
• Add:
export HADOOP_HOME=/home/guru99/Downloads/Hadoop

• Next enter:
sudo chmod +x /etc/profile.d/hadoop.sh

• Exit the Terminal and restart again.
• Log in as superuser: su hduser_
• Type echo $HADOOP_HOME to verify the path.

• Now copy the files:
sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
Step 17: mapred-site.xml

• Open the mapred-site.xml file:
sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

• Add the lines of settings below in between the tags <configuration> and </configuration>

<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
<description>MapReduce job tracker runs at this host and port.
</description>
</property>
Step 18: hdfs-site.xml

• Open $HADOOP_HOME/etc/hadoop/hdfs-site.xml as below:
sudo gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

• Add the lines of settings below between the tags <configuration> and </configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser_/hdfs</value>
</property>
Step 19

• Create the directory specified in the above setting:
sudo mkdir -p <Path of Directory used in above setting>
sudo mkdir -p /home/hduser_/hdfs

sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>
sudo chown -R hduser_:hadoop_ /home/hduser_/hdfs

sudo chmod 750 <Path of Directory created in above step>
sudo chmod 750 /home/hduser_/hdfs
Before we start Hadoop for the first time, format HDFS using the command below:

$HADOOP_HOME/bin/hdfs namenode -format

• Switch user: su - hduser_
• Start the Hadoop single node cluster using the command below:

$HADOOP_HOME/sbin/start-dfs.sh
Starting yarn

$HADOOP_HOME/sbin/start-yarn.sh

– Using the 'jps' tool/command, verify whether all the Hadoop related processes are running or not.

If Hadoop has started successfully then the output of jps should show NameNode, NodeManager, ResourceManager, SecondaryNameNode, DataNode.
Stopping Hadoop
• $HADOOP_HOME/sbin/stop-dfs.sh
• $HADOOP_HOME/sbin/stop-yarn.sh
Our first Hadoop program on Linux:
Salesmapper program
• Resources can be obtained at:
– https://www.guru99.com/create-your-first-hadoop-program.html
• Specific items can be downloaded at:
– https://drive.google.com/uc?export=download&id=0B_vqvT0ovzHcekp1WkVfUVNEdVE
• hduser_@mastara-Inspiron-5570 ~ $ cp /home/mastara/Downloads/Art.zip /home/hduser_
• hduser_@mastara-Inspiron-5570 ~ $ ls
Art.zip hdfs MapReduceTutorial 模板
• hduser_@mastara-Inspiron-5570 ~ $ unzip Art.zip -d MapReduceTutorial/
• hduser_@mastara-Inspiron-5570 ~ $ cd MapReduceTutorial/
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial $ ls
Article 6
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article 6 $ ls
desktop.ini SalesCountryReducer.java SalesMapper.java
SalesCountryDriver.java SalesJan2009.csv
• hduser_@mastara-Inspiron-5570 /home/mastara/hadoop $ export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.8.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.8.0.jar:~/MapReduceTutorial/SalesCountry/*:$HADOOP_HOME/lib/*"
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ ls
desktop.ini SalesCountryDriver.java SalesJan2009.csv
SalesCountry SalesCountryReducer.java SalesMapper.java

• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article/SalesCountry $ ls
SalesCountryDriver.class SalesCountryReducer.class SalesMapper.class

• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ jar cfm ProductSalePerCountry.jar SalesCountry/Manifest.txt SalesCountry/*.class
Check that the jar file is created: Product Sale per country
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article/SalesCountry $ ls
Manifest.txt SalesCountryDriver.class SalesMapper.class
ProductSalePerCountry.jar SalesCountryReducer.class

• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ hadoop fs -mkdir /input
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ hdfs dfs -copyFromLocal SalesJan2009.csv /input
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ $HADOOP_HOME/bin/hdfs dfs -ls /input
Found 1 items
-rw-r--r-- 1 hduser_ supergroup 123637 2019-09-05 10:08 /input/SalesJan2009.csv
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ $HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /input /mapreduce_output_sales
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ $HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000
Output:
• hduser_@mastara-Inspiron-5570 ~/MapReduceTutorial/Article $ $HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000

Argentina 1
Australia 38
Austria 7
Bahrain 1
Belgium 8
Bermuda 1
Brazil 5
Bulgaria 1
CO 1
Canada 76
Cayman Isls 1
China 1
Costa Rica 1
Country 1
Czech Republic 3
Denmark 15
Dominican Republic 1
Finland 2
France 27
Germany 25
Greece 1
Guatemala 1
Hong Kong 1
Hungary 3
Iceland 1
India 2
Ireland 49
Israel 1
Italy 15
Japan 2
Jersey 1
Kuwait 1
Latvia 1
Luxembourg 1
Malaysia 1
Malta 2
Mauritius 1
Moldova 1
Monaco 2
Netherlands 22
New Zealand 6
Norway 16
Philippines 2
Poland 2
Romania 1
Russia 1
South Africa 5
South Korea 1
Spain 12
Sweden 13
Switzerland 36
Thailand 2
The Bahamas 2
Turkey 6
Ukraine 1
United Arab Emirates 6
United Kingdom 100
United States 462

To view from browser:
• URL: http://localhost:50070/explorer.html#/
• Go to “Utilities”

Click on mapreduce_output_sales
Click on part-00000
Hadoop 2.7.3 Multinode Installation
Step 1 - Check if the nodes are reachable

• Try to ping the other nodes from each of them.
• Type ifconfig to get the IP address of a node.
Step 2
• Change the hostname of the system (if they all have the same hostname)
» To confirm: Type hostname
– sudo vim /etc/hostname
– Delete what is written and type the new name as needed, e.g. master on the master node.
Step 3
• Update the host file on all the 3 nodes.

sudo vim /etc/hosts

• 127.0.0.1 localhost
• #127.0.1.1 master #remove this line
• 192.168.220.185 master #Added this and below 2 lines
• 192.168.220.186 slave1
• 192.168.220.185 slave2
Step 4

• Reboot the machine.

Step 5

• Confirm the hostname of all the 3 nodes:
– Type hostname
• Ping each system by using the hostname.
Step 6
• Test ssh connectivity
• From master: type
– ssh master
– ssh slave1 (If you create a folder it will be created in the
slave not in the master)
– ssh slave2
Step 7
• Update core-site.xml (All the 3 nodes)
– sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml

• 2 changes:
a) Remove the hadoop.tmp.dir configuration. We don't require it.
b) Change localhost to master

#Remove this
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories</description>
</property>

#Put this
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
Step 8
Update hdfs-site.xml(Master + all slave nodes)

3 changes:
a.Replication is set to 2
b.Namenode configured only in master
c.Datanode configured only in slave

sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml


• <property>
<name>dfs.replication</name>
<value>2</value> <!-- In all nodes - because we have 2 data nodes -->
</property>

<!--Keep this entry in master only, and delete from slaves -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>

<!--Remove this entry from master, and keep this entry in slaves -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>
Step 9
Update yarn-site.xml (All 3 nodes)

sudo vim /usr/local/hadoop/etc/hadoop/yarn-site.xml

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8050</value>
</property>

(Add these entries without deleting anything)


Step 10
Update mapred-site.xml (All 3 nodes)

• sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml

<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value> <!-- Pointing jobhistory to master -->
<description> Host and port for Job history server (default 0.0.0.0:10020)</description>
</property>

• Replace what is there


Step 11
Master node

• If you see any entry related to localhost, feel free to delete it.

sudo vim /usr/local/hadoop/etc/hadoop/slaves

localhost
slave1
slave2

• sudo vim /usr/local/hadoop/etc/hadoop/masters

– master
Step 12
• Recreate the namenode folder (Master only)

• sudo rm -rf /usr/local/hadoop_tmp
• sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
• sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/
• sudo chmod 777 /usr/local/hadoop_tmp/hdfs/namenode
Step 13
• Recreate datanode folder(All slave nodes only)

• sudo rm -rf /usr/local/hadoop_tmp


• sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
• sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/
• sudo chmod 777 /usr/local/hadoop_tmp/hdfs/datanode
Step 14
• Format the namenode (Master only)

• Before starting the cluster we need to format the namenode, using the following command on the master only:

• hdfs namenode -format
Step 15
• Start the DFS & Yarn (Master only)

• start-dfs.sh
• start-yarn.sh

• OR

• start-dfs.sh && start-yarn.sh

• OR

• start-all.sh
Step 16
• jps (On master node)
• 3379 Namenode #because of start-dfs.sh
• 3175 SecondaryNameNode #because of start-dfs.sh
• 3539 ResourceManager #because of start-yarn.sh

• jps on slave nodes(Slave 1 and slave 2)


• hduser@slave1$jps
• 2484 DataNode #because of start-dfs.sh
• 2607 NodeManager #because of start-yarn.sh
Step 17
• Review Yarn console

• In browser, put the following url: http://master:8088/cluster/nodes
• Or http://master:50070

• Or hdfs dfsadmin -report
Example class programs
• First program:
– Given a text file, generate a list of words with the number of times each of them appears in the file.
• Second program:
– Given a text file containing numbers, one per line, count the sum of squares of odd, even and prime numbers (a sketch follows this list).
– e.g. Expected output: <odd, 302>, <even, 278>, <prime, 323>.
• Third program
– Given a file containing sales data, quantify the sales based on region / country.
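• For the second program, a hedged sketch (class names and the treatment of primes are my own choices, not specified in the slides): the mapper classifies each number and emits its square under the key "odd" or "even", and additionally under "prime" when it is prime; the reducer sums the squares per key.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class SquareSumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String trimmed = line.toString().trim();
        if (trimmed.isEmpty()) {
            return;                                      // skip blank lines
        }
        long n = Long.parseLong(trimmed);
        LongWritable square = new LongWritable(n * n);
        context.write(new Text(n % 2 == 0 ? "even" : "odd"), square);
        if (isPrime(n)) {
            context.write(new Text("prime"), square);    // a prime also counts as odd/even above
        }
    }

    private static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }
}

class SquareSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text category, Iterable<LongWritable> squares, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable s : squares) {
            sum += s.get();
        }
        context.write(category, new LongWritable(sum));  // e.g. <odd, 302>
    }
}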
Setting up a cluster on Linux nodes
• This information is supposed to be supplied on each node
that is to join the cluster.
• Open the file /etc/hosts
• Add information about the slaves and masters in that file:

e.g
– 10.160.9.20 master
– 10.160.9.22 slave1
– 10.160.9.21 slave2
Questions
