AIM
To set up a multinode Hadoop cluster.
THEORY
Hadoop is an open-source framework that allows the storage and processing of
big data in a distributed environment across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or a tool, rather it has become a
complete subject, which involves various tools, techniques and frameworks.
Hadoop runs applications using the MapReduce algorithm, in which the data is
processed in parallel across the nodes of the cluster. In short, Hadoop is used
to develop applications that can perform complete statistical analysis on huge
amounts of data.
Hadoop Architecture
At its core, Hadoop has two major layers namely –
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System)
MapReduce: MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large amounts of data
(multiterabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. The MapReduce program runs on
Hadoop which is an Apache open-source framework.
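The map/shuffle/reduce flow can be pictured with a classic shell-pipeline analogy (an illustration only, not actual Hadoop code): the "map" step emits one word per line, the sort acts as the shuffle that groups identical keys together, and the "reduce" step counts each group.

```shell
# MapReduce word count as a pipeline analogy (not Hadoop itself).
counts=$(echo 'big data big cluster' |
  tr ' ' '\n' |   # map: emit one (word) record per line
  sort |          # shuffle: bring identical keys together
  uniq -c |       # reduce: count each group of identical keys
  awk '{print $2 "=" $1}')
echo "$counts"
```

Running this prints one `word=count` line per distinct word, which is exactly the shape of a word-count MapReduce job's output.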
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However,
the differences from other distributed file systems are significant. It is highly
fault-tolerant and is designed to be deployed on low-cost hardware. It provides
high throughput access to application data and is suitable for applications having
large datasets.
Apart from the above-mentioned two core components, Hadoop framework also
includes the following two modules −
• Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
NameNode and DataNodes
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients. In addition, there
are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. HDFS exposes a file system
namespace and allows user data to be stored in files. Internally, a file is split into
one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to
DataNodes. The DataNodes are responsible for serving read and write requests
from the file system’s clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode.
NameNode
- It is a single master server that exists in the HDFS cluster.
- As it is a single node, it can become a single point of failure.
- It manages the file system namespace by executing operations such as
opening, renaming and closing files.
- It simplifies the architecture of the system.
DataNode
- The HDFS cluster contains multiple DataNodes.
- Each DataNode contains multiple data blocks.
- These data blocks are used to store data.
- It is the responsibility of the DataNodes to serve read and write requests
from the file system's clients.
- They perform block creation, deletion, and replication upon instruction from
the NameNode.
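The NameNode's block-to-DataNode mapping can be pictured with a toy round-robin placement. This is a deliberate simplification (the DataNode names and counts are made up); real HDFS uses a rack-aware placement policy chosen by the NameNode.

```shell
# Toy illustration only: spread 3 replicas of each of 3 blocks across
# 4 DataNodes round-robin. Real HDFS placement is rack-aware.
datanodes=(dn1 dn2 dn3 dn4)
replication=3
for block in 0 1 2; do
  replicas=()
  for ((r = 0; r < replication; r++)); do
    # pick the next DataNode, wrapping around the list
    replicas+=("${datanodes[(block + r) % ${#datanodes[@]}]}")
  done
  echo "block$block -> ${replicas[*]}"
done
```

Each block ends up on `replication` distinct DataNodes, so losing any single DataNode never loses a block — the property the NameNode's mapping is designed to guarantee.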
HDFS
In HDFS, data is distributed over several machines and replicated to ensure
durability in the face of failures and high availability to parallel
applications.
It is cost-effective because it uses commodity hardware. It is built on the
concepts of blocks, DataNodes and NameNodes.
Hadoop Tasks
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into
uniform-sized blocks of 128 MB or 64 MB (128 MB preferred).
• These files are then distributed across various cluster nodes for further
processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
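The block-splitting step above comes down to ceiling division (the 128 MB block size is the commonly used default; the 300 MB file size here is just an example):

```shell
# A file needs ceil(size / block_size) blocks; only the last block is
# allowed to be smaller than the configured block size.
file_mb=300
block_mb=128
blocks=$(( (file_mb + block_mb - 1) / block_mb ))  # ceiling division
echo "$blocks blocks"
```

So a 300 MB file occupies three 128 MB blocks: two full blocks and one 44 MB remainder block.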
IMPLEMENTATION
1) Installing SSH on slave
11) Setting up passphraseless SSH and copying keys between master and slave
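A sketch of the passphraseless-SSH step: generate a key pair with an empty passphrase (`-N ""`), then copy the public key to each slave. A temp directory is used here so the sketch is safe to run anywhere; on a real master you would use ~/.ssh, and the hostname/username below are placeholders.

```shell
# Generate a passphraseless RSA key pair into a scratch directory.
keydir=$(mktemp -d)                 # stand-in for ~/.ssh on the master
ssh-keygen -q -t rsa -N "" -f "$keydir/id_rsa"
ls "$keydir"                        # id_rsa and id_rsa.pub
# On the real cluster, install the public key on each slave
# (hostname/user are placeholders):
# ssh-copy-id -i "$keydir/id_rsa.pub" hduser@slave1
```

After this, the master can SSH to each slave without a password prompt, which the Hadoop start scripts rely on.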
12) Editing core-site.xml in master and slave
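For the core-site.xml step, a minimal sketch of the file's contents is shown below. The hostname "master" and port 9000 are assumptions for this sketch; the file is written to a temp path here, whereas on a real node it lives under Hadoop's configuration directory (e.g. $HADOOP_HOME/etc/hadoop/core-site.xml) on the master and every slave.

```shell
# Illustrative core-site.xml: point every node at the NameNode's filesystem
# URI. "master:9000" is an assumed hostname/port for this sketch.
core_conf=$(mktemp)
cat > "$core_conf" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF
cat "$core_conf"
```

The key point is that master and slaves share the same `fs.defaultFS` value, so every node resolves HDFS paths against the same NameNode.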
13) Editing hdfs-site.xml in master and slave
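Likewise for hdfs-site.xml, a minimal sketch follows. The replication factor of 2 and the storage paths are assumptions for a small two-slave cluster; again a temp file stands in for the real $HADOOP_HOME/etc/hadoop/hdfs-site.xml on each node.

```shell
# Illustrative hdfs-site.xml: replication factor and local storage
# directories for NameNode metadata and DataNode blocks (paths assumed).
hdfs_conf=$(mktemp)
cat > "$hdfs_conf" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/datanode</value>
  </property>
</configuration>
EOF
cat "$hdfs_conf"
```

`dfs.replication` should not exceed the number of DataNodes, which is why 2 is a sensible choice for a two-slave cluster.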