
Unit - II

Hadoop Cluster
• A Hadoop cluster is a collection of computers, known as nodes,
that are networked together to perform parallel computations on
big data sets.
• At a high level, a computer cluster is a group of two or more
computers, or nodes, that run in parallel to achieve a
common goal. This allows workloads consisting of a high
number of individual, parallelizable tasks to be distributed
among the nodes in the cluster.
Command Line Interface:
The HDFS command-line interface (CLI), available on Windows as well as on Unix-like systems, is a powerful tool that allows you to manage files and directories in HDFS from the command line.

These commands work much like the corresponding Unix commands, but you need to use the 'hdfs dfs' prefix before each command to indicate that you are working with HDFS. You can also use the -help option with any command to display its usage information.
1.hdfs dfs -ls: Lists the contents of a directory in HDFS.
2.hdfs dfs -mkdir: Creates a new directory in HDFS.
3.hdfs dfs -put: Copies a file or directory from the local file system to HDFS.
4.hdfs dfs -get: Copies a file or directory from HDFS to the local file system.
5.hdfs dfs -rm: Deletes a file or directory in HDFS.
6.hdfs dfs -cat: Displays the contents of a file in HDFS.
7.hdfs dfs -du: Displays the disk usage of a file or directory in HDFS.
8.hdfs dfs -chmod: Changes the permissions of a file or directory in HDFS.
9.hdfs dfs -chown: Changes the owner of a file or directory in HDFS.
10.hdfs dfs -chgrp: Changes the group ownership of a file or directory in HDFS.
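A short example session (the local file name and the HDFS paths used here, such as /user/student/data, are only illustrative):

hdfs dfs -mkdir -p /user/student/data
hdfs dfs -put localfile.txt /user/student/data
hdfs dfs -ls /user/student/data
hdfs dfs -cat /user/student/data/localfile.txt
hdfs dfs -du -h /user/student/data
hdfs dfs -rm /user/student/data/localfile.txt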
Hadoop Distributed File Systems:
Hadoop Distributed File System (HDFS) is the default file system used
by Hadoop. It is designed to store and manage large amounts of data
across multiple nodes in a cluster. HDFS is fault-tolerant and highly
available, which makes it ideal for storing the large datasets used for
big data processing. Besides HDFS, Hadoop can also work with several
other file systems:
1. Local file system: This is the file system that is available on the local
disk of each node in the Hadoop cluster. It is not a distributed file
system, so it does not provide the scalability and fault tolerance of
HDFS. However, it can be used for small-scale Hadoop deployments.
2. Amazon S3: Amazon Simple Storage Service (S3) is a cloud-based
object storage service provided by Amazon Web Services (AWS).
Hadoop can be configured to use S3 as a file system, allowing you to
store and process data in S3 buckets.
3. Azure Data Lake Storage: Azure Data Lake Storage is a cloud-
based storage service provided by Microsoft Azure. It is optimized
for big data analytics and can be used as a file system for Hadoop.
4. GlusterFS: GlusterFS is an open-source distributed file system
that can be used as an alternative to HDFS. It provides scalability
and fault tolerance, and can be used with Hadoop for storing and
processing data.
5.MapR-FS: MapR-FS is a proprietary distributed file system that
is used by the MapR distribution of Hadoop. It provides scalability
and fault tolerance, and can be used for storing and processing
data in Hadoop.
The Java Interface:
The Hadoop Java API consists of several classes and interfaces, including:
FileSystem: This is the main class used to interact with HDFS. It provides methods for creating,
deleting, and renaming files and directories, as well as for reading and writing data.
Path: This class represents a path or location in HDFS. It provides methods for creating new
paths and resolving relative paths.
Configuration: This class contains the configuration settings for a Hadoop cluster. It provides
methods for setting and retrieving configuration properties, such as the location of the HDFS
namenode and the block size of files.
FSDataInputStream and FSDataOutputStream: These classes are used for reading and writing
data to files in HDFS.
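A minimal sketch that ties these classes together, assuming a reachable HDFS cluster; the class name and the path /user/student/example.txt are only illustrative.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the default file system (HDFS)

        Path file = new Path("/user/student/example.txt");   // illustrative path

        // Write a small file to HDFS (overwriting it if it already exists)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy its contents to standard output
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}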
Data Flow:
Hadoop Distributed File System (HDFS) is a distributed file system
that provides reliable and scalable storage for big data. In HDFS,
data is stored in the form of blocks and is distributed across
multiple nodes in the cluster. The data flow in HDFS can be
explained as follows:
Data Ingestion: The first step in the data flow in HDFS is to ingest
data into the system. This can be done through various sources
such as file systems, databases, and streaming data sources.
Data Partitioning: The data is then partitioned into smaller blocks
and is distributed across multiple nodes in the HDFS cluster. The
default block size in HDFS is 128 MB, but it can be changed based
on the requirements of the application.
Data Replication: To ensure data reliability, HDFS replicates each
block multiple times across the cluster. By default, HDFS replicates
each block three times, but this can also be configured based on the
requirements of the application.
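The block size and replication factor are ordinary Hadoop configuration properties (dfs.blocksize and dfs.replication). As a small, purely illustrative sketch, they could be overridden programmatically before writing files (reusing the Configuration and FileSystem classes from the Java interface above):

Configuration conf = new Configuration();
conf.set("dfs.blocksize", "268435456");   // 256 MB blocks, value given in bytes
conf.set("dfs.replication", "2");         // keep two replicas of each block
FileSystem fs = FileSystem.get(conf);     // files written through fs use these settings

In practice, cluster-wide defaults are usually set in hdfs-site.xml instead.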

Data Processing: Once the data is ingested and stored in HDFS, it can
be processed using various tools such as MapReduce, Spark, and
Hive. These tools can be used to perform various operations on the
data such as filtering, sorting, aggregating, and joining.

Data Retrieval: Finally, the processed data can be retrieved from
HDFS and used for various applications such as machine learning,
analytics, and reporting.
MapReduce:
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and
processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from
multiple servers to return a consolidated output back to the application.
MapReduce example: word count
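As a concrete illustration, here is a minimal sketch of a word-count mapper and reducer written against the Hadoop Java API (the class names are only illustrative, and each class would normally live in its own .java file). The driver that wires them together appears after the combiner discussion below.

// WordCountMapper.java — emits (word, 1) for every word in each input line
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);     // one (word, 1) pair per occurrence
        }
    }
}

// WordCountReducer.java — sums the counts emitted for each word
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, total) pair
    }
}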
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and to process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process is also known as a Mapper.
Without using a Combiner class, every (key, value) pair produced by the mappers is shuffled to the reducers; in this small example that is only 29 (key, value) pairs, but in real workloads it can easily be lakhs of pairs. Using a Combiner, the mapper output is aggregated locally before the shuffle, which greatly reduces the amount of data transferred across the network. The driver sketch below shows how a combiner is plugged in.
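A minimal driver sketch showing how the combiner is plugged in. Because summing counts is both associative and commutative, the same reducer class can safely be reused as the combiner; the class and path names are the illustrative ones from the sketches above.

// WordCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}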
Partitioner Phase
• A partitioner works like a condition in processing an input
dataset. The partition phase takes place after the Map phase
and before the Reduce phase.
• The number of partitions is equal to the number of reducers. The
partitioner decides which partition each intermediate (key, value)
pair belongs to, so the mapper output is divided according to the
number of reduce tasks.
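A minimal sketch of a custom partitioner for the word-count example, assuming (purely for illustration) that words starting with 'a'–'m' should go to one reducer and all other words to a second reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (word, count) pair to a partition based on the word's first letter.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
            return 0;                                   // a single reducer receives everything
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;  // two partitions for two reducers
    }
}

It would be enabled in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).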
• Difference between Hadoop and HDFS

A core difference between Hadoop and HDFS is that Hadoop is the
open-source framework that can store, process and analyze data,
while HDFS is the file system of Hadoop that provides access to
data. This essentially means that HDFS is a module of Hadoop.
• Difference between HDFS and MapReduce

• HDFS and MapReduce form the two major pillars of the Apache
Hadoop framework. HDFS is more of an infrastructural component,
whereas MapReduce is more of a computational framework.
Hadoop Daemons
• Hadoop Daemons are a set of processes that run on
Hadoop. Hadoop is a framework written in Java, so all these
processes are Java Processes. Apache Hadoop consists of the
following Daemons:

• NameNode, DataNode, Secondary NameNode, Balancer,
Checkpoint Node, Journal Node
• NameNode: NameNode is the central daemon that manages the file system
namespace and regulates access to files by clients.

• DataNode: The DataNode daemon runs on each machine in the cluster that
stores data blocks for the HDFS. It is responsible for storing and retrieving
the data blocks and reporting their status to the NameNode.

• Secondary NameNode: The Secondary NameNode daemon periodically
merges the namespace and transaction logs from the NameNode to create a
new checkpoint file. This helps in reducing the recovery time of the
NameNode in case of a failure.
• Balancer: The Balancer daemon is responsible for balancing the data across
the DataNodes in the cluster. It periodically checks the data distribution and
moves data blocks from overloaded DataNodes to underloaded DataNodes.

• CheckpointNode: The CheckpointNode daemon is responsible for creating
periodic checkpoints of the HDFS metadata on a separate machine. This
provides an additional backup of the metadata in case of a failure.

• JournalNode: The JournalNode daemon is responsible for managing a
quorum of nodes that participate in maintaining the HDFS transaction log. It
provides high availability for the NameNode's metadata by ensuring that the
transaction log is always available.
MapReduce Daemons
• JobTracker: The JobTracker is the central daemon that manages the MapReduce
jobs submitted to the cluster. It is responsible for scheduling tasks, monitoring
progress, and re-executing failed tasks.

• TaskTracker: The TaskTracker daemon runs on each machine in the cluster and
executes the individual MapReduce tasks assigned to it by the JobTracker. It
communicates with the JobTracker to report task status and request new tasks.
• JobHistoryServer: The JobHistoryServer daemon is responsible for storing and
providing access to the historical data and logs of completed MapReduce jobs.
• TaskController: The TaskController daemon is responsible for launching and
managing the user-defined tasks in a secure environment.

• ShuffleHandler: The ShuffleHandler daemon is responsible for transferring the
output of the Map tasks to the input of the Reduce tasks.
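Because all of these daemons are Java processes, the JDK's jps tool lists the ones running on a node. The output below is only illustrative; the actual process IDs and the set of daemons depend on the node's role and the Hadoop version:

$ jps
2145 NameNode
2290 DataNode
2433 SecondaryNameNode
2587 JobTracker
2731 TaskTracker
2850 Jps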
The Configuration API:
The Configuration API is essentially a key-value store that
is used to set various properties that govern the behavior
of the MapReduce job. Some examples of properties that
can be set using the Configuration API include:
• Input and output paths for the job
• The number of reducers to use
• The name of the job
• Any additional command-line arguments to pass to the
job
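A minimal sketch of the Configuration API used as a key-value store; the property values chosen here are only illustrative:

import org.apache.hadoop.conf.Configuration;

public class ConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();                 // loads core-site.xml etc. from the classpath
        conf.set("mapreduce.job.name", "word count");             // name of the job
        conf.setInt("mapreduce.job.reduces", 4);                  // number of reducers to use
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output

        // Properties can be read back, with a default for keys that are not set
        System.out.println(conf.get("mapreduce.job.name"));
        System.out.println(conf.getInt("mapreduce.job.reduces", 1));
        System.out.println(conf.get("fs.defaultFS", "file:///")); // address of the default file system (namenode)
    }
}

Input and output paths and other job-level settings are usually applied through the Job object, as shown in the word-count driver above.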
Setting up the development environment for MapReduce:
It requires a few steps to ensure that you have all the
necessary tools and software installed.
1. Install Java – a suitable Java version from the Oracle website
2. Install Hadoop – from the Apache Hadoop website
3. Set up the environment variables (see the example after this list)
4. Install an IDE – Eclipse, IntelliJ IDEA or NetBeans
5. Configure IDE
6. Start Coding
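For step 3, the environment variables usually include JAVA_HOME, HADOOP_HOME and an extended PATH; the paths below are only illustrative (on Linux/macOS they would typically go into ~/.bashrc):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin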
Running a MapReduce job on test data locally:
1. Create a sample input file
2. Set up the input and output directories
3. Copy the sample input file to the input directory
4. Write the MapReduce code
5. Set up the job configuration
6. Run the job
7. Verify the output
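A typical command sequence for such a local test run, assuming the word-count classes shown earlier have been packaged into a jar named wc.jar and that Hadoop is in its default standalone (local) mode; all file and directory names are illustrative:

mkdir input
echo "hello hadoop hello world" > input/sample.txt
hadoop jar wc.jar WordCountDriver input output
cat output/part-r-00000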
Running a MapReduce job on a cluster involves the following steps:
1.Input data is divided into small chunks and distributed across the worker
nodes in the cluster.
2.The master node assigns a Map task to each worker node, which processes
the data and produces intermediate key-value pairs.
3.The intermediate key-value pairs are then sorted and partitioned by the
master node.
4.The master node assigns a Reduce task to each worker node, which
combines the intermediate key-value pairs and produces the final output.
5.The final output is then collected and combined by the master node.
