• In 2002, Doug Cutting and Mike Cafarella started work on a project, Apache Nutch. It is an open-source web crawler software project.
• While working on Apache Nutch, they were dealing with big data.
• In 2003, Google introduced a file system known as GFS (Google File System). It is a distributed file system developed to provide efficient access to data.
• In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also includes MapReduce.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
• Doug Cutting named his project Hadoop after his son's toy elephant.
• In 2007, Yahoo ran two clusters of 1000 machines.
• In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
• In 2013, Hadoop 2.2 was released. In 2017, Hadoop 3.0 was released.
• The latest version of Hadoop is 3.3.6, which was released on June 23, 2023.
Distributed processing means that a specific task can be broken up into functions, and the functions are dispersed across two or more interconnected processors. A distributed application is an application whose component programs are distributed between two or more interconnected processors.
More often, however, distributed processing refers to local-area networks (LANs) designed
so that a single program can run simultaneously at various sites. Most distributed processing
systems contain sophisticated software that detects idle CPUs on the network and parcels out
programs to utilize them.
HDFS:
HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled after
the Google File System (GFS) paper. HDFS is optimized for high throughput and works best
when reading and writing large files (gigabytes and larger). To support this throughput HDFS
leverages unusually large (for a filesystem) block sizes and data locality optimizations to
reduce network input/output (I/O).
Scalability and availability are also key traits of HDFS, achieved in part through data
replication and fault tolerance. HDFS replicates each file a configurable number of times, is
tolerant of both software and hardware failure, and automatically re-replicates data blocks from
nodes that have failed.
Figure 1.2. HDFS architecture shows an HDFS client communicating with the master
NameNode and slave DataNodes.
Figure 1.2 shows a logical representation of the components in HDFS: the NameNode and
the DataNode. It also shows an application that’s using the Hadoop filesystem library to
access HDFS.
MAPREDUCE
MapReduce is a batch-based, distributed computing framework modeled after Google’s
paper on MapReduce. It allows you to parallelize work over a large amount of raw data,
such as combining web logs with relational data from an OLTP database to model how
users interact with your website. This type of work, which could take days or longer using
conventional serial programming techniques, can be reduced to minutes using
MapReduce on a Hadoop cluster.
The MapReduce model simplifies parallel processing by abstracting away the complexities
involved in working with distributed systems, such as computational parallelization, work
distribution, and dealing with unreliable hardware and software. With this abstraction,
MapReduce allows the programmer to focus on addressing business needs, rather than
getting tangled up in distributed system complications.
Figure 1.3. A client submitting a job to MapReduce
MapReduce decomposes work submitted by a client into small parallelized map and reduce
tasks, as shown in Figure 1.3. The map and reduce constructs used in MapReduce
are borrowed from those found in the Lisp functional programming language, and use a
shared-nothing model to remove any parallel execution interdependencies that could add
unwanted synchronization points or state sharing.
The role of the programmer is to define map and reduce functions, where the map function
outputs key/value tuples, which are processed by reduce functions to produce the final output.
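The flow above can be sketched in a few lines of Python. This is a toy, single-process simulation of the programming model, not Hadoop's Java API: `map_fn` emits key/value tuples, an in-memory sort stands in for the framework's shuffle, and `reduce_fn` folds each key's values into the final output.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    """Map: emit a (word, 1) tuple for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word to produce the final tally."""
    yield word, sum(counts)

def run_job(lines):
    # Map phase: apply map_fn to every input record.
    mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    # Shuffle phase: group the tuples by key, as the framework would.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(
        kv
        for key, group in groupby(mapped, key=itemgetter(0))
        for kv in reduce_fn(key, (v for _, v in group))
    )

result = run_job(["the quick brown fox", "the lazy dog"])
print(result)  # "the" maps to 2, every other word to 1
```

Note how each `map_fn` call and each `reduce_fn` call touches only its own inputs; that is the shared-nothing property that lets the real framework run them on different machines.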
Though one can run several DataNodes on a single machine, in the practical
world these DataNodes are spread across various machines.
NameNode:
NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains
and manages the blocks present on the DataNodes (slave nodes).
NameNode is a very highly available server that manages the File System Namespace
and controls access to files by clients.
The HDFS architecture is built in such a way that the user data never resides on the
NameNode. The data resides on DataNodes only.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. the location of blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
FsImage: It contains the complete state of the file system namespace since the start of
the NameNode.
EditLogs: It contains all the recent modifications made to the file system with respect
to the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located.
The NameNode is also responsible to take care of the replication factor of all the
blocks which we will discuss in detail later in this HDFS tutorial blog.
In case of DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage, and manages the communication traffic to the
DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, an inexpensive system which is not of high quality or high availability.
Functions of DataNode:
These are slave daemons or processes that run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file system’s
clients.
They send heartbeats to the NameNode periodically to report the overall health of HDFS;
by default, this frequency is set to 3 seconds.
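The heartbeat mechanism can be modeled with a small Python sketch. This is an illustration, not Hadoop's implementation: the class and method names are invented, and in real HDFS the NameNode marks a DataNode dead only after roughly ten minutes of silence, not after a single missed 3-second heartbeat.

```python
import time

class HeartbeatMonitor:
    """Toy model of how a NameNode could track DataNode liveness."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a node is considered dead
        self.clock = clock       # injectable clock, handy for testing
        self.last_seen = {}      # DataNode id -> time of its last heartbeat

    def heartbeat(self, node_id):
        """Record a heartbeat from a DataNode (sent every ~3 s by default)."""
        self.last_seen[node_id] = self.clock()

    def live_nodes(self):
        """Return the DataNodes that have reported recently enough."""
        now = self.clock()
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}
```

Injecting the clock keeps the sketch deterministic: a test can advance a fake clock past the timeout and observe a node drop out of `live_nodes()`.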
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode.
The Secondary NameNode works concurrently with the primary NameNode as a helper
daemon.
It is responsible for combining the EditLogs with FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular intervals and applies them to the
FsImage. The new FsImage is copied back to the NameNode, and is used the next time
the NameNode is started.
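A minimal Python sketch of that checkpoint step, with invented operation names (real EditLog records are more varied): replaying the logged operations on top of the old FsImage yields the new one.

```python
def checkpoint(fsimage, edit_log):
    """Toy sketch of the Secondary NameNode checkpoint: replay the EditLog
    entries on top of the last FsImage to produce a fresh FsImage."""
    image = dict(fsimage)          # namespace snapshot: path -> metadata
    for op, path, meta in edit_log:
        if op == "create":
            image[path] = meta     # a file entered the namespace after the snapshot
        elif op == "delete":
            image.pop(path, None)  # a file was removed after the snapshot
    return image

old_image = {"/a": {"size": 1}}
edits = [("create", "/b", {"size": 2}), ("delete", "/a", None)]
print(checkpoint(old_image, edits))  # only /b survives
```

Folding the log into the image this way is what keeps NameNode restarts fast: the node loads one compact FsImage instead of replaying an ever-growing EditLog.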
Blocks:
Now, as we know, the data in HDFS is scattered across the DataNodes as blocks.
Let’s have a look at what a block is and how it is formed.
Blocks are nothing but the smallest continuous locations on your hard drive where
data is stored. In general, in any file system, you store data as a collection
of blocks.
Similarly, HDFS stores each file as blocks which are scattered throughout the Apache
Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64
MB in Apache Hadoop 1.x) which you can configure as per your requirement.
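For instance, the block layout of a file follows directly from the file size and the configured block size; the Python sketch below computes it for the 128 MB default. Note that the last block occupies only as much space as the data it actually holds.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes of the HDFS blocks a file of file_size bytes
    would occupy, using the Hadoop 2.x default block size of 128 MB."""
    full, last = divmod(file_size, block_size)
    # `full` whole blocks, plus one partial block if there is a remainder.
    return [block_size] * full + ([last] if last else [])

MB = 1024 * 1024
print(split_into_blocks(300 * MB))  # three blocks: 128 MB, 128 MB, and 44 MB
```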
Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance.
The default replication factor is 3, which is again configurable. So, as you can see in
the figure below, each block is replicated three times and stored on different
DataNodes (considering the default replication factor):
Rack Awareness:
The NameNode also ensures that all the replicas of a block are not stored on the same
(single) rack.
Considering a replication factor of 3, the Rack Awareness Algorithm says that the
first replica of a block will be stored on the local rack and the next two replicas will be
stored on a different (remote) rack, each on a different DataNode within that remote
rack, as shown in the figure above.
If you have more replicas, the remaining replicas will be placed on random
DataNodes, provided that no more than two replicas reside on the same rack, if possible.
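The three-replica placement rule can be sketched as follows in Python. This is an illustrative model (rack and node names are invented), not Hadoop's actual BlockPlacementPolicy code.

```python
import random

def place_replicas(racks, local_rack, rng=random):
    """Sketch of the default rack-aware placement for replication factor 3:
    first replica on the writer's (local) rack, the other two on two
    different DataNodes of a single remote rack.
    `racks` maps a rack id to its list of DataNode ids."""
    first = rng.choice(racks[local_rack])                     # replica 1: local rack
    remote = rng.choice([r for r in racks if r != local_rack])  # pick one remote rack
    second, third = rng.sample(racks[remote], 2)              # replicas 2 and 3: distinct nodes there
    return [first, second, third]
```

This layout survives the loss of a whole rack (one replica is always elsewhere) while keeping write traffic between racks low, since two of the three replicas share a rack.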
The NameNode will then grant the client the write permission and will provide the IP
addresses of the DataNodes where the file blocks will be copied eventually.
Let’s say the replication factor is set to the default, i.e. 3. Therefore, for each block the
NameNode will provide the client a list of (3) IP addresses of DataNodes. The
list will be unique for each block.
Suppose, the NameNode provided following lists of IP addresses to the client:
For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
Each block will be copied to three different DataNodes to keep the replication
factor consistent throughout the cluster.
Now the whole data copy process will happen in three stages:
Set up of Pipeline
Data streaming and replication
Shutdown of Pipeline (Acknowledgement stage)
The NameNode will return the list of DataNodes where each block (Block A and B) are
stored.
After that, the client will connect to the DataNodes where the blocks are stored.
The client starts reading data in parallel from the DataNodes (Block A from DataNode 1
and Block B from DataNode 3).
Once the client gets all the required file blocks, it will combine these blocks to form a
file.
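The read flow can be summarized with a small Python mock-up. All names here (the `namenode` and `datanodes` dictionaries, the block ids) are invented for the illustration, and the blocks are fetched sequentially rather than in parallel to keep the sketch short.

```python
def read_file(block_ids, namenode, datanodes):
    """Toy sketch of the HDFS read path: ask the (mock) NameNode which
    DataNodes hold each block, fetch every block from one replica, then
    concatenate the blocks in order to rebuild the file."""
    parts = []
    for block_id in block_ids:
        locations = namenode[block_id]   # DataNodes holding replicas of this block
        node = locations[0]              # pick one replica, e.g. the closest
        parts.append(datanodes[node][block_id])
    return b"".join(parts)

# Mock cluster state: block locations on the NameNode, block data on DataNodes.
namenode = {"blockA": ["dn1", "dn4"], "blockB": ["dn3", "dn7"]}
datanodes = {
    "dn1": {"blockA": b"Hello, "},
    "dn3": {"blockB": b"HDFS!"},
}
print(read_file(["blockA", "blockB"], namenode, datanodes))  # b'Hello, HDFS!'
```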
1.6. Hadoop Master – Slave Architecture
– The Master (NameNode) manages the file system namespace operations like opening,
closing, and renaming files and directories, determines the mapping of blocks to
DataNodes, and regulates access to files by clients.
– Slaves (DataNodes) are responsible for serving read and write requests from the file
system’s clients, and perform block creation, deletion, and replication upon instruction
from the Master (NameNode).
Master: JobTracker
Slaves: TaskTracker
– The Master {JobTracker} is the point of interaction between users and the map/reduce
framework. When a map/reduce job is submitted, the JobTracker puts it in a queue of pending jobs,
executes them on a first-come/first-served basis, and manages the assignment of map
and reduce tasks to the TaskTrackers.
– Slaves {TaskTracker} execute tasks upon instruction from the Master {JobTracker} and also
handle data motion between the map and reduce phases.
The following is the master-slave architecture in which the NameNode and JobTrackers are
masters and the DataNodes and TaskTrackers are slaves.
NameNode
Hadoop employs a master/slave architecture for both distributed storage and distributed
computation. The distributed storage system is called the Hadoop File System, or HDFS.
The NameNode is the master of HDFS that directs the slave DataNode daemons to perform
the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how
your files are broken down into file blocks, which nodes store those blocks, and the overall
health of the distributed filesystem.
The function of the NameNode is memory and I/O intensive. As such, the server hosting the
NameNode typically doesn’t store any user data or perform any computations for a
MapReduce program to lower the workload on the machine. This means that the NameNode
server doesn’t double as a DataNode or a TaskTracker.
DataNode:
Each slave machine in your cluster will host a DataNode daemon to perform the grunt work
of the distributed filesystem—reading and writing HDFS blocks to actual files on the local
filesystem. When you want to read or write an HDFS file, the file is broken into blocks and
the NameNode will tell your client which DataNode each block resides in.
Your client communicates directly with the DataNode daemons to process the local files
corresponding to the blocks. Furthermore, a DataNode may communicate with other
DataNodes to replicate its data blocks for redundancy.
In this illustration, each block has three replicas. For example, block 1 (used for data1) is
replicated over the three rightmost DataNodes. This ensures that if any one DataNode
crashes or becomes inaccessible over the network, you’ll still be able to read the files.
DataNodes are constantly reporting to the NameNode. Upon initialization, each of the
DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is
complete, the DataNodes continually poll the NameNode to provide information regarding
local changes as well as receive instructions to create, move, or delete blocks from the local
disk.
The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots
help minimize the downtime and loss of data. Nevertheless, a NameNode failure requires
human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once you
submit code to the cluster, the JobTracker determines the execution plan by determining which
files to process, assigns nodes to different tasks, and monitors all tasks as they’re running.
Should a task fail, the JobTracker will automatically relaunch the task, possibly on a
different node, up to a predefined limit of retries. There is only one JobTracker daemon per
Hadoop cluster. It’s typically run on a server as a master node of the cluster.
TaskTracker
As with the storage daemons, the computing daemons also follow master/slave architecture:
the JobTracker is the master overseeing the overall execution of a MapReduce job and the
TaskTrackers manage the execution of individual tasks on each slave node.
Figure 2.1 illustrates how one interacts with a Hadoop cluster. A Hadoop cluster is a set of
commodity machines networked together in one location. Data storage and processing all
occur within this “cloud” of machines. Different users can submit computing “jobs” to
Hadoop from individual clients, which can be their own desktop machines in remote
locations from the Hadoop cluster.
Standalone Mode:
With empty configuration files, Hadoop will run completely on the local machine. Because
there’s no need to communicate with other nodes, the standalone mode doesn’t use HDFS,
nor will it launch any of the Hadoop daemons. Its primary use is for developing and
debugging the application logic of a MapReduce program without the additional complexity
of interacting with the daemons.
Pseudo-Distributed Mode:
The pseudo-distributed mode is running Hadoop in a “cluster of one” with all daemons
running on a single machine. This mode complements the standalone mode for debugging
your code, allowing you to examine memory usage, HDFS input/output issues, and other
daemon interactions.
Both standalone and pseudo-distributed modes are for development and debugging purposes.
An actual Hadoop cluster runs in the third mode, the fully distributed mode.
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
Name node
Data Node
The Name Node is the prime node; it contains metadata (data about data) and requires
comparatively fewer resources than the data nodes that store the actual data. These data nodes
are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN helps to manage the resources
across the clusters. In short, it performs scheduling and resource allocation for
the Hadoop system.
It consists of three major components, i.e.
Resource Manager
Node Manager
Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system,
whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth
per machine, and later acknowledge the Resource Manager. The Application Manager works as
an interface between the Resource Manager and Node Managers and performs negotiations as per
the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing
logic and helps to write applications which transform big data sets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of the data, thereby organizing it into groups.
Map generates a key/value-pair-based result which is later processed by the Reduce()
method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In
simple terms, Reduce() takes the output generated by Map() as input and combines those tuples
into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just
the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of
the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large
data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time processing and batch processing. Also, all the
SQL data types are supported by Hive, making query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the
name suggests, helps a system to develop itself based on patterns, user/environmental
interaction, or algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
Apache Spark:
It’s a platform that handles all the process-intensive tasks like batch processing, interactive
or iterative real-time processing, graph conversions, and visualization, etc.
It uses in-memory resources, hence being faster than the prior tools in terms of
optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured data or
batch processing, hence both are used in most of the companies interchangeably.
Apache HBase:
It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything
within a Hadoop database. It provides the capabilities of Google’s BigTable, and is thus able to work
on big data sets effectively.
At times when we need to search for or retrieve the occurrences of something small in a huge
database, the request must be processed within a short span of time. At such times,
HBase comes in handy, as it gives us a fault-tolerant way of storing such data.
Zookeeper: There was a huge issue with the management, coordination, and synchronization
of the resources and components of Hadoop, which often resulted in inconsistency.
Zookeeper overcame all these problems by performing synchronization, inter-component
communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator
jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are those that are triggered when some data or an external
stimulus is given to them.