• In 2002, Doug Cutting and Mike Cafarella started work on a project, Apache Nutch. It is an open-source web crawler software project.
• While working on Apache Nutch, they were dealing with big data.
• In 2003, Google introduced a file system known as GFS (Google File System). It is a distributed file system developed to provide efficient access to data.
• In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also includes MapReduce.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
• Doug Cutting named his project Hadoop after his son's toy elephant.
• In 2007, Yahoo ran two clusters of 1000 machines.
• In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
• In 2013, Hadoop 2.2 was released. In 2017, Hadoop 3.0 was released.
• The latest version of Hadoop is 3.3.6, which was released on June 23, 2023.
Distributed processing means that a specific task can be broken up into functions, and the functions are dispersed across two or more interconnected processors. A distributed application is an application whose component programs are distributed between two or more interconnected processors.
More often, however, distributed processing refers to local-area networks (LANs) designed
so that a single program can run simultaneously at various sites. Most distributed processing
systems contain sophisticated software that detects idle CPUs on the network and parcels out
programs to utilize them.
HDFS:
HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled after
the Google File System (GFS) paper. HDFS is optimized for high throughput and works best
when reading and writing large files (gigabytes and larger). To support this throughput HDFS
leverages unusually large (for a filesystem) block sizes and data locality optimizations to
reduce network input/output (I/O).
Scalability and availability are also key traits of HDFS, achieved in part through data
replication and fault tolerance. HDFS replicates each file a configurable number of times, is
tolerant of both software and hardware failure, and automatically re-replicates data blocks from
nodes that have failed.
Figure 1.2. HDFS architecture shows an HDFS client communicating with the master
NameNode and slave DataNodes.
Figure 1.2 shows a logical representation of the components in HDFS: the NameNode and
the DataNode. It also shows an application that’s using the Hadoop filesystem library to
access HDFS.
MAPREDUCE
MapReduce is a batch-based, distributed computing framework modeled after Google’s
paper on MapReduce. It allows you to parallelize work over a large amount of raw data,
such as combining web logs with relational data from an OLTP database to model how
users interact with your website. This type of work, which could take days or longer using
conventional serial programming techniques, can be reduced to minutes using
MapReduce on a Hadoop cluster.
The MapReduce model simplifies parallel processing by abstracting away the complexities
involved in working with distributed systems, such as computational parallelization, work
distribution, and dealing with unreliable hardware and software. With this abstraction,
MapReduce allows the programmer to focus on addressing business needs, rather than
getting tangled up in distributed system complications.
Figure 1.3. A client submitting a job to MapReduce
MapReduce decomposes work submitted by a client into small parallelized map and reduce
tasks, as shown in Figure 1.3. The map and reduce constructs used in MapReduce
are borrowed from those found in the Lisp functional programming language, and use a
shared-nothing model to remove any parallel execution interdependencies that could add
unwanted synchronization points or state sharing.
The role of the programmer is to define map and reduce functions, where the map function
outputs key/value tuples, which are processed by reduce functions to produce the final output.
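The flow above can be sketched in a few lines of Python. This is a toy, single-process simulation of the programming model, not Hadoop's Java API: `map_fn` emits key/value tuples, an in-memory sort stands in for the framework's shuffle, and `reduce_fn` folds each key's values into the final output.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    """Map: emit a (word, 1) tuple for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word to produce the final tally."""
    yield word, sum(counts)

def run_job(lines):
    # Map phase: apply map_fn to every input record.
    mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    # Shuffle phase: group the tuples by key, as the framework would.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(
        kv
        for key, group in groupby(mapped, key=itemgetter(0))
        for kv in reduce_fn(key, (v for _, v in group))
    )

result = run_job(["the quick brown fox", "the lazy dog"])
print(result)  # "the" maps to 2, every other word to 1
```

Note how each `map_fn` call and each `reduce_fn` call touches only its own inputs; that is the shared-nothing property that lets the real framework run them on different machines.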
Though one can run several DataNodes on a single machine, in the practical
world these DataNodes are spread across various machines.
NameNode:
NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains
and manages the blocks present on the DataNodes (slave nodes).
NameNode is a very highly available server that manages the File System Namespace
and controls access to files by clients.
The HDFS architecture is built in such a way that the user data never resides on the
NameNode. The data resides on DataNodes only.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. the location of blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
FsImage: It contains the complete state of the file system namespace since the start of
the NameNode.
EditLogs: It contains all the recent modifications made to the file system with respect
to the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located.
The NameNode is also responsible to take care of the replication factor of all the
blocks which we will discuss in detail later in this HDFS tutorial blog.
In case of DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage, and manages the communication traffic to the
DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, an inexpensive system which is not of high quality or high availability.
Functions of DataNode:
These are slave daemons or processes that run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file system’s
clients.
They send heartbeats to the NameNode periodically to report the overall health of HDFS;
by default, this frequency is set to 3 seconds.
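The heartbeat mechanism can be modeled with a small Python sketch. This is an illustration, not Hadoop's implementation: the class and method names are invented, and in real HDFS the NameNode marks a DataNode dead only after roughly ten minutes of silence, not after a single missed 3-second heartbeat.

```python
import time

class HeartbeatMonitor:
    """Toy model of how a NameNode could track DataNode liveness."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a node is considered dead
        self.clock = clock       # injectable clock, handy for testing
        self.last_seen = {}      # DataNode id -> time of its last heartbeat

    def heartbeat(self, node_id):
        """Record a heartbeat from a DataNode (sent every ~3 s by default)."""
        self.last_seen[node_id] = self.clock()

    def live_nodes(self):
        """Return the DataNodes that have reported recently enough."""
        now = self.clock()
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}
```

Injecting the clock keeps the sketch deterministic: a test can advance a fake clock past the timeout and observe a node drop out of `live_nodes()`.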
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode.
The Secondary NameNode works concurrently with the primary NameNode as a helper
daemon.
It is responsible for combining the EditLogs with FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular intervals and applies them to the
FsImage. The new FsImage is copied back to the NameNode, and is used the next time
the NameNode is started.
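A minimal Python sketch of that checkpoint step, with invented operation names (real EditLog records are more varied): replaying the logged operations on top of the old FsImage yields the new one.

```python
def checkpoint(fsimage, edit_log):
    """Toy sketch of the Secondary NameNode checkpoint: replay the EditLog
    entries on top of the last FsImage to produce a fresh FsImage."""
    image = dict(fsimage)          # namespace snapshot: path -> metadata
    for op, path, meta in edit_log:
        if op == "create":
            image[path] = meta     # a file entered the namespace after the snapshot
        elif op == "delete":
            image.pop(path, None)  # a file was removed after the snapshot
    return image

old_image = {"/a": {"size": 1}}
edits = [("create", "/b", {"size": 2}), ("delete", "/a", None)]
print(checkpoint(old_image, edits))  # only /b survives
```

Folding the log into the image this way is what keeps NameNode restarts fast: the node loads one compact FsImage instead of replaying an ever-growing EditLog.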
Blocks:
Now, as we know, the data in HDFS is scattered across the DataNodes as blocks.
Let’s have a look at what a block is and how it is formed.
Blocks are nothing but the smallest continuous locations on your hard drive where
data is stored. In general, in any file system, you store data as a collection
of blocks.
Similarly, HDFS stores each file as blocks which are scattered throughout the Apache
Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64
MB in Apache Hadoop 1.x) which you can configure as per your requirement.
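For instance, the block layout of a file follows directly from the file size and the configured block size; the Python sketch below computes it for the 128 MB default. Note that the last block occupies only as much space as the data it actually holds.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes of the HDFS blocks a file of file_size bytes
    would occupy, using the Hadoop 2.x default block size of 128 MB."""
    full, last = divmod(file_size, block_size)
    # `full` whole blocks, plus one partial block if there is a remainder.
    return [block_size] * full + ([last] if last else [])

MB = 1024 * 1024
print(split_into_blocks(300 * MB))  # three blocks: 128 MB, 128 MB, and 44 MB
```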
Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance.
The default replication factor is 3, which is again configurable. So, as you can see in
the figure below, each block is replicated three times and stored on different
DataNodes (considering the default replication factor):
Rack Awareness:
The NameNode also ensures that all the replicas of a block are not stored on the same
(single) rack.
Considering a replication factor of 3, the Rack Awareness Algorithm says that the
first replica of a block will be stored on the local rack and the next two replicas will be
stored on a different (remote) rack, each on a different DataNode within that remote
rack, as shown in the figure above.
If you have more replicas, the remaining replicas will be placed on random
DataNodes, provided that no more than two replicas reside on the same rack, if possible.
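The three-replica placement rule can be sketched as follows in Python. This is an illustrative model (rack and node names are invented), not Hadoop's actual BlockPlacementPolicy code.

```python
import random

def place_replicas(racks, local_rack, rng=random):
    """Sketch of the default rack-aware placement for replication factor 3:
    first replica on the writer's (local) rack, the other two on two
    different DataNodes of a single remote rack.
    `racks` maps a rack id to its list of DataNode ids."""
    first = rng.choice(racks[local_rack])                     # replica 1: local rack
    remote = rng.choice([r for r in racks if r != local_rack])  # pick one remote rack
    second, third = rng.sample(racks[remote], 2)              # replicas 2 and 3: distinct nodes there
    return [first, second, third]
```

This layout survives the loss of a whole rack (one replica is always elsewhere) while keeping write traffic between racks low, since two of the three replicas share a rack.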
The NameNode will then grant the client the write permission and will provide the IP
addresses of the DataNodes where the file blocks will be copied eventually.
Let’s say the replication factor is set to the default, i.e. 3. Therefore, for each block the
NameNode will provide the client a list of (3) IP addresses of DataNodes. The
list will be unique for each block.
Suppose, the NameNode provided following lists of IP addresses to the client:
For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
Each block will be copied to three different DataNodes to keep the replication
factor consistent throughout the cluster.
Now the whole data copy process will happen in three stages:
Set up of Pipeline
Data streaming and replication
Shutdown of Pipeline (Acknowledgement stage)
The NameNode will return the list of DataNodes where each block (Block A and B) are
stored.
After that, the client will connect to the DataNodes where the blocks are stored.
The client starts reading data in parallel from the DataNodes (Block A from DataNode 1
and Block B from DataNode 3).
Once the client gets all the required file blocks, it will combine these blocks to form a
file.
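The read flow can be summarized with a small Python mock-up. All names here (the `namenode` and `datanodes` dictionaries, the block ids) are invented for the illustration, and the blocks are fetched sequentially rather than in parallel to keep the sketch short.

```python
def read_file(block_ids, namenode, datanodes):
    """Toy sketch of the HDFS read path: ask the (mock) NameNode which
    DataNodes hold each block, fetch every block from one replica, then
    concatenate the blocks in order to rebuild the file."""
    parts = []
    for block_id in block_ids:
        locations = namenode[block_id]   # DataNodes holding replicas of this block
        node = locations[0]              # pick one replica, e.g. the closest
        parts.append(datanodes[node][block_id])
    return b"".join(parts)

# Mock cluster state: block locations on the NameNode, block data on DataNodes.
namenode = {"blockA": ["dn1", "dn4"], "blockB": ["dn3", "dn7"]}
datanodes = {
    "dn1": {"blockA": b"Hello, "},
    "dn3": {"blockB": b"HDFS!"},
}
print(read_file(["blockA", "blockB"], namenode, datanodes))  # b'Hello, HDFS!'
```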
1.6. Hadoop Master – Slave Architecture
– The Master (NameNode) manages the file system namespace operations like opening,
closing, and renaming files and directories, determines the mapping of blocks to
DataNodes, and regulates access to files by clients.
– Slaves (DataNodes) are responsible for serving read and write requests from the file
system’s clients, and perform block creation, deletion, and replication upon instruction
from the Master (NameNode).
Master: JobTracker
Slaves: TaskTracker
– The Master {JobTracker} is the point of interaction between users and the map/reduce
framework. When a map/reduce job is submitted, the JobTracker puts it in a queue of pending jobs,
executes them on a first-come/first-served basis, and manages the assignment of map
and reduce tasks to the TaskTrackers.
– Slaves {TaskTracker} execute tasks upon instruction from the Master {JobTracker} and also
handle data motion between the map and reduce phases.
The following is the master-slave architecture in which the NameNode and JobTrackers are
masters and the DataNodes and TaskTrackers are slaves.
NameNode
Hadoop employs a master/slave architecture for both distributed storage and distributed
computation. The distributed storage system is called the Hadoop File System, or HDFS.
The NameNode is the master of HDFS that directs the slave DataNode daemons to perform
the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how
your files are broken down into file blocks, which nodes store those blocks, and the overall
health of the distributed filesystem.
The function of the NameNode is memory and I/O intensive. As such, the server hosting the
NameNode typically doesn’t store any user data or perform any computations for a
MapReduce program to lower the workload on the machine. This means that the NameNode
server doesn’t double as a DataNode or a TaskTracker.
DataNode:
Each slave machine in your cluster will host a DataNode daemon to perform the grunt work
of the distributed filesystem—reading and writing HDFS blocks to actual files on the local
filesystem. When you want to read or write an HDFS file, the file is broken into blocks and
the NameNode will tell your client which DataNode each block resides in.
Your client communicates directly with the DataNode daemons to process the local files
corresponding to the blocks. Furthermore, a DataNode may communicate with other
DataNodes to replicate its data blocks for redundancy.
In this illustration, each block has three replicas. For example, block 1 (used for data1) is
replicated over the three rightmost DataNodes. This ensures that if any one DataNode
crashes or becomes inaccessible over the network, you’ll still be able to read the files.
DataNodes are constantly reporting to the NameNode. Upon initialization, each of the
DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is
complete, the DataNodes continually poll the NameNode to provide information regarding
local changes as well as receive instructions to create, move, or delete blocks from the local
disk.
The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots
help minimize the downtime and loss of data. Nevertheless, a NameNode failure requires
human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once you
submit code to the cluster, the JobTracker determines the execution plan by determining which
files to process, assigns nodes to different tasks, and monitors all tasks as they’re running.
Should a task fail, the JobTracker will automatically relaunch the task, possibly on a
different node, up to a predefined limit of retries. There is only one JobTracker daemon per
Hadoop cluster. It’s typically run on a server as a master node of the cluster.
TaskTracker
As with the storage daemons, the computing daemons also follow master/slave architecture:
the JobTracker is the master overseeing the overall execution of a MapReduce job and the
TaskTrackers manage the execution of individual tasks on each slave node.
Figure 2.1 illustrates how one interacts with a Hadoop cluster. A Hadoop cluster is a set of
commodity machines networked together in one location. Data storage and processing all
occur within this “cloud” of machines. Different users can submit computing “jobs” to
Hadoop from individual clients, which can be their own desktop machines in remote
locations from the Hadoop cluster.
Standalone Mode:
With empty configuration files, Hadoop will run completely on the local machine. Because
there’s no need to communicate with other nodes, the standalone mode doesn’t use HDFS,
nor will it launch any of the Hadoop daemons. Its primary use is for developing and
debugging the application logic of a MapReduce program without the additional complexity
of interacting with the daemons.
Pseudo-Distributed Mode:
The pseudo-distributed mode is running Hadoop in a “cluster of one” with all daemons
running on a single machine. This mode complements the standalone mode for debugging
your code, allowing you to examine memory usage, HDFS input/output issues, and other
daemon interactions.
Both standalone and pseudo-distributed modes are for development and debugging purposes.
An actual Hadoop cluster runs in the third mode, the fully distributed mode.
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
Name node
Data Node
The Name Node is the prime node; it contains metadata (data about data) and requires
comparatively fewer resources than the data nodes that store the actual data. These data nodes
are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN helps to manage the resources
across the clusters. In short, it performs scheduling and resource allocation for
the Hadoop system.
It consists of three major components, i.e.
Resource Manager
Node Manager
Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system,
whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth
per machine, and later acknowledge the Resource Manager. The Application Manager works as
an interface between the Resource Manager and Node Managers and performs negotiations as per
the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing
logic and helps to write applications which transform big data sets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of the data, thereby organizing it into groups.
Map generates a key/value-pair-based result which is later processed by the Reduce()
method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In
simple terms, Reduce() takes the output generated by Map() as input and combines those tuples
into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just
the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of
the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large
data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time processing and batch processing. Also, all the
SQL data types are supported by Hive, making query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the
name suggests, helps a system to develop itself based on patterns, user/environmental
interaction, or algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
Apache Spark:
It’s a platform that handles all the process-intensive tasks like batch processing, interactive
or iterative real-time processing, graph conversions, and visualization, etc.
It uses in-memory resources, hence being faster than the prior tools in terms of
optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured data or
batch processing, hence both are used in most of the companies interchangeably.
Apache HBase:
It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything
within a Hadoop database. It provides the capabilities of Google’s BigTable, and is thus able to work
on big data sets effectively.
At times when we need to search for or retrieve the occurrences of something small in a huge
database, the request must be processed within a short span of time. At such times,
HBase comes in handy, as it gives us a fault-tolerant way of storing such data.
Zookeeper: There was a huge issue with the management, coordination, and synchronization
of the resources and components of Hadoop, which often resulted in inconsistency.
Zookeeper overcame all these problems by performing synchronization, inter-component
communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator
jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are those that are triggered when some data or an external
stimulus is given to them.