
Important feature of Hadoop: Rack Awareness (or) Network topology awareness

Terms:

Node = a computer, typically non-enterprise commodity hardware

Rack = a collection of 30 to 40 nodes that are physically close together and all connected to the same network switch

Network bandwidth between two nodes in the same rack is greater than between two nodes in different racks

Cluster = Hadoop Cluster = collection of racks
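
The bandwidth rule above is the reason a cluster is usually modelled as a tree of racks and nodes. A minimal sketch, assuming nodes are named with hypothetical "/rack/node" paths and that distance is counted in hops to the closest common switch (0 for the same node, 2 for the same rack, 4 across racks):

```python
def distance(a, b):
    """Hops between two nodes given as '/rack/node' paths (hypothetical naming)."""
    if a == b:
        return 0  # same node
    rack_a = a.rsplit("/", 1)[0]
    rack_b = b.rsplit("/", 1)[0]
    if rack_a == rack_b:
        return 2  # same rack: node -> rack switch -> node
    return 4      # different racks: up to a shared switch and back down


def closest(reader, replicas):
    """Pick the replica that is cheapest to reach from the reader."""
    return min(replicas, key=lambda r: distance(reader, r))


print(closest("/rack1/node1", ["/rack2/node5", "/rack1/node3"]))
```

A reader on /rack1/node1 is steered to the copy on its own rack rather than the one behind another switch.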

Pre Hadoop 2.2 Architecture:


Two major components:

(1) Distributed File system


- Hadoop Distributed File system (HDFS)
- IBM Spectrum Scale

(2) MapReduce Engine


Map + Reduce
Has a built-in resource manager and scheduler

Pre Hadoop 2.2 has MapReduce v1

HDFS
HDFS runs on top of existing file system

Not POSIX compliant

Reliability is through replication

Designed to handle very large files. The larger the file, the smaller the share of time spent seeking to the next location on disk compared with reading data.

Reduce seeks by storing data in files that are as large as possible.

Hadoop is designed to seek once and then read sequentially.

Not for random access.

Default Hadoop block size: 64 MB. Recommended: 128 MB (this is the BigInsights default).
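
To see what the block size means in practice, here is a small sketch that counts how many blocks a file is split into. The function name is made up for illustration; the sizes are the two mentioned above.

```python
MB = 1024 * 1024


def num_blocks(file_size_bytes, block_size_bytes):
    """Number of blocks needed for a file (the last block may be partly filled)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


one_gb = 1024 * MB
print(num_blocks(one_gb, 64 * MB))   # blocks at the 64 MB default
print(num_blocks(one_gb, 128 * MB))  # blocks at the recommended 128 MB
```

Doubling the block size halves the number of blocks, which means fewer seeks and less metadata to track for the same file.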

Replication: each block is copied to multiple nodes. The replication factor can be set per file.
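
A rough sketch of what a per-file replication factor costs and buys. The default of 3 and the example paths are assumptions for illustration, not values taken from these notes.

```python
DEFAULT_REPLICATION = 3  # assumed cluster-wide default

# Hypothetical per-file overrides: path -> replication factor
per_file_replication = {"/tmp/scratch.dat": 1, "/data/critical.db": 5}


def replication_factor(path):
    """Per-file factor if one is set, otherwise the cluster default."""
    return per_file_replication.get(path, DEFAULT_REPLICATION)


def raw_storage_bytes(path, file_size_bytes):
    """Total disk space used across the cluster for one file."""
    return file_size_bytes * replication_factor(path)


def tolerated_node_failures(path):
    """A block stays readable as long as at least one replica survives."""
    return replication_factor(path) - 1


print(raw_storage_bytes("/data/critical.db", 100))
print(tolerated_node_failures("/logs/app.log"))
```
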
MapReduce
Map + Reduce

Map tasks run in parallel

Reduce tasks also run in parallel
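
The Map + Reduce idea can be sketched as a word count in plain Python. This is not the Hadoop API: the function names are made up, and a thread pool stands in for the cluster only to show that each map task and each reduce task is independent of the others.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapped):
    """Group all values by key so each key goes to exactly one reduce task."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(item):
    """Reduce: sum the counts collected for one word."""
    key, values = item
    return key, sum(values)


def word_count(splits):
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, splits))          # map tasks, one per split
        groups = shuffle(mapped)
        return dict(pool.map(reduce_task, groups.items())) # reduce tasks, one per key


print(word_count(["a b a", "b c"]))
```
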

Types of nodes:
Major types: HDFS (or) MapReduce

HDFS: NameNode + DataNode

MapReduce v1: JobTracker + TaskTracker

Secondary nodes: Checkpoint (or) backup nodes

A client communicates with the JobTracker, which in turn communicates with the TaskTrackers and the NameNode

Namenode:
There is only one NameNode per cluster.

Metadata is stored at the NameNode.

The NameNode should have as much RAM as possible because it keeps the entire filesystem metadata in memory.
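
A back-of-the-envelope sketch of why NameNode memory grows with the number of files and blocks. The figure of roughly 150 bytes per metadata object is a commonly quoted rule of thumb and is an assumption here, not a measured value.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_heap_bytes(num_files, blocks_per_file):
    """Rough metadata footprint: one entry per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Same number of blocks overall, stored as many small files or a few large ones.
many_small = namenode_heap_bytes(num_files=1_000_000, blocks_per_file=1)
few_large = namenode_heap_bytes(num_files=1_000, blocks_per_file=1_000)
print(many_small, few_large)
```

Under these assumptions, the same number of blocks stored as many small files needs about twice the metadata memory, which is one more reason to prefer large files.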

DataNode:
Many per Hadoop cluster

Blocks from different files can be stored on the same DataNode

When a client requests a file, it asks the NameNode which DataNodes hold the blocks that make up that file.

(The NameNode knows which DataNodes hold each block.)
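
The read path described above can be sketched as follows. The class and function names are hypothetical and the DataNodes are plain dictionaries; the point is only that the NameNode returns locations while the bytes come from the DataNodes.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]][block_id]  # first replica, for simplicity
    return data


nn = NameNode()
nn.file_blocks["/demo.txt"] = ["blk_1", "blk_2"]
nn.block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]}
datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}
print(read_file(nn, datanodes, "/demo.txt"))
```
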

Each DataNode periodically reports the list of blocks it stores to the NameNode

DataNodes are designed to run on commodity (non-enterprise) hardware; replication is provided at the software layer.
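
Because replication is handled in software, block reports are what let the NameNode notice missing copies. A minimal sketch, assuming reports arrive as a hypothetical mapping from DataNode name to the block ids it holds:

```python
from collections import defaultdict


def build_block_map(reports):
    """reports: DataNode name -> iterable of block ids it stores."""
    locations = defaultdict(set)
    for datanode, blocks in reports.items():
        for block in blocks:
            locations[block].add(datanode)
    return locations


def under_replicated(locations, factor):
    """Blocks with fewer live copies than the wanted replication factor."""
    return sorted(b for b, nodes in locations.items() if len(nodes) < factor)


reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1"],
    "dn3": ["blk_1", "blk_2"],
}
locations = build_block_map(reports)
print(under_replicated(locations, factor=3))
```

Here blk_2 has only two copies, so it would be a candidate for re-replication onto another node.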

Jobtracker Node:
Only one per cluster

Manages the MapReduce jobs in the cluster.


Receives job requests submitted by the client.

It schedules Map tasks and Reduce tasks on the appropriate Task Trackers.
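
One way to read "appropriate" is data locality: prefer a TaskTracker on a node that already holds the task's input block. The sketch below is an assumed, simplified policy for illustration and not the actual JobTracker scheduler; all names are hypothetical.

```python
def schedule_map_tasks(block_locations, free_slots):
    """block_locations: task id -> set of nodes holding its input block.
    free_slots: node -> number of free map slots. Returns task id -> node."""
    assignment = {}
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    for task, nodes in block_locations.items():
        local = [n for n in sorted(nodes) if slots.get(n, 0) > 0]
        anywhere = [n for n in sorted(slots) if slots[n] > 0]
        choice = (local or anywhere or [None])[0]
        if choice is None:
            break  # no free slots left; remaining tasks must wait
        assignment[task] = choice
        slots[choice] -= 1
    return assignment


tasks = {"m1": {"tt1"}, "m2": {"tt1"}, "m3": {"tt2"}}
print(schedule_map_tasks(tasks, {"tt1": 1, "tt2": 2}))
```

Task m2 cannot run next to its data because tt1 has no free slot left, so it falls back to tt2 and reads its block over the network.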

Task Tracker
There are many TaskTrackers per cluster so that Map and Reduce tasks can run in parallel.

Each TaskTracker spawns JVMs to run Map or Reduce tasks

Hadoop 2.2 architecture


The resource manager and scheduler are external to any processing framework.

DataNodes still exist.

The JobTracker and TaskTrackers no longer exist.

YARN: Yet Another Resource Negotiator

Two main ideas with YARN:

(1) Provide generic scheduling and resource management


(2) More efficient scheduling and workload management

In MapReduce v1, you have to define how many Map slots and how many Reduce slots each node has.
With MapReduce v2 (on YARN), you no longer have to do that.
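
A simplified sketch of why fixed slots can waste capacity. It assumes every task fits in one equal-sized unit of capacity (a slot in MapReduce v1, a generic container under YARN); real YARN containers are sized by memory and CPU, which this ignores.

```python
def mrv1_running(pending_maps, pending_reduces, map_slots, reduce_slots):
    """Fixed slots: a Map task can never use an idle Reduce slot, or vice versa."""
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)


def yarn_running(pending_maps, pending_reduces, containers):
    """Generic containers: any pending task can take any free container."""
    return min(pending_maps + pending_reduces, containers)


# Map-heavy phase: 8 Map tasks waiting, no Reduce tasks yet.
print(mrv1_running(8, 0, map_slots=4, reduce_slots=4))
print(yarn_running(8, 0, containers=8))
```

With the same total capacity, the fixed-slot cluster leaves its Reduce slots idle during a Map-heavy phase, while the generic pool can run every waiting task.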

YARN: Resource Management


HDFS: Distributed Storage
MapReduce v2: processing
