Hadoop (Bigdatauniversity)
Terms:
Rack = a collection of roughly 30 to 40 nodes that are physically close together and all connected to the same network switch
Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks
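The rack idea can be pictured as a tree: nodes hang off a rack switch, and racks hang off the cluster network. A minimal sketch of a hop-count distance, assuming a two-level topology and hypothetical node paths of the form "/rack/node" (the function name is illustrative, not Hadoop's API):

```python
def network_distance(a: str, b: str) -> int:
    """Count hops between two nodes given as '/rack/node' paths (hypothetical format)."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    # Length of the shared prefix, i.e. depth of the closest common ancestor.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    # Hops up from a to the common ancestor, plus hops down to b.
    return (len(path_a) - common) + (len(path_b) - common)


print(network_distance("/rack1/node1", "/rack1/node1"))  # same node
print(network_distance("/rack1/node1", "/rack1/node2"))  # same rack
print(network_distance("/rack1/node1", "/rack2/node5"))  # different racks
```

A smaller distance stands in for higher available bandwidth, which is why work is preferably placed on the same node or rack as its data.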
HDFS
HDFS runs on top of existing file system
Designed to handle very large files. The larger the file, the less time is spent seeking the next location on disk relative to the time spent reading data.
Default Hadoop block size: 64 MB (Hadoop 1.x). Recommended: 128 MB (this is the BigInsights default)
Replication: each block is replicated across multiple nodes. The replication factor can be set per file
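The block size and replication notes above can be turned into quick arithmetic. A minimal sketch, assuming a replication factor of 3 for the example (the function names are illustrative):

```python
MB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks needed for a file (ceiling division, integers only)."""
    return -(-file_size_bytes // block_size_bytes)


def raw_storage_bytes(file_size_bytes: int, replication: int) -> int:
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size_bytes * replication


file_size = 1024 * MB  # a 1 GB file
print(block_count(file_size, 64 * MB))    # blocks at 64 MB
print(block_count(file_size, 128 * MB))   # blocks at 128 MB
print(raw_storage_bytes(file_size, 3) // MB, "MB of raw storage at replication 3")
```

Doubling the block size halves the number of blocks the NameNode has to track for the same file, while replication multiplies the disk space used.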
MapReduce
Map + Reduce
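To make "Map + Reduce" concrete, here is a minimal single-process word-count sketch. It only imitates the programming model (map, group by key, reduce) in plain Python; the function names and sample input are made up for illustration, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data node"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map calls run in parallel on different nodes, and the grouping step moves data across the network to the reducers.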
Types of nodes:
Two major types: HDFS nodes and MapReduce nodes
A client communicates with the JobTracker, which in turn communicates with the TaskTrackers and the NameNode
NameNode:
There is only one NameNode per cluster.
The NameNode should have as much RAM as possible, because it keeps the entire filesystem metadata in memory.
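Because all metadata lives in RAM, the number of files and blocks drives the NameNode's memory needs. A rough sketch, assuming a commonly quoted rule of thumb of about 150 bytes per namespace object (file or block); that figure is an assumption for illustration, not a measured value, and the function name is hypothetical:

```python
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb, not a measured value


def namenode_heap_estimate(num_files: int, blocks_per_file: int) -> int:
    """Estimate NameNode memory in bytes: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


estimate = namenode_heap_estimate(num_files=10_000_000, blocks_per_file=1)
print(f"{estimate / 1024**3:.1f} GiB for 10 million single-block files")
```

The estimate grows with the number of objects rather than the bytes stored, which is one reason many small files are harder on the NameNode than a few very large ones.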
DataNode:
Many per Hadoop cluster
When a client requests a file, it finds out from the NameNode which DataNodes hold the blocks that make up that file.
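The read path described above (ask the NameNode for block locations, then fetch the blocks from DataNodes) can be sketched with toy in-memory classes. Every class, method, and path name below is hypothetical and only mirrors the division of roles, not the real HDFS client protocol:

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Client side: look up locations first, then read each block from a holder."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].read(block_id)  # first replica, for simplicity
    return data


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_1", b"hello ")
dn2.store("blk_1", b"hello ")  # second replica of the same block
dn2.store("blk_2", b"world")
datanodes = {"dn1": dn1, "dn2": dn2}

nn = NameNode()
nn.add_file("/logs/a.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])])
print(read_file("/logs/a.txt", nn, datanodes))
```

Note that file data never passes through the NameNode in this sketch; it only answers the "where" question.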
JobTracker node:
Only one per cluster
It schedules Map tasks and Reduce tasks on the appropriate Task Trackers.
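One reading of "appropriate" is a TaskTracker that already holds the input block, so the map task reads locally. A deliberately simplified sketch of that preference, with hypothetical names; a real scheduler weighs more factors (for example rack locality) than this toy does:

```python
def assign_map_task(block_holders, free_trackers):
    """Pick a TaskTracker for one map task, preferring a node that stores its input block.

    block_holders: set of node names holding a replica of the input block
    free_trackers: list of node names with spare capacity, in arrival order
    """
    for tracker in free_trackers:
        if tracker in block_holders:
            return tracker, "data-local"
    if free_trackers:
        return free_trackers[0], "remote"  # input must be read over the network
    return None, "wait"


print(assign_map_task({"tt2", "tt3"}, ["tt1", "tt2"]))
print(assign_map_task({"tt3"}, ["tt1"]))
print(assign_map_task({"tt3"}, []))
```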
TaskTracker:
There are many TaskTrackers, so that Map and/or Reduce tasks can run in parallel.
In MapReduce v1, you have to define how many Map slots and how many Reduce slots each TaskTracker has.
You no longer have to do that with MapReduce v2 (YARN).
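The practical difference between fixed slots and a shared pool can be shown with a small counting sketch. In MapReduce v1 the slot counts are set per TaskTracker through the properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. The functions below are hypothetical, and the shared-pool model is a simplification: YARN actually grants containers based on requested resources such as memory, not a plain count.

```python
def running_with_fixed_slots(pending_maps, pending_reduces, map_slots, reduce_slots):
    """MRv1-style: each task type can only use slots of its own kind."""
    return min(pending_maps, map_slots), min(pending_reduces, reduce_slots)


def running_with_shared_pool(pending_maps, pending_reduces, total_capacity):
    """Simplified v2-style: any task may use any free capacity (maps served first here)."""
    maps = min(pending_maps, total_capacity)
    reduces = min(pending_reduces, total_capacity - maps)
    return maps, reduces


# A map-heavy moment: 10 maps waiting, no reduces yet, capacity for 6 tasks in total.
print(running_with_fixed_slots(10, 0, map_slots=4, reduce_slots=2))
print(running_with_shared_pool(10, 0, total_capacity=6))
```

With fixed slots, the two reduce slots sit idle while maps queue up; with a shared pool the same capacity can be used by whichever task type is waiting.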