HDFS Architecture
Image: https://commons.wikimedia.org/wiki/File:Hdfsarchitecture.gif (Creative Commons Attribution-Share Alike 4.0 International license)
Features of HDFS
Goal of HDFS
Fault detection and recovery
Since HDFS runs on a large number of commodity hardware components, failure of individual components is frequent and to be expected.
HDFS therefore has mechanisms for quick, automatic fault detection and recovery, chiefly by replicating each block across several DataNodes (a configuration sketch follows this list).
Huge datasets
A cluster should be able to scale to hundreds of nodes in order to manage applications with huge datasets.
Hardware at data
A requested task can be carried out more efficiently when the computation takes place close to the data it operates on.
Moving computation to the data reduces network traffic and increases throughput.
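As a hedged illustration of the fault-tolerance goal, the following sketch shows how a client could set the replication factor when writing a file through the Hadoop FileSystem API; the three-way replication value and the file path are assumptions made for the example, not details from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Assumed defaults; a real cluster would load these from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");               // keep three copies of every block

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/data.txt"); // hypothetical path

        // Each block of the file is replicated to (up to) three DataNodes,
        // so the failure of a single node loses no data.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("fault tolerance through replication");
        }

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(file, (short) 2);
    }
}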
Namenode
The NameNode runs on commodity hardware and manages the file system namespace: it holds the metadata recording which blocks make up each file and on which DataNodes those blocks are stored.
Datanode
DataNodes also run on commodity hardware: they store the actual blocks of data, serve read and write requests from clients, and report back to the NameNode.
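To make this division of labour concrete, here is a minimal client-side sketch (the NameNode address and paths are assumptions for illustration): listing a directory is a metadata operation answered by the NameNode, while reading a file streams block data from the DataNodes that hold it.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeDataNodeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode address; a real deployment defines this in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example:8020"),
                                       new Configuration());

        // Metadata only: the listing (names, sizes, replication) comes from the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/user/example"))) {
            System.out.printf("%s  %d bytes  replication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }

        // Actual contents: the client reads block data from the DataNodes that hold it.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/example/data.txt"))))) {
            System.out.println(reader.readLine());
        }
    }
}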
Block
In HDFS, data is generally stored in files.
Files in the file system are divided into one or more segments, which are stored on individual DataNodes.
These file segments are called blocks.
A block is the minimum amount of data that HDFS can read or write.
The default block size is 64 MB, but it can be increased as needed (later Hadoop releases default to 128 MB); see the sketch below.
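The following sketch, with an assumed file path, queries the block-level view of a file through the FileSystem API: the block size in use and the hosts that hold each block, which is also the information that lets a scheduler place computation near the data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/big-dataset.csv");  // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size for this file: " + status.getBlockSize() + " bytes");

        // One BlockLocation per block: its offset, its length, and the DataNodes that hold it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}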
Apache Foundation
The Apache Software Foundation is an American non-profit corporation.
Decentralised community of open source developers
Develop a range of open source technologies and tools that are widely used.
Development of these tools is normally collaborative and consensus based, with an open software license.
Apache Ecosystem
Pig - a platform for analysing large data sets, consisting of a high-level language (Pig Latin) whose scripts are compiled into MapReduce jobs that run over data in HDFS.
Apache Mahout
Supports many of the data analysis tools and techniques we have seen, providing scalable implementations of methods such as clustering, classification and collaborative filtering.
Apache Spark
Apache Spark is a cluster computing technology designed for fast computation, achieving much of its speed through in-memory processing.
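As a hedged illustration of Spark's programming model (the input file and local master are assumptions, not details from the slides), a minimal word count in Spark's Java API looks like this:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCountSketch {
    public static void main(String[] args) {
        // "local[*]" runs on all local cores; on a real cluster this would point at the cluster manager.
        SparkConf conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The input could equally be an hdfs:// path served by the HDFS cluster described above.
            JavaRDD<String> lines = sc.textFile("input.txt");  // hypothetical input file

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.collect().forEach(pair ->
                    System.out.println(pair._1() + "\t" + pair._2()));
        }
    }
}

Because intermediate results are kept in memory across these transformations, the same computation avoids the repeated disk reads and writes that a chain of MapReduce jobs would incur.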