HDFS Architecture

[Figure: HDFS architecture diagram. Source: https://commons.wikimedia.org/wiki/File:Hdfsarchitecture.gif (Creative Commons Attribution-Share Alike 4.0 International licence).]

Features of HDFS

Suitable for distributed storage and processing of very large data sets
Hadoop provides a command-line interface for interacting with HDFS; applications can also interact with it programmatically (see the sketch below)
The built-in web servers of the namenode and datanodes let users easily check the status of the cluster
HDFS provides file permissions and authentication
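
A minimal sketch of interacting with HDFS through its Java FileSystem API, assuming the Hadoop client libraries are on the classpath and a configuration that points at a running cluster; the namenode address and directory path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: list a directory and report size and replication
// via the HDFS Java API. The address and path below are assumptions.
public class HdfsList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // assumed address
        FileSystem fs = FileSystem.get(conf);

        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}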

Goal of HDFS
Fault detection and recovery
Because HDFS runs on a large number of commodity machines, failure of components is frequent and to be expected
HDFS has mechanisms for quick, automatic fault detection and recovery

Huge datasets
A cluster should scale to hundreds of nodes to manage applications with huge datasets

Hardware at data
A requested task can be done more efficiently when the computation takes place close to the data
Moving the computation to the data reduces network traffic and increases throughput

Namenode
The namenode runs on commodity hardware:
GNU/Linux operating system
namenode software

The system hosting the namenode acts as the master server and does the following:
Manages the file system namespace
Regulates clients' access to files
Executes file system namespace operations such as opening, closing, and renaming files and directories (see the sketch below)
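
A minimal sketch of namespace operations that are coordinated by the namenode, using the Hadoop FileSystem API; the paths are hypothetical and the cluster address is taken from the default configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: create, rename, and delete directories. These are
// namespace operations handled by the namenode; the paths are examples only.
public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // uses fs.defaultFS from the configuration

        Path dir = new Path("/user/demo/input");
        fs.mkdirs(dir);                                   // create a directory
        fs.rename(dir, new Path("/user/demo/staging"));   // rename it
        fs.delete(new Path("/user/demo/staging"), true);  // recursive delete
        fs.close();
    }
}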

Datanode
Each datanode runs on commodity hardware:
GNU/Linux operating system
datanode software

For every node in a cluster there is a datanode

Datanodes perform read-write operations on the file system, as requested by clients
Datanodes also perform operations such as block creation, deletion, and replication based on instructions from the namenode

Block
Generally, user data is stored in files in HDFS
Files in the file system are divided into one or more segments, which are stored across individual datanodes
These file segments are called blocks
A block is the minimum amount of data that HDFS can read or write
The default block size is 64 MB (128 MB in newer Hadoop releases), and it can be increased as needed (see the sketch below)
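
A minimal sketch of requesting a larger block size for a single file via the FileSystem API; the cluster-wide default would normally be set with dfs.blocksize in hdfs-site.xml. The path and sizes here are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: write a file with a 256 MB block size instead of the default.
// The output path, replication factor, and buffer size are assumptions.
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 256L * 1024 * 1024; // 256 MB per block
        short replication = 3;
        int bufferSize = 4096;

        FSDataOutputStream out = fs.create(
                new Path("/user/demo/large-file.dat"),
                true,          // overwrite if it exists
                bufferSize,
                replication,
                blockSize);
        out.writeUTF("written with a 256 MB block size");
        out.close();
        fs.close();
    }
}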

Apache Foundation
The Apache Software Foundation is an American non-profit corporation
It is a decentralised community of open-source developers
The community develops a range of widely used open-source technologies and tools
Development of these tools is normally collaborative and consensus-based, under an open software licence

Apache Ecosystem
Pig - a platform for analysing large data sets, consisting of a high-level language (Pig Latin) that runs as MapReduce jobs over HDFS
HBase - an open-source NoSQL database
Hive - data warehouse software with an SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop; provides data summarisation, query, and analysis (see the sketch below)
Impala - an open-source, massively parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop
Mahout - scalable machine learning algorithms, many of which are implemented on the Hadoop platform
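
A minimal sketch of Hive's SQL-like interface queried from Java over JDBC, assuming a HiveServer2 instance at the hypothetical address below, a hypothetical table named web_logs, and credentials appropriate to the cluster's authentication setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative sketch: issue a HiveQL query over JDBC.
// Server address, database, table, and credentials are assumptions.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // requires the hive-jdbc driver
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}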

Apache Mahout
Supports many of the data analysis tools and techniques we have seen previously

Many companies, including Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo!, use Mahout
Foursquare uses Mahout's recommender engine for a variety of recommendations (see the sketch below)
Twitter uses Mahout for user interest modelling
Yahoo! uses Mahout for pattern mining

Mahout is moving away from MapReduce towards Apache Spark, while still maintaining what is already available
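
A minimal sketch of a recommender built with Mahout's Taste API (the classic user-based collaborative-filtering example); the ratings file name is hypothetical and the data is assumed to be one userID,itemID,rating triple per line.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Illustrative sketch: user-based collaborative filtering.
// ratings.csv is a hypothetical "userID,itemID,rating" file.
public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}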

Apache Spark
Apache Spark is a cluster computing technology designed for fast computation
It is based on Hadoop MapReduce and extends the MapReduce model

Spark is not a modified version of Hadoop
Hadoop is just one of the ways to deploy Spark
Spark can use Hadoop in two ways, storage and processing; since Spark has its own cluster-management computation, it often uses Hadoop for storage only
APIs in Java, Scala, or Python (see the sketch below)
As well as map and reduce, it supports SQL queries, streaming data, machine learning, and graph algorithms
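
A minimal sketch of Spark's Java API, a word count over a file in HDFS; the application name, master URL, and input path are hypothetical (local[*] is used here only so the sketch runs without a cluster).

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Illustrative sketch: word count using Spark's map and reduce-style operations.
// The master URL and HDFS path are assumptions.
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.take(10).forEach(t -> System.out.println(t._1() + " : " + t._2()));
        }
    }
}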
