
BIG DATA
Introduction, Architecture & Details

By
Soumyajit Basu
ABSTRACT
In today's world of complex business applications, enormous amounts of data are generated every day; some of it has a well-defined structure and some of it does not. This data may come from sensors used to gather climatic information, posts to social media sites, digital pictures and videos, purchase transaction records, or cell phone GPS signals. Data of this kind is what we refer to as BIG DATA.
SO WHAT IS BIG DATA?
Big Data describes massive volumes of both structured and unstructured data that are difficult to process using traditional database and software techniques. Big Data has the potential to help companies make their operations faster. The wheel turning behind the working of Big Data is HDFS, the Hadoop Distributed File System.
THE THREE VS OF BIG DATA
Big Data is defined using these three aspects.
Volume: Refers to the sheer amount of data that has to be stored, processed and manipulated.
Velocity: Refers to the speed at which data arrives and must be processed and made available to users. Some transactions require very fast turnaround, and even the slightest downtime can cause severe problems for the business.
Variety: Refers to the mix of structured and unstructured data being processed. Big Data can be any type of data, structured or unstructured, such as text, sensor data, audio, video, log files and more. New insights are found when these data types are monitored together. For example, hundreds of live video feeds from surveillance cameras can be monitored to target points of interest, and the roughly 80% growth in images, videos and documents can be exploited to improve customer satisfaction.
Two other factors also affect how data is manipulated.

Variability: Data coming from different sources can be highly inconsistent: its characteristics vary, and the inward flow of data can spike to peak values. In today's world it is therefore hard to commit to dealing with only one specific kind of data. Daily, event-triggered peak data loads, for example, can be challenging to manage, and even more so when unstructured data is involved.
Complexity: Today's data comes from multiple sources, and it is still a research challenge to link, match, cleanse and transform data across systems. It is also necessary to connect and correlate relationships across hierarchies and multiple data linkages.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS) - ARCHITECTURE
HDFS was designed as a scalable, fault-tolerant distributed file system that works closely with MapReduce. By distributing storage and computation across many servers, the storage resource can grow with demand while remaining economical at every size.
The following features help keep HADOOP clusters highly functional and highly available.
Rack Awareness: Allows a node's physical location to be taken into account when allocating storage and scheduling tasks.
Minimal Data Motion: MapReduce moves the compute processes to the data on HDFS, so processing tasks can run on the physical node where the data resides. This significantly reduces network I/O, keeps most I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth (see the word-count sketch after this list).
Utilities: Diagnose the health of the file system and can rebalance the data across nodes.
Rollback: Allows system operators to bring back the previous version of HDFS after an upgrade, in case of human error.
Standby Name Node: Provides redundancy and supports high availability.
Highly Operable: The design keeps clusters easy to operate and maintain.
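
As a concrete (and generic) illustration of MapReduce shipping computation to the data, the sketch below is the classic word-count job written against the Hadoop Java API. Class names and input/output paths are placeholders chosen for the example; they are not taken from this article.

```java
// Minimal MapReduce sketch (classic word count). Map tasks are scheduled,
// where possible, on the nodes that already hold the input blocks, which is
// the "minimal data motion" property described above.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map task runs on (or near) the DataNode that stores the input block.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // The reduce task sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/books
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /out/wordcount
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```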

ARCHITECTURE OF HDFS

An HDFS cluster consists of a Name Node, which manages the cluster metadata, and Data Nodes, which store the data. Files and directories are represented on the Name Node by inodes. Inodes record attributes such as permissions, modification and access times, and namespace quotas.
The Name Node does not directly send requests to Data Nodes. It sends instructions to the Data Nodes by replying to the heartbeats those Data Nodes send. The instructions include commands to replicate blocks to other nodes, send an immediate block report, or shut down the node.
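
A minimal sketch of how a client interacts with this division of labour, using the standard Hadoop FileSystem API: metadata requests such as directory listings are answered by the Name Node, while the file bytes are streamed from Data Nodes. The cluster address and paths are placeholders.

```java
// Client-side sketch against the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address
    FileSystem fs = FileSystem.get(conf);

    // Metadata operations (listings, permissions, sizes) are served by the Name Node.
    for (FileStatus status : fs.listStatus(new Path("/data"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    // Reading a file: the Name Node supplies block locations, then the client
    // streams the blocks directly from the Data Nodes that hold them.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/sample.txt")),
                              StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}
```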
REPLICATION AND RELIABILITY
In order to achieve reliability with physical servers, one has to resort to a replication scheme in which multiple redundant copies of the data are stored. In normal operation the cluster needs to write and replicate incoming data as quickly as possible while providing consistent read operations. When a crash occurs, the customer is exposed to data loss until the replica is recovered and resynchronized. This exposure is measured by the Mean Time to Data Loss (MTTDL): roughly, how long the system can be expected to run before data is actually lost. The essence is to maximize the MTTDL by quickly resynchronizing the replica without impacting the overall cluster performance and reliability.
THE HDFS SOLUTION TO THE PROBLEM
HDFS takes a different approach to the problem of synchronizing data. It bypasses the core problem by making everything effectively read-only, so re-syncing data is trivial. There is a single writer per file, and no read operations are allowed on a file while it is being written. The file close transaction is what makes the data visible to readers. When a failure occurs, unclosed files are deleted. This model assumes that corrupted, partially written data need not be presented at all, and discards it as if it had never been produced.
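
The single-writer, visible-on-close model can be sketched with the Hadoop client API as follows; the path is a placeholder and the comments simply restate the semantics described above.

```java
// Sketch of HDFS's single-writer model: bytes become visible when the file is closed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class SingleWriterSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/logs/events.txt");   // placeholder path

    // Only one writer may hold this file open for writing at a time.
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("event-1\n".getBytes(StandardCharsets.UTF_8));
      // Until close() (or an explicit hflush/hsync in newer Hadoop versions),
      // readers are not guaranteed to see the bytes written above.
    } // close() runs here: the data is now visible and the file length is final.

    System.out.println("visible length: " + fs.getFileStatus(file).getLen());
  }
}
```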
LIMITATION OF USING HDFS
This design proved rather unrealistic in practice, because it carries limitations that are mostly semantic.
The first limitation is that a file has to be closed before its data becomes visible. To make data visible promptly, applications end up creating far too many files, which is a problem for the centralized metadata storage architecture.

The second limitation is that HDFS cannot support full read-write access via NFS, because the NFS protocol has no way to invoke a close operation on the file being written to HDFS. HDFS therefore has to guess when to close the file, and if the guess is wrong the data may be lost.
A new technology was therefore needed, one that preserves the core file-system capabilities.

THE TRADITIONAL SOLUTION


The traditional solution suggested using a dual-ported disk array with RAID-6. RAID-6 uses block-level striping with two parity blocks distributed across all member disks, so any two disks in the array can fail without losing data. The dual-ported array connects two servers to the disks, and the servers use NVRAM (non-volatile RAM) to manage them: whatever is written to the disks is also copied to the replica server through the NVRAM. When the primary fails, the replica takes over as the primary. No re-sync is needed, because an up-to-date replica of the server is already available to serve the data, but this technique is not at all scalable.
This technique also has performance problems, because it follows the traditional architecture in which a SAN or NAS is connected to a database, a bunch of applications is connected to that database, and the data has to be moved to wherever the processing is done. What is really needed is for the processing to be carried out locally, next to the data, which makes data processing much faster and is exactly what HADOOP does.
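
To make the parity idea concrete, the toy sketch below computes a single XOR parity block for a stripe and rebuilds a lost block from the survivors. Real RAID-6 keeps two parity blocks per stripe (the second uses Reed-Solomon coding) so it can survive two simultaneous disk failures; that extra math is omitted here, and the code is purely illustrative.

```java
// Toy sketch of striping with a single XOR parity block.
import java.util.Arrays;

public class ParitySketch {
  // Compute the XOR parity of a stripe of equally sized data blocks.
  static byte[] parity(byte[][] blocks) {
    byte[] p = new byte[blocks[0].length];
    for (byte[] block : blocks) {
      for (int i = 0; i < p.length; i++) {
        p[i] ^= block[i];
      }
    }
    return p;
  }

  // Rebuild a missing block from the surviving blocks plus the parity block.
  static byte[] rebuild(byte[][] survivingBlocks, byte[] parityBlock) {
    byte[][] all = Arrays.copyOf(survivingBlocks, survivingBlocks.length + 1);
    all[survivingBlocks.length] = parityBlock;
    return parity(all);   // XOR of everything that remains equals the lost block
  }

  public static void main(String[] args) {
    byte[][] stripe = { "disk0".getBytes(), "disk1".getBytes(), "disk2".getBytes() };
    byte[] p = parity(stripe);

    // Simulate losing disk 1 and rebuilding its block from the rest.
    byte[][] surviving = { stripe[0], stripe[2] };
    System.out.println(new String(rebuild(surviving, p)));   // prints "disk1"
  }
}
```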



THE MAPR SOLUTION
The MapR implementation makes a file readable as soon as it is written. To solve the resynchronization problem and make the system work well on ordinary servers, it introduces a new design called the container architecture, which is reliable and also improves (that is, lengthens) the Mean Time to Data Loss (MTTDL).
CONTAINER ARCHITECTURE
MapR splits the data into many pieces called containers. These containers are replicated across the cluster, so each node holds a small part of every other node's data, scattered across the network. When a server breaks down, it is therefore easy to re-sync the dead node's data, because every surviving node can re-replicate its small share in parallel, as sketched below. For such a system with 1000 nodes in the network, the MTTDL is roughly 1000 times higher.
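
The back-of-the-envelope sketch below shows why scattering a node's replicas across the whole cluster shortens the recovery window so dramatically. The data size and bandwidth figures are assumptions made up for illustration, not measurements from this article.

```java
// Rough arithmetic: recovery time with a single mirror vs. scattered containers.
public class ResyncSketch {
  public static void main(String[] args) {
    double dataPerNodeGB = 4_000;     // data held by the failed node (assumed)
    double bandwidthGBs = 0.1;        // usable re-replication bandwidth per node (assumed)
    int clusterNodes = 1000;

    // Mirrored-pair recovery: one surviving replica node re-copies everything by itself.
    double pairRecoverySec = dataPerNodeGB / bandwidthGBs;

    // Container-style recovery: every remaining node holds a small slice of the
    // failed node's data, so the copies proceed in parallel.
    double scatteredRecoverySec = dataPerNodeGB / ((clusterNodes - 1) * bandwidthGBs);

    System.out.printf("mirrored pair: %.0f s, scattered containers: %.1f s%n",
        pairRecoverySec, scatteredRecoverySec);
    // The recovery window, and with it the exposure to a second failure that
    // would cause data loss, shrinks by roughly the cluster size; that is the
    // intuition behind the MTTDL claim above.
  }
}
```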


MAPR NFS ALLOWS DIRECT DEPOSIT OF DATA
The MapR architecture exploits its read-write file system to provide a standard NFS interface for cluster access, so data can be deposited directly into the cluster without any special connectors. Whether it is a web server, a database server or an application server, it can dump data into MapR directly at very high speed.
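
Because the cluster is exposed as an ordinary NFS mount, any program can write into it with plain file I/O. In the sketch below the mount point /mapr/my.cluster.com and the log path are hypothetical examples, not paths taken from this article.

```java
// Writing into an NFS-mounted cluster with standard Java file I/O only.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NfsDepositSketch {
  public static void main(String[] args) throws IOException {
    // A web or application server could append its log lines here directly;
    // the mount point is a hypothetical example.
    Path logFile = Paths.get("/mapr/my.cluster.com/logs/webserver.log");
    Files.createDirectories(logFile.getParent());
    Files.write(logFile,
        "GET /index.html 200\n".getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }
}
```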
NAME NODE AND ITS LIMITATIONS
The NameNode is the centerpiece of an HDFS file system. It maintains the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move or delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
The NameNode is a single point of failure for the HDFS cluster. HDFS is not currently a high-availability system: when the NameNode goes down, the file system goes offline. There is an optional Secondary NameNode that can be hosted on a separate machine, but it only creates checkpoints of the namespace; it is not a hot standby.
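
The sketch below asks the NameNode, through the standard Hadoop API, where a file's blocks live; the answer is a list of DataNodes per block, and the file contents themselves never pass through the NameNode. The path is a placeholder.

```java
// Query the NameNode for the DataNodes hosting each block of a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

    // One entry per block, listing the DataNodes that hold its replicas.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " on "
          + String.join(", ", block.getHosts()));
    }
  }
}
```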
DATA NODE
A DataNode stores data in the Hadoop File System. A functional file system
has more than one DataNode, with data replicated across them.
On start-up, a DataNode connects to the NameNode and then responds to requests from the NameNode for file system operations.
Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.


HDFS ARCHITECTURE
[Figure: overall HDFS architecture diagram]