BDA Unit-1
Working with Big Data: Google File System, Hadoop Distributed File System (HDFS) –
Building blocks of Hadoop (Namenode, Datanode, Secondary Namenode, JobTracker,
TaskTracker), Introducing and Configuring Hadoop cluster (Local, Pseudo-distributed mode,
Fully Distributed mode), Configuring XML files.
Big data is a term that describes large volumes of data, both structured and unstructured: a
collection of large data sets that cannot be processed using traditional computing techniques.
Big data is not merely data; it has become a complete subject, which involves various tools,
techniques and frameworks.
Big Data involves huge volume, high velocity, and an extensible variety of data. It is commonly
characterized by the following dimensions:
Volume: The quantity of generated and stored data. The size of the data determines the value
and potential insight, and whether it can actually be considered big data or not.
Variety: The type and nature of the data. This helps people who analyze it to effectively use the
resulting insight.
Velocity: In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.
Variability: Inconsistency of the data set can hamper processes to handle and manage it.
Veracity: The quality of captured data can vary greatly, affecting accurate analysis.
1. The Google File System:
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed
by Google for its own use. It is designed to provide efficient, reliable access to data using large
clusters of commodity hardware.
GFS is enhanced for Google's core data storage and usage needs (primarily the search engine),
which can generate enormous amounts of data that need to be retained. Google File System
grew out of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the
early days of Google, while it was still located at Stanford. Files are divided into fixed-size
chunks of 64 megabytes, similar to clusters or sectors in regular file systems, which are only
extremely rarely overwritten or shrunk; files are usually appended to or read. GFS is also
designed and optimized to run on Google's computing clusters, dense nodes which consist of
cheap "commodity" computers, which means precautions must be taken against the high failure
rate of individual nodes and the consequent data loss. Other design decisions favor high
data throughput, even at the cost of latency.
A GFS cluster consists of multiple nodes. These nodes are divided into two types:
one Master node and a large number of Chunk servers. Each file is divided into fixed-size
chunks. Chunk servers store these chunks. Each chunk is assigned a unique 64-bit label by the
master node at the time of creation, and logical mappings of files to constituent chunks are
maintained. Each chunk is replicated several times throughout the network, with a minimum of
three replicas, and even more for files that are in high demand or need more redundancy.
The Master server does not usually store the actual chunks, but rather all the metadata associated
with the chunks, such as the tables mapping the 64-bit labels to chunk locations and the files they
make up, the locations of the copies of the chunks, what processes are reading or writing to a
particular chunk, or taking a "snapshot" of a chunk in order to replicate it (usually at the
instigation of the Master server when, due to node failures, the number of copies of a chunk has
fallen beneath the set number). All this metadata is kept current by the Master server periodically
receiving updates from each chunk server ("heartbeat messages").
Permissions for modifications are handled by a system of time-limited, expiring "leases", where
the Master server grants permission to a process for a finite period of time during which no other
process will be granted permission by the Master server to modify the chunk. The modifying
chunkserver, which is always the primary chunk holder, then propagates the changes to the
chunkservers with the backup copies. The changes are not saved until all chunk servers
acknowledge, thus guaranteeing the completion and atomicity of the operation.
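The lease mechanism described above can be sketched in a few lines of Python. LeaseManager is a hypothetical name, not a real GFS interface; the 60-second duration matches the initial lease timeout used by GFS.

```python
# Hypothetical sketch of the master's time-limited lease handling: only one
# process at a time may hold the modification lease on a given chunk.

LEASE_DURATION = 60  # seconds; GFS uses an initial lease timeout of 60 s

class LeaseManager:
    def __init__(self):
        self.leases = {}  # chunk handle -> (holder, expiry time)

    def grant(self, handle, process, now):
        holder_expiry = self.leases.get(handle)
        # grant only if no lease exists for this chunk or the old one expired
        if holder_expiry is None or holder_expiry[1] <= now:
            self.leases[handle] = (process, now + LEASE_DURATION)
            return True
        return False
```

While the lease is held, any other process asking to modify the same chunk is refused until the expiry time passes.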
Programs access the chunks by first querying the Master server for the locations of the desired
chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts and receives the data from the
chunkserver directly.
Clients can be other computers or computer applications that make file requests. Requests can
range from retrieving and manipulating existing files to creating new files on the system. Clients
can be thought of as customers of the GFS.
Master Server is the coordinator for the cluster. Its tasks include:
1. Maintaining an operation log that keeps track of the activities of the cluster. The operation log
helps keep service interruptions to a minimum: if the master server crashes, a replacement server
that has monitored the operation log can take its place.
2. The master server also keeps track of metadata, which is the information that describes chunks.
The metadata tells the master server to which files the chunks belong and where they fit within
the overall file.
Chunk Servers are the workhorses of the GFS. They store 64-MB file chunks. The chunk
servers don't send chunks to the master server. Instead, they send requested chunks directly to the
client. The GFS copies every chunk multiple times and stores it on different chunk servers. Each
copy is called a replica. By default, the GFS makes three replicas per chunk, but users can
change the setting and make more or fewer replicas if desired.
Google File System READ Algorithm
1. Application originates the read request.
2. GFS client translates the request from (filename, byte range) to (filename, chunk index), and
sends it to the master.
3. Master responds with chunk handle and replica locations (i.e. chunk servers where the replicas
are stored).
4. Client picks a location and sends the (chunkhandle, byterange) request to that location.
5. Chunk server sends requested data to the client.
6. Client forwards the data to the application.
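The read flow above can be sketched as a toy Python model with in-memory stand-ins. Master, ChunkServer and gfs_read are illustrative names, not real GFS interfaces.

```python
# Toy model of the GFS READ flow. Master, ChunkServer and gfs_read are
# illustrative stand-ins, not real GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

class Master:
    """Holds only metadata: filename -> list of (chunk handle, replica servers)."""
    def __init__(self):
        self.files = {}

    def lookup(self, filename, chunk_index):
        # Step 3: return the chunk handle and the replica locations
        return self.files[filename][chunk_index]

class ChunkServer:
    """Holds the actual chunk data, keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}

    def read(self, handle, start, length):
        # Step 5: serve the requested byte range of the chunk
        return self.chunks[handle][start:start + length]

def gfs_read(master, filename, offset, length):
    # Step 2: translate (filename, byte range) -> (filename, chunk index)
    chunk_index = offset // CHUNK_SIZE
    handle, locations = master.lookup(filename, chunk_index)
    # Step 4: pick a replica (here simply the first) and contact it directly
    server = locations[0]
    # Steps 5-6: data flows from the chunk server straight to the client
    return server.read(handle, offset % CHUNK_SIZE, length)
```

Note that the master is consulted only for metadata; the bulk data transfer happens directly between client and chunk server.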
Google File System WRITE Algorithm
1. Application originates write request.
2. GFS client translates the request from (filename, data) to (filename, chunk index), and sends
it to the master.
3. Master responds with chunk handle and (primary + secondary) replica locations.
4. Client pushes write data to all locations. Data is stored in chunk servers’ internal buffers.
5. Client sends write command to primary.
6. Primary determines serial order for data instances stored in its buffer and writes the instances
in that order to the chunk.
7. Primary sends serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to client.
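A minimal sketch of this write path, again with hypothetical names: every replica buffers the pushed data (step 4), and only the serial order chosen by the primary determines what is committed (steps 6 and 7), which is how all replicas end up identical.

```python
# Toy model of the GFS WRITE flow: every replica buffers the pushed data,
# and the serial order chosen by the primary decides what gets committed.
# Replica and gfs_write are illustrative names, not real GFS interfaces.

class Replica:
    def __init__(self):
        self.buffer = []   # step 4: data pushed by the client, not yet applied
        self.chunk = b""   # committed chunk contents

    def push(self, data):
        self.buffer.append(data)

    def apply(self, order):
        # steps 6-7: append buffered writes in the given serial order
        for i in order:
            self.chunk += self.buffer[i]
        self.buffer = []

def gfs_write(primary, secondaries, writes):
    # Step 4: client pushes the data to all replicas (primary + secondaries)
    for data in writes:
        for replica in [primary] + secondaries:
            replica.push(data)
    # Step 6: the primary decides one serial order for the buffered writes
    order = list(range(len(writes)))
    primary.apply(order)
    # Step 7: the same order is sent to the secondaries, so all copies agree
    for s in secondaries:
        s.apply(order)
    # Steps 8-9: acknowledgements flow back to the client (elided here)
    return primary.chunk
```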
What Is Hadoop
Apache Hadoop is a framework that allows for the distributed processing of large data
sets across clusters of commodity computers using a simple programming model.
It is an Open-source Data Management with scale-out storage & distributed processing.
Key features – Why Hadoop?
1. Flexible:
Since only about 20% of the data in organizations is structured and the rest is unstructured, it is
crucial to manage the unstructured data that would otherwise go unattended.
Hadoop is at the core of managing different types of Big Data, whether structured or unstructured,
encoded or formatted, or any other type of data, and makes it useful for the decision-making process.
Moreover, Hadoop is simple, relevant and schema-less. Although Hadoop is written in Java and
generally supports Java programming, MapReduce programs can also be written in other
programming languages.
Hadoop works best on Linux, but it can also run on other operating systems such as Windows,
BSD and OS X.
2. Scalable
Hadoop is a scalable platform, in the sense that new nodes can be easily added in the system as
and when required without altering the data formats, how data is loaded, how programs are
written, or even without modifying the existing applications.
Hadoop is a totally open source platform and runs on industry-standard hardware. Moreover,
Hadoop is also fault tolerant – this means, even if a node gets lost or goes out of service, the
system automatically reallocates work to another location of the data and continues processing as
if nothing had happened!
3. Robust Ecosystem:
Hadoop has a robust and rich ecosystem that is well suited to meet the analytical needs of
developers, web startups and other organizations. The Hadoop Ecosystem consists of various related
projects such as MapReduce, Hive, HBase, Zookeeper, HCatalog and Apache Pig, which make
Hadoop very competent to deliver a broad spectrum of services.
4. Cost Effective:
The basic idea behind Hadoop is to perform cost-effective analysis of the data spread across the
world wide web.
Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service on top
of a cluster of computers, each of which may be prone to failures.
Hive:
Apache Hive data warehouse software facilitates querying and managing large datasets residing
in distributed storage. Hive provides a mechanism to project structure onto this data and query
the data using a SQL-like language called HiveQL. At the same time this language also allows
traditional map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
Pig:
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turns enables them to handle very large data sets.
Flume:
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible architecture
based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms
and many failover and recovery mechanisms. It uses a simple extensible data model that allows
for online analytic applications.
Sqoop:
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. You
can use Sqoop to import data from a relational database management system (RDBMS) such as
MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in
Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of
this process, relying on the database to describe the schema for the data to be imported. Sqoop
uses MapReduce to import and export the data, which provides parallel operation as well as fault
tolerance
Oozie:
Apache Oozie is an open source project that simplifies the process of creating workflows and
managing coordination among jobs. In principle, Oozie offers the ability to combine multiple
jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and
supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. In addition, it can be used to
schedule jobs specific to a system, such as Java programs. Therefore, using Oozie, Hadoop
administrators are able to build complex data transformations that can combine the processing of
different individual tasks and even sub-workflows.
1. NameNode
This daemon runs on Master. Namenode stores all the metadata like filename, path,
number of blocks, blockIds, block locations, number of replicas, slave related
configuration, etc.
The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories.
The Namenode acts like a Bookkeeper
The NameNode periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly.
Blockreport contains a list of all blocks on a DataNode.
If heartbeats from a DataNode are missed, the NameNode marks that DataNode as dead
and arranges for its blocks to be re-replicated on other DataNodes.
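The heartbeat and blockreport bookkeeping can be sketched as follows. NameNodeMonitor is a hypothetical class, not a Hadoop API, and the timeout is shortened for the example (real HDFS waits on the order of ten minutes before declaring a DataNode dead).

```python
# Illustrative sketch of the NameNode's heartbeat and blockreport
# bookkeeping; names and the timeout value are assumptions for this example.

HEARTBEAT_TIMEOUT = 30  # seconds (illustrative value)

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}  # datanode -> time of last heartbeat
        self.block_reports = {}   # datanode -> set of block ids it stores

    def heartbeat(self, node, now):
        # receipt of a heartbeat implies the DataNode is functioning
        self.last_heartbeat[node] = now

    def block_report(self, node, blocks):
        # a blockreport lists all blocks currently held on the DataNode
        self.block_reports[node] = set(blocks)

    def dead_nodes(self, now):
        return [n for n, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def under_replicated(self, now):
        # blocks that lived on a dead node but still have a live replica
        dead = set(self.dead_nodes(now))
        on_dead, on_live = set(), set()
        for node, blocks in self.block_reports.items():
            (on_dead if node in dead else on_live).update(blocks)
        return on_dead & on_live
```

Blocks returned by under_replicated still have a surviving copy that the NameNode can schedule for copying to another DataNode.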
2. DataNode
This daemon runs on all the slaves and stores the actual data. The namenode corresponds
to the master machine and the datanodes to the slave machines.
The NameNode determines the mapping of blocks to DataNodes; the DataNodes are responsible
for serving read and write requests from the file system’s clients.
Clients communicate directly with a datanode to process the local files corresponding to
the blocks.
A datanode may communicate with other datanodes to replicate its data blocks for
redundancy.
Datanodes constantly report to the namenode, informing it of the blocks they are
currently storing.
3. Secondary Namenode:
The NameNode keeps all filesystem metadata in RAM and records changes in an edit log on
disk, but it does not itself merge this log back into its on-disk filesystem image. The Secondary
NameNode contacts the NameNode periodically (by default every hour) and pulls a copy of the
metadata out of the NameNode. It merges this information into a clean checkpoint file and sends
it back to the NameNode, while keeping a copy for itself. Hence the Secondary NameNode is not
a backup; rather, it does the job of housekeeping.
4. Job Tracker:
This daemon runs on the master and coordinates the parallel processing of data using
MapReduce: it accepts jobs from clients, breaks them into map and reduce tasks, and assigns
those tasks to the TaskTrackers.
5. Task Tracker:
This daemon runs on each slave, executes the map and reduce tasks assigned to it by the
JobTracker, and periodically reports progress back to the JobTracker.
The blocks of a file are replicated for fault tolerance across the cluster. The NameNode
makes all decisions regarding replication of blocks. Replication is nothing but keeping the same
blocks of data on different nodes.
The replication factor is the number of copies kept of every single block of data. All the
data blocks are replicated across the cluster of nodes; ideally, the replicas are spread across
different nodes and racks.
The default replication factor is 3, which can be changed according to the requirements
by editing the configuration files.
An application can specify the number of replicas of a file. The replication factor can be
specified at file creation time and can be changed later. Files in HDFS are write-once and have
strictly one writer at any time.
Hadoop HDFS data read and write operations
The following steps are involved in reading the file from HDFS:
Let’s suppose a Client (a HDFS Client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which for HDFS is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure Call), to
determine the locations of the blocks for the first few blocks of the file. For each block, the
namenode returns the addresses of all the datanodes that have a copy of that block. Client will
interact with the respective datanodes to read the file. The namenode also provides a token to the
client, which the client shows to the datanode for authentication.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first closest datanode
for the first block in the file.
Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on
the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the
datanode, then find the best datanode for the next block. This happens transparently to the client,
which from its point of view is just reading a continuous stream.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections to
datanodes as the client reads through the stream. It will also call the namenode to retrieve the
datanode locations for the next batch of blocks as needed. When the client has finished reading,
it calls close() on the FSDataInputStream.
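The six steps can be modeled with a small Python sketch. NameNode, DataNode and hdfs_read here are toy in-memory stand-ins, not the real Java classes (DistributedFileSystem, DFSInputStream) mentioned in the steps.

```python
# Toy model of the HDFS read path: the namenode supplies block locations,
# and the client then reads each block directly from a datanode.

class NameNode:
    def __init__(self):
        self.block_map = {}  # path -> [(block id, [datanodes holding it]), ...]

    def get_block_locations(self, path):
        # Step 2: per block, return the datanodes that hold a replica
        return self.block_map[path]

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes

    def read_block(self, block_id):
        # Step 4: stream the block contents back to the client
        return self.blocks[block_id]

def hdfs_read(namenode, path):
    data = b""
    # Steps 3-6: read the blocks in order; a real client would pick the
    # closest replica, here we simply take the first one listed.
    for block_id, datanodes in namenode.get_block_locations(path):
        data += datanodes[0].read_block(block_id)
    return data
```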
NameNode:
NameNode does NOT store the files but only the file's metadata. NameNode oversees the health
of DataNode and coordinates access to the data stored in DataNode.
The Name node keeps track of all file-system-related information, such as:
Which section of a file is saved in which part of the cluster
Last access time for the files
User permissions, i.e. which users have access to the file
JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.
Secondary Node is NOT the backup or high availability node for Name node.
The job of Secondary Node is to contact NameNode in a periodic manner after certain time
interval (by default 1 hour).
As described above, it pulls a copy of the metadata from the NameNode, merges it into a clean
checkpoint, and sends it back while keeping a copy for itself.
In case of NameNode failure, the NameNode can be rebuilt from this saved metadata.
Slaves:
Slave nodes are the majority of machines in Hadoop Cluster and are responsible to
Store the data
Process the computation
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their
masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a
slave to the NameNode.
Configuring Hadoop cluster (Local, Pseudo-distributed mode, Fully Distributed mode)
Standalone Mode
hadoop-env.sh
hadoop-env.sh specifies environment variables that affect the JDK used by the Hadoop daemons
(bin/hadoop). As the Hadoop framework is written in Java and uses the Java Runtime
Environment, one of the important environment variables for the Hadoop daemons is
$JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java
installation on the system.
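For example, a minimal hadoop-env.sh might set the following (the paths below are placeholders; use the locations on your own system):

```shell
# hadoop-env.sh -- sample settings (paths are placeholders)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # JDK used by the daemons
export HADOOP_HEAPSIZE=1000                          # daemon heap size, in MB
export HADOOP_LOG_DIR=/var/log/hadoop                # where log files go
```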
This file is also used for setting other aspects of the Hadoop daemon execution environment,
such as heap size (HADOOP_HEAPSIZE), Hadoop home (HADOOP_HOME), and log file
location (HADOOP_LOG_DIR), etc.
Note: For the simplicity of understanding the cluster setup, we have configured only necessary
parameters to start a cluster.
The following three files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
core-site.xml
This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the
configuration settings for Hadoop Core, such as I/O settings that are common
to HDFS and MapReduce.
The hostname and port configured here are the machine and port on which the NameNode
daemon runs and listens; this also tells the NameNode which IP and port it should bind to. The
commonly used port is 8020, and an IP address may be specified rather than a hostname.
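A minimal core-site.xml might look like this, where the hostname is a placeholder for your NameNode machine:

```xml
<?xml version="1.0"?>
<!-- core-site.xml: location of the NameNode (hostname is a placeholder) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```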
hdfs-site.xml
This file contains the configuration settings for the HDFS daemons: the NameNode, the
Secondary NameNode, and the DataNodes.
You can also configure hdfs-site.xml to specify default block replication and permission
checking on HDFS. The actual number of replications can also be specified when a file is
created; the default is used if replication is not specified at create time.
The value “true” for the property ‘dfs.permissions’ enables permission checking in HDFS and
the value “false” turns off permission checking. Switching from one value to the other does not
change the mode, owner or group of files or directories.
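For instance, a sketch of hdfs-site.xml setting both of the parameters discussed above:

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml: default replication and permission checking -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>    <!-- default number of replicas per block -->
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value> <!-- enable permission checking -->
  </property>
</configuration>
```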
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the
TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on
which the JobTracker listens for RPC communication. This parameter specifies the location of
the JobTracker to the TaskTrackers and MapReduce clients.
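A corresponding mapred-site.xml sketch, where the hostname and port are placeholders (8021 is a commonly used JobTracker port):

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml: location of the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:8021</value>
  </property>
</configuration>
```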
You can replicate all four of the files explained above to all the Data Nodes and the Secondary
Namenode. These files can then be adjusted for any node-specific configuration, e.g. in case of
a different JAVA_HOME on one of the Datanodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in the
Hadoop cluster.
Masters
This file informs the Hadoop daemons of the Secondary Namenode's location. The ‘masters’ file
on the Master server contains the hostname of the Secondary NameNode server.
Slaves
The ‘slaves’ file at Master node contains a list of hosts, one per line, that are to host Data Node
and Task Tracker servers.
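With hypothetical hostnames, the two files might contain the following.

The ‘masters’ file (one line, the Secondary NameNode host):

```
secondary-namenode-host
```

The ‘slaves’ file (one DataNode/TaskTracker host per line):

```
slave1
slave2
slave3
```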