
Module 2

Introduction to Hadoop
Big Data Programming Model
• One programming model is centralized computing, in which the data is transferred from multiple distributed data sources to a central server.

• Analyzing, reporting, visualizing and business intelligence tasks compute centrally. Data are inputs to the central server.

• Another programming model is distributed computing, which uses the databases at multiple computing nodes with data sharing between the nodes during computation.

• Distributed computing in this model requires cooperation (sharing) between the DBs in a transparent manner.
• Transparent means that each user within the system
may access all the data within all databases as if they
were a single database.

• A second requirement is location independence.

• Analysis results should be independent of geographical locations.

• The access of one computing node to other nodes may fail due to a single link failure.

• Distributed pieces of code, as well as the data, reside at the computing nodes.

• Transparency between data nodes at the computing nodes is not fulfilled for Big Data when distributed computing takes place using data sharing between local and remote nodes.
• Following are the reasons for this:
▫ Distributed data storage systems do not use the concept of joins.
▫ Data need to be fault-tolerant, and data stores should take into account the possibility of network failure. When data are partitioned into data blocks and written at one set of nodes, those blocks need replication at multiple nodes. This takes care of possible network faults: when a network fault occurs, a replicated node makes the data available.

▫ Big Data follows a theorem known as the CAP theorem. CAP states that of the three properties (consistency, availability and partition tolerance), a distributed data store can guarantee at most two at the same time.
i. Big Data Store Model
• A model for Big Data store is as follows:
▫ Data store in file system consisting of data blocks
(physical division of data).
▫ The data blocks are distributed across multiple nodes.
▫ Data nodes are at the racks of a cluster. Racks are
scalable.
• A Rack has multiple data nodes (data servers), and
each cluster is arranged in a number of racks.

• Data store model of files: data nodes in racks in the clusters. The Hadoop system uses the data store model in which storage is at clusters, racks, data nodes and data blocks.

• Data blocks replicate at the DataNodes such that a failure of a link leads to access of the data block from other nodes where it is replicated, at the same or other racks.
ii. Big Data Programming Model
• The Big Data programming model is one in which application jobs and tasks (or sub-tasks) are scheduled on the same servers which store the data for processing.
Hadoop and its ecosystem
• Hadoop is a computing environment in which input data is stored, processed, and the results are stored.
• The environment consists of clusters which are distributed at the cloud or on a set of servers.
• Each cluster consists of a string of data files constituting data blocks.
• The name Hadoop comes from a stuffed toy elephant.

• The Hadoop system cluster stores files in data blocks.

• The complete system consists of a scalable distributed set of clusters.
• Infrastructure consists of cloud for clusters.

• A cluster consists of sets of computers or PCs.

• The Hadoop platform provides a low-cost Big Data platform, which is open source and uses cloud services.

• Processing terabytes of data takes just a few minutes.

Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce.

• The system characteristics are scalability, self-manageability, self-healing and a distributed file system.
• Scalable means can be scaled up (enhanced) by adding
storage and processing units as per the requirements.

• Self-manageable means creation of storage and processing resources which are used, scheduled and reduced or increased with the help of the system itself.

• Self-healing means that in case of faults, they are taken care of by the system itself. Self-healing enables functioning and resource availability. Software detects and handles failures at the task level.

• Software enables the service or task execution even in case of communication or node failure.
Hadoop Core Components
• The Hadoop core components of the framework are:
• Hadoop Common - The common module contains the
libraries and utilities that are required by the other modules
of Hadoop.

• For example, Hadoop Common provides various components and interfaces for distributed file systems and general input/output.

• This includes serialization, Java RPC (Remote Procedure Call) and file-based data structures.
• Hadoop Distributed File System (HDFS) - A Java-based distributed file system which can store all kinds of data on the disks at the clusters.
• MapReduce v1 - Software programming model in Hadoop 1 using Mapper and Reducer. The v1 model processes large sets of data in parallel and in batches.

• YARN - Software for managing the resources for computing. The user application tasks or sub-tasks run in parallel at Hadoop; YARN uses scheduling and handles the requests for resources in distributed running of the tasks.

• MapReduce v2 - Hadoop 2 YARN-based system for parallel processing of large datasets and distributed processing of the application tasks.
Features of Hadoop
• Hadoop features are as follows:
1. Fault-efficient, scalable, flexible and modular design: uses a simple and modular programming model. The system provides servers at high scalability. The system is scalable by adding new nodes to handle larger data. Hadoop proves very helpful in storing, managing, processing and analyzing Big Data.
2. Robust design of HDFS: Execution of Big Data applications continues even when an individual server or cluster fails. This is because of Hadoop's provisions for backup (due to replication at least three times for each data block) and a data recovery mechanism. HDFS thus has high reliability.
3. Store and process Big Data: Processes Big Data
of 3V characteristics.

4. Distributed cluster computing model with data locality: Processes Big Data at high speed as the application tasks and sub-tasks are submitted to the DataNodes. One can achieve more computing power by increasing the number of computing nodes. The processing splits across multiple DataNodes (servers), which gives fast processing and aggregated results.
• 5. Hardware fault-tolerant: A fault does not affect data and application processing. If a node goes down, the other nodes take over the remaining work. This is due to multiple copies of all data blocks which replicate automatically. The default is three copies of each data block.

• 6. Open-source framework: Open-source access and cloud services enable large data stores. Hadoop uses a cluster of multiple inexpensive servers or the cloud.

• 7. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux, but it has its own set of supported shell commands.
Hadoop Ecosystem Components
• The four layers are as follows:
• (i) Distributed storage layer.
(ii) Resource-manager layer for job or application sub-task scheduling and execution.
(iii) Processing-framework layer, consisting of Mapper and Reducer for the MapReduce process-flow.
(iv) APIs at the application support layer (applications such as Hive and Pig).

The codes communicate and run using MapReduce or YARN at the processing-framework layer. Reducer output communicates to the APIs.
• YARN stands for Yet Another Resource
Negotiator, but it's commonly referred to by the
acronym alone.

• It is one of the core components in open source Apache Hadoop, suitable for resource management. It is responsible for managing workloads, monitoring, and security controls implementation.

• It also allocates system resources to the various applications running in a Hadoop cluster while assigning which tasks should be executed by each cluster node.
• Apache Spark is an essential product from the Apache Software Foundation, and it is considered a powerful data processing engine. The growth of large amounts of unstructured data and the increased need for speed to fulfill real-time analytics led to the invention of Apache Spark.

• Apache Hive is an open-source data warehouse software built on Apache Hadoop for performing data query and analysis. Hive mainly does three functions: data summarization, query, and analysis. Hive uses a language called HiveQL (HQL), which is similar to SQL.
• HBase is a column-based NoSQL database. It runs on top of HDFS and can handle any type of data. It allows real-time processing and random read/write operations to be performed on the data.

• Apache Pig is a very high-level tool or platform used for processing large data sets. The data analysis codes are developed with the use of a high-level scripting language called Pig Latin.
• Hive is a distributed data warehouse system
developed by Facebook. It allows for easy reading,
writing, and managing files on HDFS. It has its own
querying language for the purpose known as Hive
Querying Language (HQL) which is very similar to
SQL. This makes it very easy for programmers to write
MapReduce functions using simple HQL queries.
• Flume is an open-source, reliable, service used to
efficiently collect, aggregate, and move large amounts of
data from multiple data sources into HDFS. It can collect
data in real-time as well as in batch mode.

• Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.

• Mahout is renowned for machine learning. Mahout provides an environment for creating scalable machine learning applications. It allows you to apply the concept of machine learning via a selection of Mahout algorithms to distributed computing via Hadoop.
• AVRO enables data serialization between the layers.

• Zookeeper enables coordination among layer components.

• The holistic view of the Hadoop architecture provides an idea of the implementation of the Hadoop ecosystem components.

• Client hosts run applications using Hadoop ecosystem projects, such as Pig, Hive and Mahout.
Apache Ambari
• Apache Ambari is an open-source administration
tool deployed on top of Hadoop clusters, and it is
responsible for keeping track of the running
applications and their status.

• Apache Ambari can be referred to as a web-based management tool that manages, monitors, and provisions the health of Hadoop clusters.
HADOOP DISTRIBUTED FILE SYSTEM
• HDFS is a core component of Hadoop. HDFS is
designed to run on a cluster of computers and servers
at cloud-based utility services.

• HDFS stores Big Data which may range from GBs (1 GB = 2^30 B) to PBs (1 PB = 10^15 B, nearly 2^50 B). HDFS stores the data in a distributed manner in order to compute fast.

• The distributed data store in HDFS stores data in any format regardless of schema.
In Hadoop, data is stored inside a number of clusters.

Each cluster has a number of data stores called racks.

Each rack stores a number of data nodes.

Each data node has a large number of data blocks.

A file containing data divides into data blocks.

A data block's default size is 64 MB.
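As a worked illustration (assuming the default 64 MB block size and a replication factor of 3): a 200 MB file divides into four blocks of 64 MB, 64 MB, 64 MB and 8 MB, so HDFS stores 4 x 3 = 12 block replicas spread across the DataNodes of the cluster.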


Hadoop HDFS features are as follows
• i. Create, append, delete, rename and attribute
modification functions.

• ii. Content of an individual file cannot be modified or replaced, but it can be appended with new data at the end of the file.

• iii. Write Once Read Many (WORM) during usages and processing.

• iv. Average file size can be more than 500 MB.

• v. Converged data storage and processing happen on the same server nodes.

• vi. Moving computation is cheaper than moving data.

• HDFS consists of NameNodes and DataNodes.

• The NameNode stores the file metadata. Metadata gives information about the files of user applications.

• The DataNodes store the actual data files in the data blocks.
HDFS components
• In a larger cluster, HDFS is managed through a NameNode server that hosts the file system index, and a secondary NameNode that keeps snapshots of the NameNode.

• At the time of a NameNode failure, these snapshots are used to restore the primary NameNode, thus preventing the file system from getting corrupted and reducing data loss.

• The NameNode actively monitors the number of replicas of a block.

• When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
• The NameNode stores all the file system related information:
▫ Which part of the cluster each file section is stored in.
▫ Last access time of the files.
▫ User permissions, such as which user has access to the file.
• The secondary NameNode takes snapshots of the primary NameNode directory information after a regular interval of time, which are saved in local or remote directories.

• These checkpoint images can be used in place of the primary NameNode to restart a failed primary NameNode without having to replay the entire journal of file system actions and then edit the log to create an up-to-date directory structure.

• The NameNode is the single point of storage and management of the HDFS metadata.

• The NameNode maintains two disk files, FsImage and EditLogs, that track changes to the metadata:
▫ FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of the HDFS, with details about the location of the data in the data blocks and which blocks are stored on which node.

▫ When a NameNode starts up, it reads the HDFS state from an image file, fsimage, and then applies edits from the edits log file.

▫ EditLogs is a transaction log that records the changes in the HDFS file system or any action performed on the HDFS cluster, such as addition of a new block, replication, deletion, etc. It records the changes since the last FsImage was created; the changes are then merged into the FsImage file to create a new FsImage file.
• The majority of nodes in a Hadoop cluster act as DataNodes and TaskTrackers.

• These nodes are referred to as slave nodes or slaves.

• These slaves are responsible for storing data and processing the computation tasks submitted by clients.
• Let us assume that a DataNode in the cluster goes down while processing is going on; the NameNode should then know that some DataNode is down in the cluster, otherwise it cannot continue processing.

• Each DataNode periodically sends a heartbeat signal to the NameNode to make the NameNode aware of the active/inactive status of DataNodes.

• This system is called the 'Heartbeat mechanism'.


HDFS block replication
• When HDFS writes a file, it is replicated across the cluster.
• The replication factor is based on the value of dfs.replication in the hdfs-site.xml file.

• The default value can be overridden per file with the hdfs dfs -setrep command.
• If the Hadoop cluster contains more than 8 DataNodes, the replication value is usually set to 3.
• If there are 8 or fewer DataNodes, the replication factor can be set to 2.
HDFS Safe mode
• When the NameNode starts, it enters a read-only safe mode where blocks cannot be replicated or deleted.

• Safe mode enables the NameNode to perform two important processes:
1. The previous file system state is reconstructed by loading the fsimage file into memory and replaying the edit log.

2. The mapping between blocks and DataNodes is created by waiting for enough of the DataNodes to register so that at least one copy of the data is available.
• HDFS may also enter safe mode for maintenance using the command:
▫ hdfs dfsadmin -safemode enter
Rack Awareness
• Rack awareness deals with data locality.

• A major design goal of Hadoop MapReduce is to move the computation to the data.

• A typical Hadoop cluster exhibits three levels of data locality:
1. Data resides on the local machine.
2. Data resides in the same rack.
3. Data resides in a different rack.
NameNode high availability
• The NameNode was a single point of failure that could bring down the entire Hadoop cluster.

• The solution was to implement a High Availability (HA) NameNode as a means to provide true failover service.
• An HA Hadoop cluster has two separate NameNode machines.
• Each machine is configured with exactly the same software.
• One of the NameNode machines is in the active state and the other is in the standby state.

• Like a single NameNode, the active node is responsible for all HDFS operations in the cluster.

• The standby NameNode maintains enough state to provide a fast failover.

• To guarantee that the file system state is preserved, both the active and standby NameNodes receive block reports from the DataNodes.
• The active node sends all file system edits to a quorum of Journal Nodes.

• At least 3 physically separate Journal Node daemons are required, because edit log modifications must be written to a majority of the Journal Nodes.

• The standby node continuously reads the edits from the Journal Nodes to ensure its namespace is synchronized with that of the active NameNode.

• In the event of an active NameNode failure, the standby node reads all remaining edits from the Journal Nodes before promoting itself to the active state.
• During failover, the NameNode that is chosen to become active takes over the role of writing to the Journal Nodes.

• The secondary NameNode is not required in the HA configuration, because the standby node also performs the tasks of the secondary NameNode.
HDFS commands
• 1)To use the HDFS commands, first you need to start
the Hadoop services using the following command:
sbin/start-all.sh
• 2)To check the Hadoop services are up and running
use the following command:
jps
• jps is a command that is used to check all the Hadoop daemons, like DataNode, NodeManager, NameNode, and ResourceManager, that are currently running on the machine.
3)Mkdir
command used to create a directory.
Syntax
hdfs dfs -mkdir <folder name>
Example:hdfs dfs -mkdir /geeks
4) copyFromLocal (or) put:
To copy files/folders from local file system to hdfs store.
This is the most important command. Local filesystem
means the files present on the OS.
syntax
• hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

• Example:bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks


• 5) cp: This command is used to copy files within HDFS.
Let's copy folder geeks to geeks_copied.
Syntax:
hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
• Example: hdfs dfs -cp /geeks /geeks_copied
6) ls: This command is used to list all the files. It shows
the name, permissions, owner, size, and modification
date for each file or directory in the specified directory.
Syntax:
hdfs dfs -ls <path>
Example: hdfs dfs -ls /   (lists the files in the root directory)
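The same file operations can also be performed programmatically through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The sketch below is a minimal, hedged illustration; the class name HdfsShellEquivalents and the local file AI.txt are assumptions, and the /geeks paths simply reuse the examples above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    fs.mkdirs(new Path("/geeks"));                  // hdfs dfs -mkdir /geeks
    fs.copyFromLocalFile(new Path("AI.txt"),        // hdfs dfs -copyFromLocal AI.txt /geeks
                         new Path("/geeks/AI.txt"));

    for (FileStatus status : fs.listStatus(new Path("/"))) {   // hdfs dfs -ls /
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}

When run with the Hadoop client libraries on the classpath (for example via hadoop jar), the Configuration object reads the cluster settings, so the program addresses the same file system as the shell commands.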
Map reduce programming model
• A MapReduce program can be written in many languages, including Java, C++ (Pipes) or Python.
• The input data is in the form of a file or directory and is stored in HDFS.
• The MapReduce program performs two jobs on the input data: a Map job and a Reduce job.
• The Map job takes a sequence of data and converts it into another set of data.
• The elements are broken into key-value pairs in the resultant data.
• The Reduce job takes the output from a map as input and combines the data tuples into a smaller set of tuples.
• Map and Reduce functions always run in isolation from one another.

• The Reduce job is always done after the Map job.
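A minimal word-count sketch of this model, using the standard Hadoop MapReduce Java API, is given below. The class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are illustrative assumptions: the Map job emits (word, 1) key-value pairs and the Reduce job combines them into a smaller set of (word, total) tuples.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map job: break each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce job: aggregate the mapped pairs into a smaller set of (word, total) tuples.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It could be run, for example, as: hadoop jar wordcount.jar WordCount /geeks/AI.txt /geeks/wc_out (jar name and paths assumed).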


MAPREDUCE FRAMEWORK AND
PROGRAMMING MODEL
• MapReduce is a processing technique and a programming model for distributed computing based on Java.

• Mapper is a function or task which is used to process all input records from a file and generate the output, which works as input for the Reducer.

• Reducer means software for reducing the mapped data by using an aggregation, query or user-specified function.

• The Reducer provides a concise, cohesive response for the application.
• Aggregation function means a function that groups the values of multiple rows together to produce a single value of more significant meaning or measurement.

• Examples are functions such as count, sum, maximum, minimum, deviation and standard deviation.

• Querying function means a function that finds the desired values. For example, a function for finding the best student of a class who has shown the best performance in an examination.
• MapReduce allows writing applications to reliably process huge amounts of data, in parallel, on large clusters of servers. The cluster size as such does not limit parallel processing.

• The parallel programs of MapReduce are useful for performing large-scale data analysis using multiple machines in the cluster.
Features of the MapReduce framework are as follows:

• Provides automatic parallelization and distribution of computation based on several processors.

• Processes data stored on distributed clusters of DataNodes and racks.

• Allows processing a large amount of data in parallel.

• Provides scalability for the usage of a large number of servers.

• Provides the MapReduce batch-oriented programming model in Hadoop version 1.

• Provides additional processing modes in the Hadoop 2 YARN-based system and enables the required parallel processing, for example for queries, graph databases, streaming data, messages, real-time OLAP and ad hoc analytics.
Hadoop MapReduce Framework
• MapReduce provides two important functions.

• One function is the distribution of a job, based on the client application, to various nodes within a cluster.

• The second function is organizing and reducing the results from each node into a cohesive response to the application.

• The processing tasks are submitted to Hadoop.

• MapReduce runs as per the job assigned by the JobTracker, which keeps track of the jobs submitted for execution and runs TaskTrackers for tracking the tasks.

• MapReduce programming enables job scheduling and task execution as follows:
• A client node submits a request of an application to
the job tracker .
• A jobtracker is a hadoop daemon.
• The following are the steps taken on a request to MapReduce:
(i) Estimate the need of resources for processing the request.

(ii) Analyse the states of the slave nodes.

(iii) Place the mapping task in a queue.

(iv) Monitor the progress of the task, and on failure, restart the task on available slots of time.
The job execution is controlled by two types of processes in MapReduce:
1. The Mapper deploys map tasks on slots. Map tasks are assigned to those nodes where the data for the application is stored. The Reducer output transfers to the client node after data serialization using AVRO.

2. The Hadoop system sends the Map and Reduce jobs to the appropriate servers in the cluster. The Hadoop framework in turn manages the task of issuing jobs, job completion and copying data around the cluster between the slave nodes.
• Finally, the cluster collects and reduces the data to obtain the results and sends them back to the Hadoop server after completion of the given tasks.
• The job execution is controlled by two types of processes:
(i) JobTracker - it coordinates all jobs running on the cluster and assigns map and reduce tasks to run on TaskTrackers.

(ii) A number of subordinate processes called TaskTrackers.
• These processes run the assigned tasks and periodically report the progress to the JobTracker.

• The JobTracker schedules jobs submitted by clients, keeps track of the TaskTrackers and maintains the available map and reduce slots.
• The JobTracker also monitors the execution of jobs and tasks on the cluster.

• The TaskTracker executes the map and reduce tasks and reports to the JobTracker.
Hadoop YARN
• The most serious limitations of classical MapReduce are primarily related to scalability, resource utilization, and the support of workloads different from MapReduce.

• In order to address these shortcomings and provide more flexibility, efficiency and a performance boost, a new functionality was developed.
• YARN (Yet Another Resource Negotiator) is a core Hadoop service that supports two major services:

▫ Global resource management (ResourceManager)

▫ Per-application management (ApplicationMaster)
• The Apache YARN framework consists of a master known as the "ResourceManager", slaves called "NodeManagers" (one per slave node) and an ApplicationMaster (one per application).
Resource Manager (RM)
• It is the master daemon of Yarn.
• RM manages the global assignments of resources
(CPU and memory) among all the applications.
• When a client needs to run an application on a node
RM arbitrates system resources between competing
applications.
• Resource Manager has two Main components

▫ Scheduler
▫ Application manager
a) Scheduler
• The scheduler is responsible for allocating the
resources to the running application.

b) Application Manager
It manages running Application Masters in the cluster,
i.e., it is responsible for starting application masters and
for monitoring and restarting them on different nodes
in case of failures.
Node Manager (NM)

• It is the slave daemon of YARN.

• The NM is responsible for containers, monitoring their resource usage and reporting the same to the ResourceManager.

• It manages the user processes on that machine.

• The YARN NodeManager also tracks the health of the node on which it is running.
A container is a piece of CPU and memory (typically a JVM).

Application Master (AM)

• When a free container is found, the ApplicationMaster will be assigned to that node.

• One ApplicationMaster runs per application.

• It negotiates resources from the ResourceManager and works with the NodeManager.
Hadoop Ecosystem Tools
The major Hadoop ecosystem tools are:
• Zookeeper
• Oozie
• Flume
• Ambari
• Hbase
• Hive
• Pig
• Mahout
Zookeeper
• A distributed system requires designing and developing coordination services.

• Apache ZooKeeper is a coordination service that enables synchronization across a cluster in a distributed application.
• The coordination service manages the jobs in the cluster.
• Since multiple machines are involved, race conditions and deadlocks are common problems when running a distributed application.
• ZooKeeper behaves as a centralized repository where a distributed application can write data at a node called a znode and read the data out of it.

• ZooKeeper provides synchronization, serialization, and coordination activities.
ZooKeeper's main activities are:
• Name service - a name service maps a name to the information associated with that name.

• It keeps track of which servers or services are up and running, and looks up their status by name in the name service.
• Concurrency control - concurrent access to shared resources may cause inconsistency of the resources.

• A concurrency control algorithm controls concurrent access to shared resources in the distributed system.
• Configuration management - a requirement of a distributed system is a central configuration manager. A new joining node can pick up the up-to-date centralized configuration.

• Failure - distributed systems are susceptible to node failure. This requires an automatic recovery strategy that selects some alternate node for processing.
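As a minimal sketch of the centralized-repository idea above (the connect string localhost:2181, the znode path /app/config and the stored configuration value are assumptions), one client writes configuration data to a znode and any newly joining node reads it back using the ZooKeeper Java API:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (session timeout in milliseconds).
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

    // Write: publish a configuration value under a persistent znode.
    if (zk.exists("/app", false) == null) {
      zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    if (zk.exists("/app/config", false) == null) {
      zk.create("/app/config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read: a newly joining node can pick up the same up-to-date configuration.
    byte[] data = zk.getData("/app/config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}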
OOZIE
• Apache Oozie is an open-source Apache project that schedules Hadoop jobs.

• Analysis of Big Data requires the creation of multiple jobs and sub-tasks in a process.

• The Oozie design provisions for the scalable processing of multiple jobs.

• Thus Oozie provides a way to package and bundle multiple coordinator and workflow jobs and manage the life cycle of those jobs.
• Two basic Oozie functions are:
▫ Oozie workflow jobs, which are represented as Directed Acyclic Graphs (DAGs) specifying a sequence of actions to execute.

▫ Oozie coordinator jobs, which are recurrent Oozie workflow jobs triggered by time and data availability.
Oozie provides provisions for the following:
• Integrates multiple jobs in a sequential manner.

• Stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop.

• Runs workflow jobs based on time and data triggers.

• Manages batch coordinators for applications.

Sqoop
• The loading of data into Hadoop clusters becomes an important task during data analytics.

• Apache Sqoop is a tool built for efficiently loading voluminous amounts of data between Hadoop and external data repositories that reside on enterprise application servers or relational databases such as Oracle and MySQL.

• Sqoop provides a mechanism to import data from external data stores into HDFS.
• Sqoop provides a command line interface to its users.

• Sqoop can also be accessed using Java APIs.

• The tool allows defining the schema of the data for import.

• Sqoop exploits the MapReduce framework to import and export the data and for parallel processing of sub-tasks.
FLUME
• Apache Flume provides a distributed, reliable and available service:
▫ Flume efficiently collects, aggregates and transfers a large amount of streaming data into HDFS.

▫ Flume enables the upload of large files into Hadoop clusters.

▫ The features of Flume include robustness and fault tolerance.

▫ It is useful for transferring a large amount of data in applications related to logs of network traffic, sensor data, geo-location data, emails and social media messages.
• Apache Flume has the following four important components:
▫ Sources, which accept data from a server or an application.
▫ Sinks, which receive data and store it in an HDFS repository or transmit the data to another source. Data units that are transferred over a channel from source to sink are called events.
▫ Channels, which connect sources and sinks by queuing event data for transactions. The size of event data is usually 4 KB.
▫ Agents, which run the sinks and sources in Flume. Interceptors drop the data or transfer the data as it flows through the agent.
Ambari
• Apache Ambari is a management platform for Hadoop.

• It is open source.

• It enables an enterprise to plan, securely install, manage and maintain the clusters in Hadoop.

• For advanced cluster security capabilities, Ambari uses Kerberos.
Features:
1) Simplification of installation, configuration and management.

2) Enables easy, efficient, repeatable, and automated creation of clusters.

3) Manages and monitors scalable clustering.

4) Visualizes the health of clusters and critical metrics for their operation.

5) Enables detection of faulty node links.

6) Provides extensibility and customizability.

HBase
• HBase is a Hadoop system database. HBase was created for large tables.

• HBase is an open-source, distributed, versioned and non-relational database.

• HBase is written in Java.

• It stores data in large structured tables.

• HBase stores data as key-value pairs.

HBase features:
• Uses a partial columnar data schema on top of Hadoop and HDFS.

• Supports large tables of billions of rows and millions of columns.

• Supports data compression algorithms.

• Provisions in-memory column-based data transactions.
• Stores rows sorted by row key, so rows can also be scanned serially in key order.

• Provides random, real-time read/write access to Big Data.

• Fault-tolerant storage due to automatic failure support between DataNode servers.
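A minimal sketch of HBase's key-value, random read/write access through the Java client API is shown below. The table name weblogs and the column family cf are assumptions; a table with that column family must already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("weblogs"))) {

      // Random write: store a value under (row key, column family, qualifier).
      Put put = new Put(Bytes.toBytes("row-2024-01-01"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
      table.put(put);

      // Random read: fetch the value back by row key.
      Get get = new Get(Bytes.toBytes("row-2024-01-01"));
      Result result = table.get(get);
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"))));
    }
  }
}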
HIVE
• Apache Hive is an open-source data warehouse software.

• Hive facilitates reading, writing and managing large datasets which are distributed in Hadoop files.

• The Hive design provisions for batch processing of large sets of data.

• An application of Hive is managing weblogs.

• Hive does not process real-time queries and does not update row-based data.
• Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore.

• Hive supports different storage types such as text files, sequence files, RCFiles (record columnar files) and HBase.
• Three major functions of Hive are data summarization, query and analysis.
• Hive basically interacts with structured data stored in HDFS using a query language known as HQL.
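A minimal sketch of issuing an HQL query from Java over the HiveServer2 JDBC driver is shown below. The connection URL, the weblogs table and its columns are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 usually listens on port 10000; "default" is the database name.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {

      // HQL is close to SQL: summarize page hits per URL from an assumed weblog table.
      ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url");
      while (rs.next()) {
        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

Because Hive is designed for batch processing, such a query typically compiles into cluster jobs rather than returning in real time.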
PIG
• Apache Pig is an open-source, high-level language platform.

• Pig was developed for analyzing large data sets.

• Pig executes queries on large datasets that are stored in HDFS using Hadoop.

• The language used in Pig is known as Pig Latin.

Additional features of Pig are:

• Loads the data after applying the required filters and dumps the data in the desired format.

• Requires a Java runtime environment for executing Pig Latin programs.

• Converts all operations into map and reduce tasks.

• Allows concentrating upon the complete operation irrespective of the individual mapper and reducer functions.
Mahout
• Mahout is an Apache project with a library of scalable machine learning algorithms.

• Machine learning is mostly required to enhance the future performance of a system based on previous outcomes.

• Mahout provides the learning tools to automate the finding of meaningful patterns in the Big Data sets stored in HDFS.
Mahout supports four main areas:
• Collaborative filtering, which mines user behavior and makes product recommendations.

• Clustering, which takes data items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.

• Frequent itemset mining, which analyzes items in a group and then identifies which items usually occur together.

• Classification, which means learning from existing categorization and then assigning future items to the best-fitting category.
USING APACHE SQOOP TO ACQUIRE RELATIONAL
DATA
• Sqoop is a tool designed to transfer data between
Hadoop and relational databases.

• Sqoop is used to import data from a relational


database management system (RDBMS) into the
Hadoop Distributed File System (HDFS), transform
the data in Hadoop, and then export the data back into
an RDBMS.
• Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
Apache Sqoop Import and Export
Methods
• The data import is done in two steps.

• In the first step, shown in the figure, Sqoop examines the database to gather the necessary metadata for the data to be imported.

• The second step is a map-only (no reduce step) Hadoop job that Sqoop submits to the cluster.

• This job does the actual data transfer using the metadata captured in the previous step.

• Note that each node doing the import must have access to the database.
• The imported data are saved in an HDFS directory.

• By default, these files contain comma-delimited fields, with new lines separating different records.

• You can easily override the format in which data are copied over by explicitly specifying the field separator and record terminator characters.

• Once placed in HDFS, the data are ready for processing.
• Data export from the cluster works in a similar
fashion.
• The export is done in two steps, as shown in Figure
below.
• As in the import process, the first step is to examine
the database for metadata.

• The export step again uses a map-only Hadoop job to write the data to the database.

• Sqoop divides the input data set into splits, then uses
individual map tasks to push the splits to the database.
USING APACHE FLUME TO ACQUIRE
DATA STREAMS
• Apache Flume is an independent agent designed to
collect, transport, and store data into HDFS.

• Often data transport involves a number of Flume agents that may traverse a series of machines and locations.

• Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.
• A Flume agent must have all three of these components defined:
▫ Source
▫ Channel
▫ Sink
• A Flume agent can have several sources, channels, and sinks. Sources can write to multiple channels, but a sink can take data from only a single channel.

• Data written to a channel remain in the channel until a sink removes the data.

• By default, the data in a channel are kept in memory but may be optionally stored on disk to prevent data loss in the event of an agent failure.
• Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.

• This configuration is normally used when data are collected on one machine (e.g., a web server) and sent to another machine that has access to HDFS.
• In a Flume pipeline, the sink from one agent is connected
to the source of another.
• The data transfer format normally used by Flume, which
is called Apache Avro, provides several useful features.

• First, Avro is a data serialization/deserialization system that uses a compact binary format.
• The schema is sent as part of the data exchange and is
defined using JSON (JavaScript Object Notation).

• Avro also uses remote procedure calls (RPCs) to send data.

• That is, an Avro sink will contact an Avro source to send data.
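A minimal sketch of Avro serialization in Java is shown below: the schema is written in JSON and a record is encoded into Avro's compact binary format. The Event schema and its field values are assumptions for illustration.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroEventExample {
  public static void main(String[] args) throws Exception {
    // Schema sent as part of the data exchange, defined in JSON.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"source\",\"type\":\"string\"},"
        + "{\"name\":\"body\",\"type\":\"string\"}]}");

    GenericRecord event = new GenericData.Record(schema);
    event.put("source", "webserver-01");
    event.put("body", "GET /index.html 200");

    // Encode the record into Avro's compact binary format.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(event, encoder);
    encoder.flush();

    System.out.println("Serialized " + out.size() + " bytes");
  }
}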
