
Hadoop Ecosystem

Introduction to Big Data


 Big data, as the name suggests, refers to extremely large volumes of data.

 It is a collection of large
datasets that cannot be
processed using traditional
computing techniques.
 Big data is not merely data; it has become a complete subject that involves various tools, techniques and frameworks.



IBM Definition of Big Data



What data comes under big data?
 Black Box Data
 Social Media Data
 Stock Exchange Data
 Power Grid Data
 Transport Data
 Search Engine Data



Traditional approach
 In this approach, an enterprise will have a computer to store and process big data.
 Data will be stored in an RDBMS such as Oracle Database, MS SQL Server or DB2.
 Sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis purposes.



Big Data & its challenges
 Big Data includes huge volume, high velocity, and variety of data.
 The data in it will be of three types.
◦ Structured data
◦ Semi-structured data
◦ Unstructured data
 Big Data Challenges
◦ Capturing data
◦ Storage
◦ Processing
◦ Searching
◦ Sharing
◦ Transfer
◦ Analysis
◦ Presentation



Google Solution
 Google solved this problem using an algorithm called
Map-Reduce.
 This algorithm divides the task into small parts and
assigns those parts to many computers connected over
the network, and collects the results to form the final
result dataset.



History of HADOOP
 Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
 It is a flexible and highly-available architecture for large scale computation and data processing on a network of commodity hardware.
 Designed to answer the question:
 “How to process big data with reasonable cost and time?”



Hadoop’s Developers
 2005: Doug Cutting and Mike Cafarella developed Hadoop to support distribution for the Nutch search engine project.
 Google published technical papers detailing its Google File System and MapReduce programming framework.
 Cutting and Cafarella modified earlier technology plans and developed a Java-based MapReduce implementation and a file system.
 The project was funded by Yahoo.
 2006: Yahoo gave the project to the Apache Software Foundation.
(Photo: Doug Cutting)



What is Hadoop?
 Hadoop is
◦ An open-source framework
◦ It allows storing and processing big data in a distributed environment across clusters of computers
◦ Uses simple programming models
◦ It is the most popular and powerful big data tool
◦ A project of the Apache Software Foundation
◦ Designed to scale up from single servers to thousands of machines,
◦ Each machine offering local computation and storage.
Companies using Hadoop

 Yahoo
 Google
 Facebook
 Amazon
 IBM
 & many more at
 http://wiki.apache.org/hadoop/PoweredBy



Two generations of Hadoop



Original Hadoop



Hadoop Stack Transition



Hadoop Architecture
 Hadoop has two major
layers namely −

 Processing /
Computation layer
(MapReduce)

 Storage layer
(Hadoop Distributed
File System)



MapReduce
 MapReduce −
◦ It is a parallel programming model
◦ for writing distributed applications
◦ devised at Google for efficient processing of large amounts of data,
◦ on large clusters of commodity hardware in a reliable, fault-tolerant manner.
 It is based on a master-slave architecture.
 The MapReduce program runs on Hadoop, which is an Apache open-source framework.



Hadoop Distributed File System
Hadoop Distributed File System (HDFS)
◦ It is based on the Google File System (GFS).
◦ It provides a distributed file system that is designed to run on commodity hardware.
◦ It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
◦ It provides high throughput access to application data and is suitable for applications having large datasets.



Other Hadoop components
 Hadoop framework also
includes two modules −

 Hadoop Common −
 Java libraries and
utilities required by other
Hadoop modules.

 Hadoop YARN −
 framework for job
scheduling and cluster
resource management.



How Does Hadoop Work?
 Hadoop runs code across a cluster of computers.
 This process includes the following core tasks that Hadoop performs:
 Data is initially divided into directories and files.
 Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS (on top of the local file system) supervises the processing.
 Blocks are replicated to handle hardware failure.
 Checking that the code was executed successfully.



Advantages of Hadoop
 Open Source
 Distributed Storage
 Parallel Processing
 Flexible
 Fault Tolerance
 Reliability
 High Availability
 Scalability
 Economic
 Data Locality
 Compatible with all platforms
Components of Hadoop
 Hadoop Components
 HDFS (Hadoop Distributed file system )
 Map-Reduce
 PIG
 HIVE
 HBASE
 Zookeeper
 Oozie
 Sqoop
 Flume



Hadoop Ecosystem



Hadoop Ecosystem



HDFS (Hadoop Distributed file system )
 HDFS is
◦ The primary or major component of the Hadoop ecosystem
◦ Responsible for storing large data sets of structured or unstructured data across various nodes
◦ Responsible for maintaining the metadata in the form of log files
◦ The data storage layer of Hadoop.
 HDFS splits the data unit into smaller units called blocks and stores them in a distributed manner.
 It has two daemons running.
 HDFS consists of two core components −
◦ Name Node (master node)
◦ Data Node (slave node)
HDFS (Hadoop Distributed File System)



Name Node
◦ The daemon that runs on the master server.
◦ It is the prime node which contains metadata.
◦ Stores metadata, i.e., the number of data blocks, replicas and other details.
◦ It maintains and manages the slave nodes.
◦ It assigns tasks to slaves.
◦ It is responsible for namespace management.
◦ It regulates file access by the client.
◦ Keeps track of the mapping of blocks to Data Nodes.
◦ It should be deployed on reliable hardware.
◦ All DataNodes send a heartbeat and block report to the NameNode (to ensure the Data Nodes are alive).
◦ A block report contains a list of all blocks on a data node.



DataNode
 The DataNode daemon runs on slave nodes.
 It is responsible for storing the actual business data.
 A file gets split into a number of data blocks and stored on a group of slaves.
 Data Nodes perform read/write requests of the client.
 A Data Node also creates, deletes and replicates blocks on demand from the NameNode.
 Data Nodes can be deployed on commodity hardware.



Task of Data Node
Block replica creation, deletion, and replication according to the instruction of the Name Node.
Data Node manages the data storage of the system.
Data Nodes send a heartbeat to the Name Node to report the health of HDFS.
By default, this frequency is set to 3 seconds.



HDFS with Name Node and Data Node



Block in HDFS
 A block is nothing but the smallest unit of storage on a computer system.
 It is the smallest contiguous storage allocated to a file.
 In Hadoop, the default block size is 128 MB or 256 MB.



Replication Management
 To provide fault tolerance, HDFS uses a replication technique.
 It makes copies of the blocks and stores them on different DataNodes.
 The replication factor decides how many copies of the blocks get stored.
 It is 3 by default, but we can configure it to any value.
 The NameNode receives a block report from each DataNode periodically to maintain the replication factor.
 When a block is over-replicated or under-replicated, the NameNode adds or deletes replicas as needed, as sketched below.
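The replication factor and block size are ordinary HDFS settings. As an illustrative sketch only (not from the slides), the Java snippet below sets them through Hadoop's Configuration/FileSystem API; the property names dfs.replication and dfs.blocksize are standard, while the path /user/demo/sample.txt is a made-up example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // default replication factor
        conf.set("dfs.blocksize", "134217728");  // 128 MB block size, in bytes

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (hypothetical path)
        fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
        fs.close();
    }
}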
Replication Management



Rack Awareness in HDFS
 Multiple Data Nodes reside in a rack.
 In a large Hadoop cluster,
◦ to reduce network traffic while reading/writing HDFS files,
◦ the NameNode chooses a DataNode on the same rack or a nearby rack to serve the read/write request.
 The NameNode obtains rack information by maintaining the rack IDs of each Data Node.
◦ The NameNode makes sure that all the replicas are not stored on the same rack or a single rack.
◦ It follows the Rack Awareness Algorithm to reduce latency as well as improve fault tolerance.
 The first replica of a block is placed on the local rack, the next replica on another DataNode within the same rack, and the third replica on a different rack.
HDFS Read/Write Operation
When a client wants to write a file to HDFS, it communicates with the Namenode for metadata.
The Namenode responds with the number of blocks, their locations, replicas and other details.
Based on the information from the Namenode, the file is split into multiple blocks.
After that, the client starts sending them to the first Datanode.



Write operation:
 The client first sends block A to Datanode 1 along with the details of the other two Datanodes.
 When Datanode 1 receives block A from the client, Datanode 1 copies the same block to Datanode 2 of the same rack.
 As both Datanodes are in the same rack, the block is transferred via the rack switch.
 Datanode 2 copies the same block to Datanode 3.
 As these Datanodes are in different racks, the block is transferred via an out-of-rack switch.
 When a Datanode receives the block from the client, it sends a write confirmation to the Namenode.
 The same process is repeated for each block of the file.
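The steps above are what the HDFS client library performs internally. A minimal client-side write, sketched under the assumption of a reachable cluster and a hypothetical path /user/demo/blocks.txt, could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() asks the Namenode for target Datanodes; the client then
        // streams the data block by block through the Datanode pipeline.
        FSDataOutputStream out = fs.create(new Path("/user/demo/blocks.txt"));
        out.writeBytes("Hello HDFS\n");
        out.close();                                // flush and finalize the blocks
        fs.close();
    }
}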
Write operation



Read operation:
 To read from HDFS,
 the client first communicates with the Namenode for metadata.
 The client receives the names of the files and their locations.
 The Namenode responds with the number of blocks, their locations, replicas and other details.
 Now the client communicates with the Datanodes.
 The client starts reading data in parallel from the Datanodes, based on the information received from the Namenode.
 When the application receives all the blocks of the file, it combines these blocks into the original file.
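Correspondingly, a hedged sketch of a client read (reusing the hypothetical file from the write example) is:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() fetches the block locations from the Namenode; the stream then
        // reads each block from the nearest Datanode holding a replica.
        FSDataInputStream in = fs.open(new Path("/user/demo/blocks.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}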
Read operation



MapReduce
 MapReduce −
◦ It is a parallel programming model
◦ for writing distributed applications
◦ designed for efficient processing of large amounts of data, on large clusters of commodity hardware in a reliable, fault-tolerant manner.
◦ MapReduce is a processing technique and
◦ a programming model for distributed computing based on Java.
 The MapReduce algorithm contains two important tasks, Map and Reduce.
 It is based on a master-slave architecture.


Terminologies in MapReduce
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where the data is present in advance before any processing takes place.



Terminologies in MapReduce
MasterNode − Node where JobTracker runs and
which accepts job requests from clients.
SlaveNode − Node where Map and Reduce
program runs.
JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker − Tracks the task and reports status
to JobTracker.



MapReduce Architecture



How MapReduce Organizes Work?
A job is divided into multiple tasks, which are run on multiple data nodes in a cluster.
Job Tracker:
 Accepts MapReduce jobs submitted by users
 Schedules tasks to run on different data nodes
 Assigns Map and Reduce tasks to Task Trackers
 Monitors task and Task Tracker status
 Re-executes tasks upon failure: on task failure, the Job Tracker can reschedule the task on a different Task Tracker.
 Keeps track of the overall progress of each job.



How MapReduce Organizes Work?
Task Tracker :
 Run Map and Reduce tasks upon instruction from the
Jobtracker
 Execution of individual task is done by task tracker.
 Manage storage and transmission of intermediate
output.
 It resides on every data node, executing part of the job.
 Task tracker sends the progress report to job tracker.
 Task tracker periodically sends 'heartbeat' signal to
the Job tracker about current state of the system.
How MapReduce Organizes Work?



MapReduce Architecture



MapReduce program execution
 MapReduce program executes in three stages-
◦ Map stage
◦ Shuffle stage
◦ Reduce stage

 Map stage −
◦ The map or mapper’s job is to process the input data.
◦ Input data is in the form of a file or directory and is stored in HDFS.
◦ The input file is passed to the mapper function line by line.
◦ The mapper processes the data and creates several small chunks of data.
Reduce stage −
 This stage is the combination of the Shuffle stage and the Reduce stage.
 The Reducer’s job is to process the data that comes from the mapper.
 After processing, it produces a new set of output, which will be stored in the HDFS.
 After completion of the given tasks,
◦ the cluster collects and reduces the data to form an appropriate result, and
◦ sends it back to the Hadoop server.



MapReduce Stages



MapReduce Architecture



MapReduce Phases/stages
 Input Splits:
◦ An input is divided into fixed-size pieces called input
splits
◦ Input split is a chunk of the input that is consumed by a
single map.

 Mapping
◦ In this phase data in each split is passed to a mapping
function to produce output values.

◦ In our example, the job counts the number of occurrences of each word in the input splits and prepares a list in the form of <word, frequency>, as sketched in the Mapper below.
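To make the mapping phase concrete, here is a hedged word-count Mapper sketch using Hadoop's standard org.apache.hadoop.mapreduce API; the class name WordCountMapper is illustrative, not something defined in the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in each line of the input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate pair: <word, 1>
        }
    }
}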
MapReduce Phases/stages
 Shuffling
◦ This phase consumes the output of the Mapping phase.
◦ Its task is to consolidate the relevant records from the Mapping phase output.
◦ E.g., the same words are clubbed together along with their respective frequencies.
 Reducing
◦ In this phase, output values from the Shuffling phase are aggregated.
◦ This phase combines values from the Shuffling phase and returns a single output value.
◦ This phase summarizes the complete dataset.
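A matching Reducer sketch (again illustrative, not from the slides) sums the shuffled frequencies for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> after shuffling and emits <word, total count>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // final pair: <word, frequency>
    }
}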



MapReduce word count process
 Example-1

DOG CAT RAT


CAR CAR RAT
DOG CAR CAT

 Example-2

Welcome to Hadoop Class


Hadoop is good
Hadoop is bad



Working Of Map-Reduce

Hadoop YARN
 YARN − Yet Another Resource Negotiator
 It is the resource management component of the Hadoop ecosystem.
 YARN is one of the most important components of the Hadoop ecosystem.
 YARN is called the operating system of Hadoop.
 It is responsible for managing and monitoring workloads.
 It allows multiple data processing engines to handle data stored on a single platform.



Why YARN
 In Hadoop version 1.0
◦ Referred to as MRV1 (MapReduce Version 1)
◦ MapReduce performed both processing and resource management functions.
◦ It consisted of a Job Tracker which was the single master.
◦ Job Tracker allocated the resources, performed scheduling and monitored the processing jobs.
◦ It assigned map and reduce tasks on multiple Task Trackers.
◦ Task Trackers periodically reported their progress to Job Tracker.



Why YARN
 Scalability was limited due to the single Job Tracker.
 IBM mentioned in its article that, according to Yahoo!,
◦ the practical limits of this design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently.
 Apart from this limitation, the utilization of computational resources is inefficient in MRV1.
 The Hadoop framework became limited only to the MapReduce processing paradigm.



YARN Components
 Resource Manager:
◦ Runs on a master daemon and manages the resource allocation in the
cluster.
 Node Manager:
◦ They run on the slave daemons and are responsible for the execution
of a task on every single Data Node.
 Application Master:
◦ Manages the user job lifecycle and resource needs of individual
applications.
◦ It works along with the Node Manager and monitors the execution of
tasks.
 Container:
◦ A package of resources including RAM, CPU, network, HDD, etc. on a single node.
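As a hedged illustration of how a client talks to the Resource Manager, the sketch below uses the YarnClient API to list the applications the cluster is tracking; it assumes a reachable ResourceManager configured through yarn-site.xml.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml
        yarnClient.start();

        // Ask the Resource Manager for the applications it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}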



Features of YARN
 Flexibility – Enables other purpose-built data
processing models beyond MapReduce (batch).
◦ Due to this feature of YARN, other applications can also be run
along with Map Reduce programs in Hadoop2.

 Efficiency – Many applications run on the same cluster; hence, the efficiency of Hadoop increases without much effect on quality of service.
 Shared – Provides a stable, reliable, secure foundation
and shared operational services across multiple
workloads.



RDBMS and HDFS
 Since the 1970s, RDBMS has been the solution for data storage and maintenance-related problems.
 When big data arrived,
◦ Companies realized the benefit of processing big data
◦ Started opting for solutions with Hadoop.
 Hadoop uses
◦ Distributed file system (HDFS) for storing big data
◦ MapReduce to process it.
 Hadoop excels in storing and processing of huge data of
various formats.



Limitations of Hadoop & Solution
 Hadoop can perform only batch processing.
 Data will be accessed only in a sequential manner.

 One has to search the entire dataset even for the simplest of
jobs.
A huge dataset when processed results in another huge data
set, which should also be processed sequentially.
 A solution is needed to access any data in a single unit of time (random access).
 HBase is a database that stores huge amounts of data and accesses the data in a random manner.



HBase
◦ It is a data storage component of the Hadoop ecosystem.
◦ HBase is a distributed column-oriented database.
◦ It is built on top of the Hadoop file system.
◦ It is an open-source project and is horizontally scalable.
◦ It is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data.
◦ It leverages the fault tolerance provided by the Hadoop File System (HDFS).
◦ It provides random real-time read/write access to data in the Hadoop File System.



 One can store data in HDFS either directly or through HBase
 Data consumer reads/accesses the data in HDFS randomly
using HBase.
 HBase sits on top of the Hadoop File System and provides
read and write access.



HDFS and HBase
 HDFS
◦ It is a distributed file system suitable for storing large files.
◦ HDFS does not support fast individual record lookups.
◦ It provides high-latency batch processing.
◦ It provides only sequential access to data.
 HBase
◦ HBase is a database built on top of HDFS.
◦ HBase provides fast lookups for larger tables.
◦ It provides low-latency access to single rows from billions of records (random access).
◦ HBase internally uses hash tables and provides random access.



Storage Mechanism in HBase
 In HBase:
◦ Table is a collection of rows.
◦ Row is a collection of column families.
◦ Column family is a collection of columns.
◦ Column is a collection of key value pairs.
 Column-oriented databases are designed for huge tables.



Features of HBase
◦ HBase is linearly scalable.
◦ It has automatic failure support.
◦ It provides consistent reads and writes.
◦ It provides data replication across clusters.
◦ Apache HBase is used to have random, real-time read/write access to Big Data.
◦ It hosts large tables on top of clusters of commodity hardware.
◦ HBase is a non-relational database and is schema-less.
◦ Companies such as Facebook, Twitter and Yahoo use HBase internally.
◦ HBase does not support a structured query language (SQL).
◦ HBase works well with Hive, a query engine for big data.



HIVE
 Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy.
 It is a platform used to develop SQL type scripts to do
MapReduce operations.
 Hive is not
◦ A relational database
◦ A design for OnLine Transaction Processing (OLTP)
◦ A language for real-time queries and row-level updates



Features of Hive
 It stores schema in a database and processed data into HDFS.
 It is designed for OLAP (Online Analytical Processing).
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
 HiveQL works as a translator.
◦ It translates the SQL queries into MapReduce jobs, which will be executed on Hadoop.
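A hedged example of submitting HiveQL from Java through the HiveServer2 JDBC driver follows; the host localhost:10000, the user hiveuser and the table pageviews are assumptions made for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HiveQL that Hive compiles into MapReduce jobs behind the scenes
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}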
Components of Hive
Hive was developed by Facebook and later, it was
taken over by Apache Software Foundation.
Components of the hive are-
◦ Meta Store
◦ Driver
◦ Compiler
◦ Optimizer
◦ Executor
Components of Hive
 Meta Store:
◦ The repository that stores the metadata is called the Hive metastore.
◦ It serves as a storage device for the metadata.
◦ It holds the information of each table, such as its location and schema.
◦ Metadata keeps track of the data and replicates it.
◦ It acts as a backup store in case of data loss.



Components of Hive
Driver −
 The driver receives the HiveQL instructions and acts as a controller.
 It observes the progress and life cycle of various executions.
 Whenever HiveQL executes a statement, the driver stores the metadata generated out of that action.
 After the completion of the MapReduce job, the driver collects all the data and results of the query.



Compiler −
 It is used for translating the HiveQL query into MapReduce input.
 It invokes a method that executes the steps and tasks needed to read the HiveQL output as required by MapReduce.
 Optimizer −
 Used to improve efficiency and scalability, creating a task while transforming the data before the reduce operation.
 It performs transformations like aggregation and pipeline conversion, e.g., using a single join for multiple joins.
Components of Hive
Executor:
Main task of the executor is to execute the tasks.

 The executor interacts with the Hadoop job tracker for scheduling tasks that are ready to run.



Applications of Hive
Web log Processing
Text mining
Predictive Modeling
E-Commerce Applications
Customer-facing Business Intelligence (Google
Analytics)



Pig
 It is a tool/platform.
 It is used to analyze larger sets of data, representing them as data flows.
 Pig is used with Hadoop.
 We can perform all the data manipulation operations
in Hadoop using Apache Pig.
 To write data analysis programs, Pig provides a high-
level language known as Pig Latin.
 This language provides various operators.
 Programmers can develop their own functions for
reading, writing, and processing data.
Apache Pig
 To analyze data using Apache Pig, programmers
need to write scripts using Pig Latin language.

 All these scripts are internally converted to Map and Reduce tasks.
 Apache Pig has a component known as Pig Engine.

 It accepts the Pig Latin scripts as input and converts


those scripts into MapReduce jobs.
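Pig scripts are normally written directly in Pig Latin; to keep all the examples in one language, the hedged sketch below embeds a small word-count script in Java through the PigServer API. The input file words.txt and output directory wordcount_out are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local mode for the sketch; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registered query is one Pig Latin statement.
        pig.registerQuery("lines = LOAD 'words.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Storing an alias triggers compilation into MapReduce jobs and execution.
        pig.store("counts", "wordcount_out");
        pig.shutdown();
    }
}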



Advantages of Apache Pig
The benefits of Apache Pig –
◦ Less development time
◦ Easy to learn
◦ Procedural language
◦ Dataflow
◦ Easy to control execution
◦ Usage of Hadoop features
◦ Effective for unstructured data
◦ Write our own UDF(User Defined Function)



Limitations of Apache Pig
Disadvantages of Apache Pig −
 Errors of Pig − due to UDFs.
 Not mature − still under development.
 Support − Google and StackOverflow searches do not lead to good solutions.
 Minor one − absence of a good IDE or plug-in.
 Implicit data schema − the data schema is not enforced explicitly but implicitly.
 Delay in execution −
◦ The commands are not executed until we store an intermediate or final result.
◦ This increases the iterations between debugging and resolving the issue.



Sqoop
Sqoop
◦ It is a data access component of the Hadoop ecosystem.
◦ It is a tool designed to transfer data between Hadoop and relational database servers.
◦ Used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS.
◦ Used to export data from the Hadoop file system to relational databases.



Why Sqoop?
 Analytical processing using Hadoop requires
◦ loading huge amounts of data from various sources into Hadoop.
 The process of loading bulk data into Hadoop from heterogeneous sources and then processing it has certain challenges.
 Factors for data load −
◦ Maintaining and ensuring data consistency
◦ Ensuring efficient utilization of resources.



Major Issues:
 Data load using Scripts
◦ Traditional approach of using scripts to load data is not
suitable for bulk data load into Hadoop.
◦ This approach is inefficient and very time-consuming.
 Direct access to external data via Map-Reduce
application
◦ Direct access to data residing at external systems (without
loading into Hadoop) for map-reduce applications
complicates these applications.
◦ This approach is not feasible.
 To load heterogeneous data into Hadoop, different
tools have been developed- Sqoop and Flume.
Sqoop
 Sqoop (SQL to Hadoop and Hadoop to SQL)

 As a tool it offers capability to


 extract data from non-Hadoop data stores,
 transform the data into a form usable by Hadoop
 then load the data into HDFS.

 This process is called ETL, for Extract, Transform,


and Load.
 Sqoop has a connector based architecture.
 Connectors know how to connect to the respective
data source and fetch the data.
How Sqoop Works?



Components of Sqoop
Sqoop Import
◦ Imports individual tables from RDBMS to HDFS.
◦ Each row in a table is treated as a record in HDFS.
◦ All records are stored as text data in text files

Sqoop Export
◦ Exports a set of files from HDFS back to an
RDBMS.

◦ The files given as input to Sqoop contain records, which are called rows in the table.
Features of Sqoop
 Parallel import/export − imports and exports data using YARN.
 Connectors for all RDBMS databases − covering almost the entire data.
 Import results of SQL query − imports the result returned from an SQL query.
 Incremental load − loads parts of a table whenever it is updated.
 Full load − loads all the tables from a database using a single command.
 Load data directly into Hive/HBase − loads data directly into Apache Hive.
 Compression − compresses data and loads the compressed table in Hive.
Why Zookeeper
 Distributed applications are
◦ difficult to coordinate and
◦ harder to work with and more error-prone due to the huge number of machines attached to the network.
 As many machines are involved,
◦ race conditions and deadlocks are common problems.
 Race condition − when a machine tries to perform two or more operations at a time (serialization).
 Deadlock − when two or more machines try to access the same shared resource at the same time (synchronization).
 Partial failure of a process − atomicity (either the whole process finishes or nothing does).
ZooKeeper
 ZooKeeper is a data management component of Hadoop.
 It is a distributed coordination service to manage a large set of hosts.
 Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster.
 It is used to keep the distributed system functioning together as a single unit.
 It addresses synchronization, serialization and coordination goals.
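A hedged sketch of the ZooKeeper Java client is shown below: it connects to an ensemble (localhost:2181 is an assumption), creates a znode named /demo-config and reads it back; this is the primitive on which locks, leader election and other coordination recipes are built.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ensemble; the watcher fires when the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a small piece of coordination data.
        String path = zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back (no watch, no Stat object needed for the sketch).
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data));

        zk.close();
    }
}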
Features of Zookeeper
 Synchronization − mutual exclusion and co-operation between server processes.
 Ordered messages.
 Reliability − avoids a single point of failure.
 Atomicity − data transfer either succeeds or fails.
 High performance.
 Distributed.
 High availability.
 Fault-tolerant.
 Loose coupling.
 High throughput and low latency.



Oozie
 Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
 It provides a mechanism to run a job at a given schedule.
 It is the tool in which programs can be pipelined in a desired order to work in Hadoop’s distributed environment.
 It allows combining multiple complex jobs to be run in a sequential order to achieve a bigger task.
 Oozie is a scalable, reliable and extensible system.
 Oozie is able to leverage the existing Hadoop machinery for load balancing and fail-over.
Apache Flume
 Apache Flume is a system used for moving massive
quantities of streaming data into HDFS.
 Collecting log data present in log files from web
servers and aggregating it in HDFS for analysis
 Flume has a flexible design based upon streaming data
flows.
 It is fault tolerant and robust with multiple failovers
and recovery mechanisms.
 A Flume agent is a JVM process which has three components − Flume Source, Flume Channel and Flume Sink.
Flume Architecture

