
Hadoop Ecosystem

Introduction to Big Data


 Big data, as the name suggests, refers to extremely large volumes of data.

 It is a collection of large
datasets that cannot be
processed using traditional
computing techniques.
 Big data is not merely data; it has become a complete subject that involves various tools, techniques and frameworks.



IBM Definition of Big Data



What data comes under big data?
 Black Box Data
 Social Media Data
 Stock Exchange Data
 Power Grid Data
 Transport Data
 Search Engine Data



Traditional approach
 In this approach, an enterprise will have a computer to store and process big data.
 Data will be stored in an RDBMS such as Oracle Database, MS SQL Server or DB2.
 Sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis purposes.



Big Data & its challenges
 Big Data includes huge volume, high velocity, and variety of data.
 The data in it will be of three types.
◦ Structured data
◦ Semi-structured data
◦ Unstructured data
 Big Data Challenges
◦ Capturing data
◦ Storage
◦ Processing
◦ Searching
◦ Sharing
◦ Transfer
◦ Analysis
◦ Presentation



Google Solution
 Google solved this problem using an algorithm called
Map-Reduce.
 This algorithm divides the task into small parts and
assigns those parts to many computers connected over
the network, and collects the results to form the final
result dataset.



History of HADOOP
 Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
 It is a flexible and highly-available architecture for large scale computation and data processing on a network of commodity hardware.
 Designed to answer the question:
 “How to process big data with reasonable cost and time?”



Hadoop’s Developers
 2005: Doug Cutting and Mike Cafarella developed Hadoop to support distribution for the Nutch search engine project.
 Google published technical papers detailing its Google File System and MapReduce programming framework.
 Cutting and Cafarella modified earlier technology plans and developed a Java-based MapReduce implementation and a file system.
 The project was funded by Yahoo.
 2006: Yahoo gave the project to the Apache Software Foundation.
(Photo: Doug Cutting)



What is Hadoop?
 Hadoop is
◦ An open-source framework
◦ It allows storing and processing big data in a distributed environment across clusters of computers
◦ Uses simple programming models
◦ It is the most popular and powerful big data tool
◦ A project of the Apache Software Foundation
◦ Designed to scale up from single servers to thousands of machines,
◦ Each machine offering local computation and storage.
Companies using Hadoop

 Yahoo
 Google
 Facebook
 Amazon
 IBM
 & many more at
 http://wiki.apache.org/hadoop/PoweredBy



Two generations of Hadoop



Original Hadoop



Hadoop Stack Transition



Hadoop Architecture
 Hadoop has two major
layers namely −

 Processing /
Computation layer
(MapReduce)

 Storage layer
(Hadoop Distributed
File System)



MapReduce
 MapReduce −
◦ It is a parallel programming model
◦ for writing distributed applications
◦ devised at Google for efficient processing of large amounts of data,
◦ on large clusters of commodity hardware in a reliable, fault-tolerant manner.
 It is based on a master-slave architecture.
 The MapReduce program runs on Hadoop, which is an Apache open-source framework.



Hadoop Distributed File System
Hadoop Distributed File System (HDFS)
◦ It is based on the Google File System (GFS).
◦ It provides a distributed file system that is designed to run on commodity hardware.
◦ It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
◦ It provides high throughput access to application data and is suitable for applications having large datasets.



Other Hadoop components
 Hadoop framework also
includes two modules −

 Hadoop Common −
 Java libraries and
utilities required by other
Hadoop modules.

 Hadoop YARN −
 framework for job
scheduling and cluster
resource management.



How Does Hadoop Work?
 Hadoop runs code across a cluster of computers.
 This process includes the following core tasks that Hadoop performs:
 Data is initially divided into directories and files.
 Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS (on top of the local file system) supervises the processing.
 Blocks are replicated to handle hardware failure.
 Checking that the code was executed successfully.



Advantages of Hadoop
 Open Source
 Distributed Storage
 Parallel Processing
 Flexible
 Fault Tolerance
 Reliability
 High Availability
 Scalability
 Economic
 Data Locality
 Compatible with all platforms
Components of Hadoop
 Hadoop Components
 HDFS (Hadoop Distributed file system )
 Map-Reduce
 PIG
 HIVE
 HBASE
 Zookeeper
 Oozie
 Sqoop
 Flume



Hadoop Ecosystem



Hadoop Ecosystem



HDFS (Hadoop Distributed file system )
 HDFS is
◦ The primary or major component of the Hadoop ecosystem
◦ Responsible for storing large data sets of structured or unstructured data across various nodes
◦ Responsible for maintaining the metadata in the form of log files
◦ The data storage layer of Hadoop.
 HDFS splits the data unit into smaller units called blocks and stores them in a distributed manner.
 It has two daemons running.
 HDFS consists of two core components −
◦ Name Node (master node)
◦ Data Node (slave node)
HDFS (Hadoop Distributed File System)



Name Node
◦ The daemon that runs on the master server.
◦ It is the prime node which contains metadata.
◦ Stores metadata, i.e., the number of data blocks, replicas and other details.
◦ It maintains and manages the slave nodes.
◦ It assigns tasks to slaves.
◦ It is responsible for namespace management.
◦ It regulates file access by the client.
◦ Keeps track of the mapping of blocks to Data Nodes.
◦ It should be deployed on reliable hardware.
◦ All DataNodes send a heartbeat and block report to the NameNode (to ensure the Data Nodes are alive).
◦ A block report contains a list of all blocks on a data node.



DataNode
 The DataNode daemon runs on slave nodes.
 It is responsible for storing the actual business data.
 A file gets split into a number of data blocks and stored on a group of slaves.
 Data Nodes perform read/write requests of the client.
 A Data Node also creates, deletes and replicates blocks on demand from the NameNode.
 Data Nodes can be deployed on commodity hardware.



Task of Data Node
Block replica creation, deletion, and replication according to the instruction of the Name Node.
Data Node manages the data storage of the system.
Data Nodes send a heartbeat to the Name Node to report the health of HDFS.
By default, this frequency is set to 3 seconds.



HDFS with Name Node and Data Node



Block in HDFS
 A block is nothing but the smallest unit of storage on a computer system.
 It is the smallest contiguous storage allocated to a file.
 In Hadoop, the default block size is 128 MB or 256 MB.



Replication Management
 To provide fault tolerance, HDFS uses a replication technique.
 It makes copies of the blocks and stores them on different DataNodes.
 The replication factor decides how many copies of the blocks get stored.
 It is 3 by default, but we can configure it to any value.
 The NameNode receives a block report from each DataNode periodically to maintain the replication factor.
 When a block is over-replicated or under-replicated, the NameNode adds or deletes replicas as needed, as sketched below.
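The replication factor and block size are ordinary HDFS settings. As an illustrative sketch only (not from the slides), the Java snippet below sets them through Hadoop's Configuration/FileSystem API; the property names dfs.replication and dfs.blocksize are standard, while the path /user/demo/sample.txt is a made-up example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // default replication factor
        conf.set("dfs.blocksize", "134217728");  // 128 MB block size, in bytes

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (hypothetical path)
        fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
        fs.close();
    }
}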
Replication Management



Rack Awareness in HDFS
 Multiple Data Nodes reside in a rack.
 In a large Hadoop cluster,
◦ to reduce network traffic while reading/writing HDFS files,
◦ the NameNode chooses a DataNode on the same rack or a nearby rack to serve the read/write request.
 The NameNode obtains rack information by maintaining the rack IDs of each Data Node.
◦ The NameNode makes sure that all the replicas are not stored on the same rack or a single rack.
◦ It follows the Rack Awareness Algorithm to reduce latency as well as improve fault tolerance.
 The first replica of a block is placed on the local rack, the next replica on another DataNode within the same rack, and the third replica on a different rack.
HDFS Read/Write Operation
When a client wants to write a file to HDFS, it communicates with the Namenode for metadata.
The Namenode responds with the number of blocks, their locations, replicas and other details.
Based on the information from the Namenode, the file is split into multiple blocks.
After that, the client starts sending them to the first Datanode.



Write operation:
 The client first sends block A to Datanode 1 along with the details of the other two Datanodes.
 When Datanode 1 receives block A from the client, Datanode 1 copies the same block to Datanode 2 of the same rack.
 As both Datanodes are in the same rack, the block is transferred via the rack switch.
 Datanode 2 copies the same block to Datanode 3.
 As these Datanodes are in different racks, the block is transferred via an out-of-rack switch.
 When a Datanode receives the block from the client, it sends a write confirmation to the Namenode.
 The same process is repeated for each block of the file.
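The steps above are what the HDFS client library performs internally. A minimal client-side write, sketched under the assumption of a reachable cluster and a hypothetical path /user/demo/blocks.txt, could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() asks the Namenode for target Datanodes; the client then
        // streams the data block by block through the Datanode pipeline.
        FSDataOutputStream out = fs.create(new Path("/user/demo/blocks.txt"));
        out.writeBytes("Hello HDFS\n");
        out.close();                                // flush and finalize the blocks
        fs.close();
    }
}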
Write operation



Read operation:
 To read from HDFS,
 the client first communicates with the Namenode for metadata.
 The client receives the names of the files and their locations.
 The Namenode responds with the number of blocks, their locations, replicas and other details.
 Now the client communicates with the Datanodes.
 The client starts reading data in parallel from the Datanodes, based on the information received from the Namenode.
 When the application receives all the blocks of the file, it combines these blocks into the original file.
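Correspondingly, a hedged sketch of a client read (reusing the hypothetical file from the write example) is:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() fetches the block locations from the Namenode; the stream then
        // reads each block from the nearest Datanode holding a replica.
        FSDataInputStream in = fs.open(new Path("/user/demo/blocks.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}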
Read operation



MapReduce
 MapReduce −
◦ It is a parallel programming model
◦ for writing distributed applications
◦ designed for efficient processing of large amounts of data, on large clusters of commodity hardware in a reliable, fault-tolerant manner.
◦ MapReduce is a processing technique and
◦ a programming model for distributed computing based on Java.
 The MapReduce algorithm contains two important tasks, Map and Reduce.
 It is based on a master-slave architecture.


Terminologies in MapReduce
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where the data is present in advance before any processing takes place.



Terminologies in MapReduce
MasterNode − Node where JobTracker runs and
which accepts job requests from clients.
SlaveNode − Node where Map and Reduce
program runs.
JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker − Tracks the task and reports status
to JobTracker.



MapReduce Architecture



How MapReduce Organizes Work?
A job is divided into multiple tasks, which are run on multiple data nodes in a cluster.
Job Tracker:
 Accepts MapReduce jobs submitted by users
 Schedules tasks to run on different data nodes
 Assigns Map and Reduce tasks to Task Trackers
 Monitors task and Task Tracker status
 Re-executes tasks upon failure: on task failure, the Job Tracker can reschedule the task on a different Task Tracker.
 Keeps track of the overall progress of each job.



How MapReduce Organizes Work?
Task Tracker :
 Run Map and Reduce tasks upon instruction from the
Jobtracker
 Execution of individual task is done by task tracker.
 Manage storage and transmission of intermediate
output.
 It resides on every data node, executing part of the job.
 Task tracker sends the progress report to job tracker.
 Task tracker periodically sends 'heartbeat' signal to
the Job tracker about current state of the system.
How MapReduce Organizes Work?



MapReduce Architecture



MapReduce program execution
 MapReduce program executes in three stages-
◦ Map stage
◦ Shuffle stage
◦ Reduce stage

 Map stage −
◦ The map or mapper’s job is to process the input data.
◦ Input data is in the form of a file or directory and is stored in HDFS.
◦ The input file is passed to the mapper function line by line.
◦ The mapper processes the data and creates several small chunks of data.
Reduce stage −
 This stage is the combination of the Shuffle stage and the Reduce stage.
 The Reducer’s job is to process the data that comes from the mapper.
 After processing, it produces a new set of output, which will be stored in the HDFS.
 After completion of the given tasks,
◦ the cluster collects and reduces the data to form an appropriate result, and
◦ sends it back to the Hadoop server.



MapReduce Stages



MapReduce Architecture



MapReduce Phases/stages
 Input Splits:
◦ An input is divided into fixed-size pieces called input
splits
◦ Input split is a chunk of the input that is consumed by a
single map.

 Mapping
◦ In this phase data in each split is passed to a mapping
function to produce output values.

◦ In our example, the job counts the number of occurrences of each word in the input splits and prepares a list in the form of <word, frequency>, as sketched in the Mapper below.
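To make the mapping phase concrete, here is a hedged word-count Mapper sketch using Hadoop's standard org.apache.hadoop.mapreduce API; the class name WordCountMapper is illustrative, not something defined in the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in each line of the input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate pair: <word, 1>
        }
    }
}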
MapReduce Phases/stages
 Shuffling
◦ This phase consumes the output of the Mapping phase.
◦ Its task is to consolidate the relevant records from the Mapping phase output.
◦ E.g., the same words are clubbed together along with their respective frequencies.
 Reducing
◦ In this phase, output values from the Shuffling phase are aggregated.
◦ This phase combines values from the Shuffling phase and returns a single output value.
◦ This phase summarizes the complete dataset.
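A matching Reducer sketch (again illustrative, not from the slides) sums the shuffled frequencies for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> after shuffling and emits <word, total count>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // final pair: <word, frequency>
    }
}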



MapReduce word count process
 Example-1

DOG CAT RAT


CAR CAR RAT
DOG CAR CAT

 Example-2

Welcome to Hadoop Class


Hadoop is good
Hadoop is bad



Working Of Map-Reduce

Hadoop YARN
 YARN − Yet Another Resource Negotiator
 It is the resource management component of the Hadoop ecosystem.
 YARN is one of the most important components of the Hadoop ecosystem.
 YARN is called the operating system of Hadoop.
 It is responsible for managing and monitoring workloads.
 It allows multiple data processing engines to handle data stored on a single platform.



Why YARN
 In Hadoop version 1.0
◦ Referred to as MRV1 (MapReduce Version 1)
◦ MapReduce performed both processing and resource management functions.
◦ It consisted of a Job Tracker which was the single master.
◦ Job Tracker allocated the resources, performed scheduling and monitored the processing jobs.
◦ It assigned map and reduce tasks on multiple Task Trackers.
◦ Task Trackers periodically reported their progress to Job Tracker.



Why YARN
 Scalability was limited due to the single Job Tracker.
 IBM mentioned in its article that, according to Yahoo!,
◦ the practical limits of this design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently.
 Apart from this limitation, the utilization of computational resources is inefficient in MRV1.
 The Hadoop framework became limited only to the MapReduce processing paradigm.



YARN Components
 Resource Manager:
◦ Runs on a master daemon and manages the resource allocation in the
cluster.
 Node Manager:
◦ They run on the slave daemons and are responsible for the execution
of a task on every single Data Node.
 Application Master:
◦ Manages the user job lifecycle and resource needs of individual
applications.
◦ It works along with the Node Manager and monitors the execution of
tasks.
 Container:
◦ A package of resources including RAM, CPU, network, HDD, etc. on a single node.
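As a hedged illustration of how a client talks to the Resource Manager, the sketch below uses the YarnClient API to list the applications the cluster is tracking; it assumes a reachable ResourceManager configured through yarn-site.xml.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml
        yarnClient.start();

        // Ask the Resource Manager for the applications it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}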



Features of YARN
 Flexibility – Enables other purpose-built data
processing models beyond MapReduce (batch).
◦ Due to this feature of YARN, other applications can also be run
along with Map Reduce programs in Hadoop2.

 Efficiency – Many applications run on the same cluster; hence, the efficiency of Hadoop increases without much effect on quality of service.
 Shared – Provides a stable, reliable, secure foundation
and shared operational services across multiple
workloads.



RDBMS and HDFS
 Since the 1970s, RDBMS has been the solution for data storage and maintenance-related problems.
 When big data arrived,
◦ Companies realized the benefit of processing big data
◦ Started opting for solutions with Hadoop.
 Hadoop uses
◦ Distributed file system (HDFS) for storing big data
◦ MapReduce to process it.
 Hadoop excels in storing and processing of huge data of
various formats.



Limitations of Hadoop & Solution
 Hadoop can perform only batch processing.
 Data will be accessed only in a sequential manner.

 One has to search the entire dataset even for the simplest of
jobs.
A huge dataset when processed results in another huge data
set, which should also be processed sequentially.
 A solution is needed to access any data in a single unit of time (random access).
 HBase is a database that stores huge amounts of data and accesses the data in a random manner.



HBase
◦ It is a data storage component of the Hadoop ecosystem.
◦ HBase is a distributed column-oriented database.
◦ It is built on top of the Hadoop file system.
◦ It is an open-source project and is horizontally scalable.
◦ It is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data.
◦ It leverages the fault tolerance provided by the Hadoop File System (HDFS).
◦ It provides random real-time read/write access to data in the Hadoop File System.



 One can store data in HDFS either directly or through HBase
 Data consumer reads/accesses the data in HDFS randomly
using HBase.
 HBase sits on top of the Hadoop File System and provides
read and write access.



HDFS and HBase
 HDFS
◦ It is a distributed file system suitable for storing large files.
◦ HDFS does not support fast individual record lookups.
◦ It provides high-latency batch processing.
◦ It provides only sequential access to data.
 HBase
◦ HBase is a database built on top of HDFS.
◦ HBase provides fast lookups for larger tables.
◦ It provides low-latency access to single rows from billions of records (random access).
◦ HBase internally uses hash tables and provides random access.



Storage Mechanism in HBase
 In HBase:
◦ Table is a collection of rows.
◦ Row is a collection of column families.
◦ Column family is a collection of columns.
◦ Column is a collection of key value pairs.
 Column-oriented databases are designed for huge tables.



Features of HBase
◦ HBase is linearly scalable.
◦ It has automatic failure support.
◦ It provides consistent reads and writes.
◦ It provides data replication across clusters.
◦ Apache HBase is used to have random, real-time read/write access to Big Data.
◦ It hosts large tables on top of clusters of commodity hardware.
◦ HBase is a non-relational database and is schema-less.
◦ Companies such as Facebook, Twitter and Yahoo use HBase internally.
◦ HBase does not support a structured query language (SQL).
◦ HBase works well with Hive, a query engine for big data.



HIVE
 Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy.
 It is a platform used to develop SQL type scripts to do
MapReduce operations.
 Hive is not
◦ A relational database
◦ A design for OnLine Transaction Processing (OLTP)
◦ A language for real-time queries and row-level updates



Features of Hive
 It stores schema in a database and processed data into HDFS.
 It is designed for OLAP (Online Analytical Processing).
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
 HiveQL works as a translator.
◦ It translates the SQL queries into MapReduce jobs, which will be executed on Hadoop.
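A hedged example of submitting HiveQL from Java through the HiveServer2 JDBC driver follows; the host localhost:10000, the user hiveuser and the table pageviews are assumptions made for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HiveQL that Hive compiles into MapReduce jobs behind the scenes
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}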
Components of Hive
Hive was developed by Facebook and later, it was
taken over by Apache Software Foundation.
Components of the hive are-
◦ Meta Store
◦ Driver
◦ Compiler
◦ Optimizer
◦ Executor
Components of Hive
 Meta Store:
◦ The repository that stores the metadata is called the Hive metastore.
◦ It serves as a storage device for the metadata.
◦ It holds the information of each table, such as its location and schema.
◦ Metadata keeps track of the data and replicates it.
◦ It acts as a backup store in case of data loss.



Components of Hive
Driver −
 The driver receives the HiveQL instructions and acts as a controller.
 It observes the progress and life cycle of various executions.
 Whenever HiveQL executes a statement, the driver stores the metadata generated out of that action.
 After the completion of the MapReduce job, the driver collects all the data and results of the query.



Compiler −
 It is used for translating the HiveQL query into MapReduce input.
 It invokes a method that executes the steps and tasks needed to read the HiveQL output as required by MapReduce.
 Optimizer −
 Used to improve efficiency and scalability, creating a task while transforming the data before the reduce operation.
 It performs transformations like aggregation and pipeline conversion, e.g., using a single join for multiple joins.
Components of Hive
Executor:
Main task of the executor is to execute the tasks.

 The executor interacts with the Hadoop job tracker for scheduling tasks that are ready to run.



Applications of Hive
Web log Processing
Text mining
Predictive Modeling
E-Commerce Applications
Customer-facing Business Intelligence (Google
Analytics)



Pig
 It is a tool/platform.
 It is used to analyze larger sets of data, representing them as data flows.
 Pig is used with Hadoop.
 We can perform all the data manipulation operations
in Hadoop using Apache Pig.
 To write data analysis programs, Pig provides a high-
level language known as Pig Latin.
 This language provides various operators.
 Programmers can develop their own functions for
reading, writing, and processing data.
Apache Pig
 To analyze data using Apache Pig, programmers
need to write scripts using Pig Latin language.

 All these scripts are internally converted to Map and Reduce tasks.
 Apache Pig has a component known as Pig Engine.

 It accepts the Pig Latin scripts as input and converts


those scripts into MapReduce jobs.
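Pig scripts are normally written directly in Pig Latin; to keep all the examples in one language, the hedged sketch below embeds a small word-count script in Java through the PigServer API. The input file words.txt and output directory wordcount_out are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local mode for the sketch; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registered query is one Pig Latin statement.
        pig.registerQuery("lines = LOAD 'words.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Storing an alias triggers compilation into MapReduce jobs and execution.
        pig.store("counts", "wordcount_out");
        pig.shutdown();
    }
}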



Advantages of Apache Pig
The benefits of Apache Pig –
◦ Less development time
◦ Easy to learn
◦ Procedural language
◦ Dataflow
◦ Easy to control execution
◦ Usage of Hadoop features
◦ Effective for unstructured data
◦ Write our own UDF(User Defined Function)



Limitations of Apache Pig
Disadvantages of Apache Pig −
 Errors of Pig − due to UDFs.
 Not mature − still under development.
 Support − Google and StackOverflow searches do not lead to good solutions.
 Minor one − absence of a good IDE or plug-in.
 Implicit data schema − the data schema is not enforced explicitly but implicitly.
 Delay in execution −
◦ The commands are not executed until we store an intermediate or final result.
◦ This increases the iterations between debugging and resolving the issue.



Sqoop
Sqoop
◦ It is a data access component of the Hadoop ecosystem.
◦ It is a tool designed to transfer data between Hadoop and relational database servers.
◦ Used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS.
◦ Used to export data from the Hadoop file system to relational databases.



Why Sqoop?
 Analytical processing using Hadoop requires
◦ loading huge amounts of data from various sources into Hadoop.
 The process of loading bulk data into Hadoop from heterogeneous sources and then processing it has certain challenges.
 Factors for data load −
◦ Maintaining and ensuring data consistency
◦ Ensuring efficient utilization of resources.



Major Issues:
 Data load using Scripts
◦ Traditional approach of using scripts to load data is not
suitable for bulk data load into Hadoop.
◦ This approach is inefficient and very time-consuming.
 Direct access to external data via Map-Reduce
application
◦ Direct access to data residing at external systems (without
loading into Hadoop) for map-reduce applications
complicates these applications.
◦ This approach is not feasible.
 To load heterogeneous data into Hadoop, different
tools have been developed- Sqoop and Flume.
Sqoop
 Sqoop (SQL to Hadoop and Hadoop to SQL)

 As a tool it offers capability to


 extract data from non-Hadoop data stores,
 transform the data into a form usable by Hadoop
 then load the data into HDFS.

 This process is called ETL, for Extract, Transform,


and Load.
 Sqoop has a connector based architecture.
 Connectors know how to connect to the respective
data source and fetch the data.
How Sqoop Works?



Components of Sqoop
Sqoop Import
◦ Imports individual tables from RDBMS to HDFS.
◦ Each row in a table is treated as a record in HDFS.
◦ All records are stored as text data in text files

Sqoop Export
◦ Exports a set of files from HDFS back to an
RDBMS.

◦ The files given as input to Sqoop contain records, which are called rows in the table.
Features of Sqoop
 Parallel import/export − imports and exports data using YARN.
 Connectors for all RDBMS databases − covering almost the entire data.
 Import results of SQL query − imports the result returned from an SQL query.
 Incremental load − loads parts of a table whenever it is updated.
 Full load − loads all the tables from a database using a single command.
 Load data directly into Hive/HBase − loads data directly into Apache Hive.
 Compression − compresses data and loads the compressed table in Hive.
Why Zookeeper
 Distributed applications are
◦ difficult to coordinate and
◦ harder to work with and more error-prone due to the huge number of machines attached to the network.
 As many machines are involved,
◦ race conditions and deadlocks are common problems.
 Race condition − when a machine tries to perform two or more operations at a time (serialization).
 Deadlock − when two or more machines try to access the same shared resource at the same time (synchronization).
 Partial failure of a process − atomicity (either the whole process finishes or nothing does).
ZooKeeper
 ZooKeeper is a data management component of Hadoop.
 It is a distributed coordination service to manage a large set of hosts.
 Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster.
 It is used to keep the distributed system functioning together as a single unit.
 It addresses synchronization, serialization and coordination goals.
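A hedged sketch of the ZooKeeper Java client is shown below: it connects to an ensemble (localhost:2181 is an assumption), creates a znode named /demo-config and reads it back; this is the primitive on which locks, leader election and other coordination recipes are built.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ensemble; the watcher fires when the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a small piece of coordination data.
        String path = zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back (no watch, no Stat object needed for the sketch).
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data));

        zk.close();
    }
}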
Features of Zookeeper
 Synchronization − mutual exclusion and co-operation between server processes.
 Ordered messages.
 Reliability − avoids a single point of failure.
 Atomicity − data transfer either succeeds or fails.
 High performance.
 Distributed.
 High availability.
 Fault-tolerant.
 Loose coupling.
 High throughput and low latency.



Oozie
 Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
 It provides a mechanism to run a job at a given schedule.
 It is the tool in which programs can be pipelined in a desired order to work in Hadoop’s distributed environment.
 It allows combining multiple complex jobs to be run in a sequential order to achieve a bigger task.
 Oozie is a scalable, reliable and extensible system.
 Oozie is able to leverage the existing Hadoop machinery for load balancing and fail-over.
Apache Flume
 Apache Flume is a system used for moving massive
quantities of streaming data into HDFS.
 Collecting log data present in log files from web
servers and aggregating it in HDFS for analysis
 Flume has a flexible design based upon streaming data
flows.
 It is fault tolerant and robust with multiple failovers
and recovery mechanisms.
 A Flume agent is a JVM process which has three components − Flume Source, Flume Channel and Flume Sink.
Flume Architecture

