Big data is a collection of large datasets that cannot be processed using traditional computing techniques.
Big data is not merely data; it has become a complete subject, involving various tools, techniques and frameworks.
Yahoo
Google
Facebook
Amazon
IBM
& many more at
http://wiki.apache.org/hadoop/PoweredBy
Processing/computation layer (MapReduce)
Storage layer (Hadoop Distributed File System, HDFS)
Hadoop Common − Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − framework for job scheduling and cluster resource management.
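To show how these pieces fit together, here is a minimal driver sketch in the Hadoop Java API: it configures a MapReduce job, points it at HDFS input and output paths, and submits it to the cluster for YARN to schedule. The class names (WordCountDriver, and WordCountMapper, which refers to the mapper sketched after the Map stage description below) and the paths are illustrative assumptions, not from the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: configures a MapReduce job and submits it to the
// cluster, where YARN schedules it and HDFS supplies the input blocks.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class); // mapper sketched later
        // No reducer set: Hadoop applies its identity reducer by default.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}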
HDFS splits the data into smaller units called blocks and stores them in a distributed manner.
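For example, with HDFS's default block size of 128 MB (Hadoop 2.x onward), a 500 MB file is stored as four blocks: three full 128 MB blocks plus one 116 MB block, each replicated (three times by default) across different DataNodes.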
Map stage −
◦ The map or mapper's job is to process the input data.
◦ Data in each input split is passed to a mapping function, which produces intermediate output values (key-value pairs).
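A minimal mapper sketch in the Hadoop Java API, assuming the classic word-count example (the class name WordCountMapper and the word-count logic are illustrative, not from the slides): each line of an input split is passed to map(), which emits intermediate (word, 1) key-value pairs.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper sketch: map() is called once per line of the split
// and emits a (word, 1) pair for every token it finds.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line in the split, value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate (word, 1) pair
        }
    }
}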
Example-2
One has to search the entire dataset even for the simplest of
jobs.
A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially.
A solution is needed that can access any item of data in a single unit of time (random access).
HBase is a database that stores huge amounts of data and provides random access to it.
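A minimal sketch of random access through the HBase Java client API, assuming a table named "users" with a column family "info" already exists (the table, row key, and values are hypothetical): it writes one cell and then reads the same row back directly by its key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Random read/write of a single row by key, without scanning the dataset.
public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user42", column info:name
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same row back directly by its key (random access)
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}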
Components of Hive
Optimizer:
Transforms the execution plan to improve efficiency and scalability: for example, it may split a task or apply a transformation to the data before the reduce operation.
It performs transformations such as aggregation, and converts a pipeline of multiple joins into a single join.
Executor:
Executes the tasks produced by the compiler, in order of their dependencies, using the underlying execution engine (such as MapReduce).
Sqoop Export
◦ Exports a set of files from HDFS back to an RDBMS.
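A typical invocation might look like the following sketch; the connection string, credentials, table name, and HDFS directory are placeholder assumptions:

sqoop export \
  --connect jdbc:mysql://localhost/exampledb \
  --username dbuser \
  --table employees \
  --export-dir /user/hadoop/employees

Each file under --export-dir is parsed into records, which become rows inserted into the target table.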