
Department of Collegiate and Technical Education

Big Data Analytics

Department of Computer Science and Engineering

Module – 2:
Session – 1:
• ESSENTIAL HADOOP TOOLS
Objectives:

 The Pig scripting tool is introduced as a way to quickly examine data both locally and on a
Hadoop cluster.

 The Hive SQL-like query tool is explained using two examples.

 The Sqoop RDBMS tool is used to import and export data between MySQL and HDFS.

 The Flume streaming data transport utility is configured to capture weblog data into HDFS.

 The Oozie workflow manager is used to run basic and complex Hadoop workflows.

 The distributed HBase database is used to store and access data on a Hadoop cluster.
Using Apache Pig:

What is Pig in Hadoop?

• Pig Hadoop is basically a high-level programming language that is helpful for the analysis of
huge datasets. Pig Hadoop was developed by Yahoo! and is generally used with Hadoop to
perform a lot of data administration operations.

• For writing data analysis programs, Pig provides a high-level language called Pig Latin. Pig Latin supplies several operators with which programmers can develop their own functions for reading, writing, and processing data.

• For analyzing data with Apache Pig, we write scripts in Pig Latin. These scripts are then transformed into MapReduce tasks by the Pig Engine.
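• As a quick, hedged illustration (the input path and schema below are invented for the example, not taken from the source), a minimal Pig Latin script might look like the following; the Pig Engine turns these statements into one or more MapReduce jobs:

    -- load a comma-separated file from HDFS with an assumed schema
    users   = LOAD '/user/hdfs/people.csv' USING PigStorage(',')
              AS (name:chararray, age:int, city:chararray);
    adults  = FILTER users BY age >= 18;           -- keep rows with age >= 18
    by_city = GROUP adults BY city;                -- group the filtered rows by city
    counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS n;
    DUMP counts;                                   -- print the results to the console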
Why Apache Pig?

• By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the Java
programming language.

• Now, the question that arises in our minds is ‘Why Pig?’

• The need for Apache Pig came up when many programmers weren't comfortable with Java and struggled to work with Hadoop, especially when MapReduce tasks had to be performed.

• Apache Pig came into the Hadoop world as a boon for all such programmers.
 After the introduction of Pig Latin, programmers are able to work on MapReduce tasks without writing complicated Java code.

 To reduce the length of code, Apache Pig uses a multi-query approach, which cuts development time by roughly 16-fold.

 Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we have
little knowledge of SQL.

• For supporting data operations such as filters, joins, and ordering, Apache Pig provides several built-in operators (a short sketch follows).
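• A hedged sketch of those built-in operators (the file paths, schemas, and relation names are invented for illustration):

    -- hypothetical inputs loaded with the default tab-delimited loader
    orders = LOAD '/data/orders.tsv' AS (oid:int, uid:int, amount:double);
    users  = LOAD '/data/users.tsv'  AS (uid:int, name:chararray);
    joined = JOIN orders BY uid, users BY uid;       -- built-in inner join
    named  = FOREACH joined GENERATE users::name AS name, orders::amount AS amount;
    ranked = ORDER named BY amount DESC;             -- built-in ordering, largest first
    DUMP ranked;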
Features of Pig Hadoop:

• There are several features of Apache Pig:

• In-built operators: Apache Pig provides a very good set of operators for performing several
data operations like sort, join, filter, etc.

• Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to write a Pig
script.

• Automatic optimization: The tasks in Apache Pig are automatically optimized, which lets programmers concentrate only on the semantics of the language.

• Handles all kinds of data: Apache Pig can analyze both structured and unstructured data and
store the results in HDFS.
Apache Pig Architecture:
• The main reason why programmers have started using Hadoop Pig is that it converts the scripts
into a series of MapReduce tasks making their job easy. Below is the architecture of Pig
Hadoop:
Pig Hadoop framework has four main components:
• Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser is responsible for checking the syntax of the script, along with other miscellaneous checks. The parser outputs a Directed Acyclic Graph (DAG) in which the Pig Latin statements and logical operators are represented as nodes.
• Optimizer: The logical plan (DAG) produced by the parser is passed to a logical optimizer, which is responsible for carrying out the logical optimizations.
• Compiler: The compiler receives the output of the optimizer, compiles the optimized logical plan, and converts it into a series of MapReduce tasks or jobs.
• Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs are sent to
Hadoop in a properly sorted order, and these jobs are executed on Hadoop for yielding the desired
result.
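• You can see this parser/optimizer/compiler pipeline at work yourself: Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans for a relation. A small hedged example in the Grunt shell (the relation names and input path are hypothetical):

    grunt> users   = LOAD '/data/users.tsv' AS (uid:int, name:chararray, city:chararray);
    grunt> by_city = GROUP users BY city;
    grunt> EXPLAIN by_city;     -- prints the logical, physical, and MapReduce plans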
Downloading and installing Pig Hadoop:
• Follow the steps below for the Apache Pig installation. These steps work on Linux (CentOS or Ubuntu) or on Windows using a virtual machine such as Cloudera's. In this tutorial section on 'Pig Hadoop', we are using CentOS.
• Step 1: Download the Pig.tar file by writing the following command on your Terminal:
• wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
• Step 2: Extract the tar file (you downloaded in the previous step) using the following command:
•  tar -xzf pig-0.16.0.tar.gz

• This command extracts the tar file. To check whether the file was extracted, run the command ls to list the directory contents. If a pig-0.16.0 directory is listed, the Pig file has been successfully extracted.
• Step 3: Edit the .bashrc file to add the Apache Pig environment variables. This is required so that Pig can be run from any directory instead of having to change into the Pig installation directory to execute Pig commands. Other applications can also locate Pig through the path set in this file.

• Open ~/.bashrc in a text editor, add the Pig environment variables at the end of the file, save and close it, and then, on your terminal, reload the file so that the changes take effect.
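• The exact lines depend on where the tarball was extracted; a typical hedged sketch (the path below is an assumption based on the pig-0.16.0 download above) is:

    # add at the end of ~/.bashrc; adjust the path to where pig-0.16.0 was extracted
    export PIG_HOME=/home/hadoop/pig-0.16.0
    export PATH=$PATH:$PIG_HOME/bin

    # then reload the file so the current shell picks up the changes
    source ~/.bashrc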
• Step 4: Check the Pig version. To confirm that Pig is successfully installed, you can run the following command: pig -version
• Step 5: Start the Grunt shell (used to run Pig Latin scripts) by running the command: pig
• By default, Pig runs in MapReduce mode, which requires access to a Hadoop cluster and an HDFS installation.
• There is also a local mode, in which Pig runs against the local host and the local file system. You can start local mode with the command: pig -x local
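• A short hedged example of both modes (the script name is hypothetical):

    pig -x local                # Grunt shell against the local file system
    pig -x mapreduce            # Grunt shell against the Hadoop cluster (the default)
    pig -x local myscript.pig   # run a Pig Latin script locally, without touching HDFS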

• I hope you were able to successfully install Apache Pig.


• In this section on Apache Pig, we learned ‘What is Apache Pig?’, why we need Pig, its features,
architecture, and finally the installation of Apache Pig.
Using Apache Hive:
• Apache Hive is an open-source data warehouse system that has been built on top of Hadoop.
You can use Hive for analyzing and querying large datasets that are stored in Hadoop files.
Processing structured and semi-structured data can be done by using Hive. Hive is considered
the de facto standard for interactive SQL queries over petabytes of data using Hadoop and offers
the following features:
1. Tools to enable easy data extraction, transformation, and loading (ETL)
2. A mechanism to impose structure on a variety of data formats
3. Access to files stored either directly in HDFS or in other data storage systems such as HBase
4. Query execution via MapReduce and Tez (Optimized MapReduce)
• Hive provides users who are already familiar with SQL the capability to query the data on
Hadoop clusters.
• At the same time, Hive makes it possible for programmers who are familiar with the MapReduce framework to add their custom mappers and reducers to Hive queries.
• Hive queries can also be dramatically accelerated using the Apache Tez framework under
YARN in Hadoop version 2.
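• Where Tez is installed, switching the execution engine is a one-line session setting. A hedged sketch (it assumes Tez is already deployed on the cluster and uses a hypothetical web_logs table):

    -- inside the hive shell
    SET hive.execution.engine=mr;     -- classic MapReduce execution
    SELECT COUNT(*) FROM web_logs;

    SET hive.execution.engine=tez;    -- switch the session to Tez
    SELECT COUNT(*) FROM web_logs;    -- the same query, typically much faster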
What is Hive in Hadoop?
• Don’t you think writing MapReduce jobs is tedious work? Well, with Hadoop Hive, you can
just go ahead and submit SQL queries and perform MapReduce jobs.
• So, if you are comfortable with SQL, then Hive is the right tool for you as you will be able to
work on MapReduce tasks efficiently.
• Similar to Pig, Hive has its own language, called HiveQL (HQL). It is similar to SQL. HQL
translates SQL-like queries into MapReduce jobs, like what Pig Latin does. The best part is that
you don’t need to learn Java to work with Hadoop Hive.
• Hadoop Hive runs on our system and converts SQL queries into a set of jobs for execution on a Hadoop cluster. Basically, Hadoop Hive organizes data into tables, providing a method for attaching structure to data stored in HDFS.
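• A minimal sketch of that idea (the table, columns, and input path are invented for the example): create a table, load a file that already sits in HDFS into it, and query it with ordinary SQL; Hive turns the SELECT into MapReduce jobs behind the scenes.

    -- HiveQL sketch; names and paths are assumptions
    CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/user/hdfs/employees.csv' INTO TABLE employees;

    SELECT name, salary FROM employees WHERE salary > 50000;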
Why do we need Hadoop Hive?
• Let’s now talk about the need for Hive. To understand that, let’s see what Facebook did with its 
big data.
• Basically, there were a lot of challenges faced by Facebook before they had finally implemented
Apache Hive.
• One of those challenges was the volume of data being generated on a daily basis. Traditional relational databases queried with SQL weren't able to handle the pressure of such a huge amount of data.
• Because of this, Facebook was looking for better options. It started using MapReduce in the beginning to overcome this problem. But it was very difficult to work with MapReduce, as it required programming expertise in Java.
• Later on, Facebook realized that Hadoop Hive had the potential to actually overcome the challenges it
faced.
• Apache Hive helps developers avoid writing complex MapReduce tasks.
• Hadoop Hive is extremely fast, scalable, and extensible.
• Since Apache Hive is comparable to SQL, it is easy for the SQL developers as well to
implement Hive queries.
• Additionally, Hive decreases the complexity of MapReduce by providing an interface through which a user can submit SQL queries.
• So, technically, you don’t need to learn Java for working with Apache Hive.
Hive Architecture:
• Let’s now talk about the Hadoop Hive architecture and the major working force behind Apache
Hive.
• The components of Apache Hive are as follows:
o Driver: The driver acts as a controller receiving HiveQL statements. It begins the
execution of statements by creating sessions.
o It is responsible for monitoring the life cycle and the progress of the execution. Along with
that, it also saves the important metadata that has been generated during the execution of
the HiveQL statement.
o Metastore: The metastore stores the metadata of all tables. Since Hive includes partition metadata, it helps the driver track the progress of the various datasets distributed across the cluster. The metastore keeps this metadata in an RDBMS.
o Compiler: The compiler performs the compilation of a HiveQL query. It transforms the
query into an execution plan that contains tasks.
o Optimizer: The optimizer performs many transformations on the execution plan to produce an optimized DAG. It aggregates several transformations together, such as converting a pipeline of joins into a single join, and it can also split tasks to provide better performance.
o Executor: After compilation and optimization are complete, the executor runs the tasks. It is responsible for pipelining the tasks.
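o You can watch the compiler and optimizer at work with Hive's EXPLAIN statement, which prints the plan of stages without running the query. A hedged one-liner, reusing the hypothetical employees table from the earlier sketch:

    EXPLAIN SELECT name, AVG(salary) FROM employees GROUP BY name;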

Features of Apache Hive:
• Let’s now look at the features of Apache Hive:
 Hive provides easy data summarization, analysis, and query support.
 Hive supports external tables, making it feasible to process data without having to store it in HDFS first (a sketch follows this list).
 Since Hadoop's native interfaces are low-level, Hive fits in well as a higher-level layer.
 Hive supports partitioning of data for better performance.
 A rule-based optimizer in Hive is responsible for optimizing logical plans.
 Hadoop can process external data using Hive.
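 A hedged sketch of the external-table and partitioning features above (the directory layout and column names are assumptions): the data stays where it already is, and Hive simply overlays a schema and partition keys on it.

    -- the table points at files that already exist; dropping it leaves them intact
    CREATE EXTERNAL TABLE access_logs (ip STRING, url STRING, bytes BIGINT)
      PARTITIONED BY (log_date STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/access_logs';

    -- register one day's directory as a partition, then query just that slice
    ALTER TABLE access_logs ADD PARTITION (log_date='2015-06-01')
      LOCATION '/data/access_logs/2015-06-01';

    SELECT url, SUM(bytes) FROM access_logs
    WHERE log_date = '2015-06-01'
    GROUP BY url;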
Limitations of Apache Hive:
• Though Hive is a progressive tool, it has some limitations as well.
 Apache Hive doesn't offer real-time queries.
 Online transaction processing (OLTP) is not well supported by Apache Hive.
 Hive queries can have noticeable latency.
• That is all for this Apache Hive tutorial. In this section about Apache Hive, you learned about
Hive that is present on top of Hadoop and is used for data analysis. It uses a language called
HiveQL that translates SQL-like queries into relevant MapReduce jobs. In the upcoming
section of this Hadoop tutorial, you will be learning about Hadoop clusters.
Differences between Hive and Pig:
Hive                                     Pig
Used for data analysis                   Used for data and programs
Used for processing structured data      Used for semi-structured data
Has HiveQL as its language               Has Pig Latin as its language
Used for creating reports                Used for programming
Works on the server side                 Works on the client side
Does not support Avro                    Supports Avro

Using Apache Sqoop to Acquire Relational Data:
• Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use
Sqoop to import data from a relational database management system (RDBMS) into the Hadoop
Distributed File System (HDFS), transform the data in Hadoop, and then export the data back
into an RDBMS.
• Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
• In version 1 of Sqoop, data were accessed using connectors written for specific databases. Version 2 no longer supports these connectors, or version 1's direct data transfer from an RDBMS to Hive or HBase, or data transfer from Hive or HBase to your RDBMS.
• Instead, version 2 offers more generalized ways to accomplish these tasks.
• Sqoop is an automated, bulk data transfer tool that allows simple import and export of data between structured data stores (relational databases, enterprise data warehouses, and NoSQL systems) and the Hadoop ecosystem.
Key features of Sqoop
It has the following features:
 A JDBC-based implementation is used
 Automatic generation of tedious user-side code
 Integration with Hive
 Extensible backend
• Why Sqoop?
 Forcing MapReduce to access data in an RDBMS directly is repetitive, error-prone, and costlier than expected.
 Data must be prepared for effective MapReduce consumption.
Apache Sqoop Import and Export Methods:
• The figure below describes the Sqoop data import (to HDFS) process. The data import is done in two steps. In the first step, shown in the figure, Sqoop examines the database to gather the necessary metadata for the data to be imported. The second step is a map-only Hadoop job that Sqoop submits to the cluster. This job does the actual data transfer using the metadata captured in the previous step. Note that each node doing the import must have access to the database.
• The imported data are saved in an HDFS directory. Sqoop will use the database table name for the directory, or the user can specify an alternative directory where the files should be populated. By default, these files contain comma-delimited fields, with new lines separating different records. You can easily override the format in which data are copied over by explicitly specifying the field separator and record terminator characters. Once placed in HDFS, the data are ready for processing.
[Figure: Two-step Apache Sqoop data import method. (1) The Sqoop job gathers metadata from the RDBMS; (2) map-only tasks transfer the data into HDFS storage.]
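• A representative import command might look like the sketch below (the host, database, table, credentials, and paths are all placeholders); the --fields-terminated-by option overrides the default comma delimiter mentioned above:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoopuser -P \
      --table customers \
      --target-dir /user/hdfs/customers \
      --fields-terminated-by '\t' \
      --num-mappers 4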


• Data export from the cluster works in a similar fashion. The export is done in two steps, as shown in the figure below.
• As in the import process, the first step is to examine the database for metadata. The export step again uses a map-only Hadoop job to write the data to the database.
• Sqoop divides the input data set into splits, then uses individual map tasks to push the splits to
the database.
• Again, this process assumes the map tasks have access to the database.
[Figure: Two-step Sqoop data export method. (1) The Sqoop job gathers metadata from the RDBMS; (2) map-only tasks push the data from HDFS storage to the RDBMS.]
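• A matching export sketch (the target table must already exist in the RDBMS; all names and paths below are placeholders):

    sqoop export \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoopuser -P \
      --table customer_summary \
      --export-dir /user/hdfs/customer_summary \
      --input-fields-terminated-by '\t' \
      --num-mappers 4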
