The Pig compiler translates Pig scripts into MapReduce programs, which are executed on
a Hadoop cluster.
Pig provides an interactive shell called Grunt for developing Pig scripts.
Data Types in Pig
Pig supports simple data types such as int, long, float, double, chararray, bytearray, boolean, and datetime,
and complex data types such as tuple, bag, and map. The simple data types work the same way as in other
programming languages.
Complex data types
Tuple
A tuple is an ordered set of fields. A tuple is represented with parentheses.
Bag
A bag is an unordered collection of tuples. A bag is represented with curly braces.
Map
A Map is a set of key-value pairs. Map is represented with square brackets and a # is used to separate the
key and value.
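As a sketch, the three complex types can be written as literals like this (the values are invented for illustration):

```pig
-- Tuple: an ordered set of fields, in parentheses
(John, 18, 4.0)
-- Bag: an unordered collection of tuples, in curly braces
{(John, 18), (Mary, 19)}
-- Map: key-value pairs in square brackets, with # separating key and value
[name#John, age#18]
```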
Data Filtering & Analysis
The FOREACH operator is used to process each row in a relation, and its GENERATE
clause defines the fields of the new row derived from the original.
The UNION operator can be used to merge the contents of two or more relations.
The JOIN operator is used to join two relations.
Pig provides various built-in functions such as AVG, MIN, MAX, SUM, and COUNT.
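As a sketch of these operators working together (the file names and fields are hypothetical):

```pig
grunt> emp = LOAD 'emp.txt' USING PigStorage(',')
             AS (id:int, name:chararray, deptno:int, salary:double);
grunt> depts = LOAD 'dept.txt' USING PigStorage(',')
             AS (deptno:int, dname:chararray);
-- JOIN two relations on a common field
grunt> joined = JOIN emp BY deptno, depts BY deptno;
-- FOREACH ... GENERATE with a built-in aggregate per group
grunt> by_dept = GROUP emp BY deptno;
grunt> avg_sal = FOREACH by_dept GENERATE group, AVG(emp.salary);
```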
Storing Results
To save the results on the filesystem, the STORE operator is used.
Pig uses a lazy evaluation strategy: it delays the evaluation of expressions until a STORE or DUMP
operator triggers the results to be stored or displayed.
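A sketch of lazy evaluation with STORE (the relation and path names are hypothetical):

```pig
grunt> names = FOREACH emp GENERATE name;             -- only builds the plan; nothing runs yet
grunt> STORE names INTO 'out' USING PigStorage(',');  -- triggers execution and writes results
```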
Debugging Operators
The DUMP operator is used to dump the results on the console. DUMP is used in interactive mode for
debugging purposes.
The DESCRIBE operator is used to view the schema of a relation.
The EXPLAIN operator is used to view the logical, physical, and MapReduce execution plans for computing a
relation.
The ILLUSTRATE operator is used to display the step-by-step execution of the statements that compute a relation,
using a small sample of the data.
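For example, assuming a relation emp has already been defined (a hypothetical name), the debugging operators are used like this:

```pig
grunt> DUMP emp;        -- print the relation's contents on the console
grunt> DESCRIBE emp;    -- show the schema of emp
grunt> EXPLAIN emp;     -- show the logical, physical, and MapReduce plans
grunt> ILLUSTRATE emp;  -- step-by-step execution on a small data sample
```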
Apache Pig Grunt Shell Commands
In order to write Pig Latin scripts, we use the Grunt shell of Apache Pig. Before that, note that
by using sh and fs we can invoke shell and file-system commands from it.
i. sh Command
We can invoke any shell commands from the Grunt shell using the sh command. Note,
however, that we cannot execute commands that are part of the shell environment (e.g.
cd) using the sh command.
Syntax
The syntax of the sh command is:
grunt> sh shell_command parameters
Example
By using the sh command, we can invoke the ls command of the Linux shell from the Grunt
shell. Here, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
ii. fs Command
Moreover, we can invoke any FsShell commands from the Grunt shell by using the fs command.
Syntax
The syntax of the fs command is:
grunt> fs File_System_command parameters
Example
By using the fs command, we can invoke the ls command of HDFS from the Grunt shell. Here, it lists the files
in the HDFS root directory.
grunt> fs -ls
Utility Commands
The Grunt shell also offers a set of utility commands: clear, help, history, quit, and set. In addition, there
are some commands to control Pig from the Grunt shell, such as exec, kill, and run.
i. clear Command
In order to clear the screen of the Grunt shell, we use the clear command.
Syntax
The syntax of the clear command is:
grunt> clear
ii. help Command
The help command gives you a list of Pig commands and Pig properties.
Syntax
By using the help command, we can get a list of Pig commands:
grunt> help
iii. history Command
This is a very useful command: it displays the list of statements executed so far since the
Grunt shell was invoked.
iv. set Command
Basically, to show or assign values to keys, we use the set command in Pig.
There are several keys we can set values for using this command, such as:
default_parallel
By passing any whole number as a value to this key, we can set the number of reducers for a
map job.
debug
By passing on/off to this key, we can turn the debugging feature in Pig on or off.
job.name
By passing a string value to this key, we can set the job name for the required job.
job.priority
By passing one of the following values to this key, we can set the job priority of a job:
very_low
low
normal
high
very_high
stream.skippath
For streaming, by passing the desired path in the form of a string to this key, we can set the path from
which the data is not to be transferred.
v. quit Command
We can quit from the Grunt shell using this command.
Syntax
To quit from the Grunt shell:
grunt> quit
vi. exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
The syntax of the utility command exec is:
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]
vii. kill Command
By using this command, we can kill a job from the Grunt shell.
Syntax
Given below is the syntax of the kill command:
grunt> kill JobId
viii. run Command
By using the run command, we can run a Pig script from the Grunt shell.
Syntax
The syntax of the run command is:
grunt> run [-param param_name = param_value] [-param_file file_name] script
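As a sketch, a short Grunt session combining set and exec might look like this (the script path and values are invented for illustration):

```pig
grunt> set default_parallel 10;        -- use 10 reducers
grunt> set job.name 'sales-report';    -- label the job
grunt> set job.priority high;          -- raise its priority
grunt> exec /scripts/sales_report.pig  -- execute a Pig script from the shell
```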
HIVE
• What is Hive? Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it
up and developed it further as open source under the name Apache Hive.
• Features of Hive: It stores schema in a database and processed data into HDFS (Hadoop
Distributed File System). It is designed for OLAP. It provides an SQL-type language for
querying called HiveQL or HQL. It is familiar, fast, scalable, and extensible.
• Architecture of Hive
• User Interface - Hive is data warehouse infrastructure software that can
create interaction between the user and HDFS. The user interfaces that Hive supports are
the Hive Web UI, the Hive command line, and Hive HD Insight.
• HiveQL Process Engine - HiveQL is similar to SQL, for querying on schema info in the
Metastore. It is one of the replacements of the traditional approach for MapReduce
programs. Instead of writing a MapReduce program in Java, we can write a query for a
MapReduce job and process it.
• Execution Engine - The conjunction part of the HiveQL Process Engine and MapReduce
is the Hive Execution Engine.
• The execution engine processes the query and generates results the same as
MapReduce results. It uses the flavor of MapReduce.
• Get Metadata - The compiler sends a metadata request to the Metastore.
• Metadata Ops - Meanwhile, during execution, the execution engine can execute
metadata operations with the Metastore.
• Fetch Result - The execution engine receives the results from the Data nodes.
• Send Results - The execution engine sends those resultant values to the driver.
• Send Results - The driver sends the results to the Hive interfaces.
• 11. Hive - Data Types: All the data types in Hive are classified into four types:
Column Types, Literals, Null Values, and Complex Types.
• 12. Column Types: Integral Types - Integer type data can be specified using the integral
data types, such as INT. When the data range exceeds the range of INT, you need to use
BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is
smaller than SMALLINT.
• 14. Literals: Floating Point Types - Floating point types are nothing but numbers with
decimal points. Generally, this type of data is composed of the DOUBLE data type.
Decimal Type - Decimal type data is nothing but a floating point value with a higher range
than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to
10^308.
• 13. Complex Types: Arrays - Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>. Maps - Maps in Hive are similar to Java Maps. Syntax:
MAP<primitive_type, data_type>. Structs - Structs in Hive are similar to using complex
data with comments. Syntax: STRUCT<col_name : data_type [COMMENT col_comment, ...]>.
• 19. Partition: Hive organizes tables into partitions. It is a way of dividing a table into
related parts based on the values of partitioned columns such as date, city, and
department. Using partitions, it is easy to query a portion of the data.
Adding a partition - Syntax:
hive> ALTER TABLE employee ADD PARTITION (year='2013') LOCATION '/2012/part2012';
Dropping a partition - Syntax:
hive> ALTER TABLE employee DROP [IF EXISTS] PARTITION (year='2013');
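As a sketch, the column and complex types above could be combined in one hypothetical table definition (all table and column names are invented for illustration):

```sql
CREATE TABLE employee_profile (
  id      INT,
  age     TINYINT,
  salary  DOUBLE,
  skills  ARRAY<STRING>,
  phones  MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING>
)
PARTITIONED BY (year STRING);
```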
THANK YOU