Pig, Hive, and Jaql
IBM Information Management
Cloud Computing Center of Competence
IBM Toronto Lab
Agenda
Overview
Pig
Hive
Jaql
Similarities of Pig, Hive and Jaql
All translate their respective high-level languages to MapReduce jobs
All offer significant reductions in program size over Java
All provide points of extension to cover gaps in functionality
All provide interoperability with other languages
None support random reads/writes or low-latency queries
Comparing Pig, Hive, and Jaql
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo! | Facebook | IBM
Language name | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures it operates on | Complex, nested | | JSON
Schema optional? | Yes | No, but data can have many schemas | Yes
Relational complete? | Yes | Yes | Yes
Turing complete? | Yes when extended with Java UDFs | Yes when extended with Java UDFs | Yes
Agenda
Overview
Pig
Hive
Jaql
Pig components
Two components
Language (called Pig Latin)
Compiler
Two execution environments
Local (single JVM): pig -x local
Distributed (Hadoop cluster): pig -x mapreduce, or simply pig
Running Pig
Script: pig scriptfile.pig
Grunt (command line): pig (to launch the command line tool)
Embedded: call in to Pig from Java
Sample code
# pig
grunt> records = LOAD 'econ_assist.csv'
           USING PigStorage(',')
           AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
           GENERATE group, SUM(records.sum);
grunt> DUMP thesum;
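To make the data flow in the sample concrete, here is a minimal Python sketch of what GROUP ... BY and SUM compute. It is an illustration only, not how Pig executes the script; the three rows are hypothetical stand-ins for the contents of econ_assist.csv.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for econ_assist.csv: (country, sum)
records = [("Kenya", 100), ("Peru", 40), ("Kenya", 25)]

# GROUP records BY country: collect each country's tuples into a bag
grouped = defaultdict(list)
for country, amount in records:
    grouped[country].append((country, amount))

# FOREACH grouped GENERATE group, SUM(records.sum)
thesum = sorted(
    (country, sum(amount for _, amount in bag))
    for country, bag in grouped.items()
)
print(thesum)
```

Each group keeps its full tuples, mirroring how Pig's GROUP produces a bag per key before the aggregate is applied.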
Pig Latin
Statements, operations and commands
A Pig Latin program is a collection of statements.
A statement is an operation or a command.
Example of an operation: LOAD 'statement.txt';
Example of a command: ls *.txt
Logical plan/physical plan
As each statement is processed, it is added to the logical plan.
When a statement such as 'DUMP relation' is reached, the logical plan is compiled to a physical plan and executed.
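The deferred execution described above can be sketched in Python. The Plan class and its method names are hypothetical and exist only for this illustration; Pig's real planner is far more involved.

```python
class Plan:
    """Toy stand-in for a logical plan: steps are recorded, not run."""

    def __init__(self):
        self.steps = []     # recorded operations, in order
        self.executed = 0   # how many steps have actually run

    def add(self, fn):
        self.steps.append(fn)   # only extends the plan; no work happens yet
        return self

    def dump(self, data):
        # Reaching DUMP: only now are the recorded steps applied, in order
        for fn in self.steps:
            data = fn(data)
            self.executed += 1
        return data


plan = Plan()
plan.add(lambda rows: [r for r in rows if r[1] > 1985])   # like FILTER
plan.add(lambda rows: [(r[0],) for r in rows])            # like FOREACH ... GENERATE
ran_before_dump = plan.executed                            # nothing has run yet
result = plan.dump([(1, 1987), (2, 1980)])
print(ran_before_dump, result)
```

Deferring work until an output statement is reached lets the whole sequence be compiled at once rather than one statement at a time.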
Pig Latin statements
UDF statements: REGISTER, DEFINE
Commands
Hadoop Filesystem (cat, ls, etc.)
Hadoop MapReduce (kill)
Utility (exec, help, quit, run, set)
Diagnostic operators: DESCRIBE, EXPLAIN, ILLUSTRATE
Pig Latin Relational operators
Loading and storing (LOAD, STORE, DUMP)
Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)
Grouping and joining (JOIN, COGROUP, GROUP, CROSS)
Sorting (ORDER, LIMIT)
Combining and splitting (UNION, SPLIT)
Pig Latin Relations and schemata
Result of a relational operator is a relation
A relation is a bag (an unordered collection) of tuples
Relations can be named using an alias
x = LOAD 'sample.txt' AS (id: int, year: int);   -- x is the alias
DUMP x;
(1,1987)   <- a tuple
Structure of a relation is a schema
DESCRIBE x;
x: {id: int, year: int}   <- the schema
Pig Latin expressions
An expression is part of a statement containing a relational operator
Categories of expressions:
Constant
Field
Projection
Map lookup
Cast
Arithmetic
Conditional
Boolean
Comparison
Functional
Flatten
Pig Latin Data types
Simple types:
int
long
float
double
bytearray
chararray
Complex types:
Tuple: Sequence of fields of any type
Bag: Unordered collection of tuples
Map: Set of key-value pairs. Keys must be chararray.
Pig Latin Function types
Eval
Input: One or more expressions
Output: An expression
Example: MAX
Filter
Input: Bag or map
Output: boolean
Example: IsEmpty
Load
Input: Data from external storage
Output: A relation
Example: PigStorage
Store
Input: A relation
Output: Data to external storage
Example: PigStorage
Pig Latin User-Defined Functions
Written in Java
Packaged in a JAR file
Register the JAR file using the REGISTER statement
Optionally, alias it with the DEFINE statement
Agenda
Overview
Pig
Hive
Jaql
Hive - Configuration
Three ways to configure Hive:
hive-site.xml (fs.default.name, mapred.job.tracker, metastore configuration settings)
hive --hiveconf (command-line option)
SET command in the Hive shell
Running Hive
Hive Shell
Interactive
hive
Script
hive -f myscript
Inline
hive -e 'SELECT * FROM mytable'
Hive services
hive --service servicename
where servicename can be:
hiveserver: server for Thrift, JDBC, ODBC clients
hwi: web interface
jar: hadoop jar with Hive jars in classpath
metastore: out-of-process metastore
Sample code
# hive
hive> CREATE TABLE foreign_aid
      (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;
Hive - Metastore
Stores Hive metadata
Configurations
Embedded: in-process metastore, in-process database
Local: in-process metastore, out-of-process database
Remote: out-of-process metastore, out-of-process database
Hive Schema-On-Read
Faster loads into the database (simply copy or move)
Slower queries
Flexibility: multiple schemas for the same data
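The trade-off above can be sketched in Python: the raw text is stored untouched, and a schema is applied only when the data is read. The sample data, the read_with_schema helper, and the column names are hypothetical and used only for illustration.

```python
import csv
import io

# Hypothetical raw file contents; "loading" is just a copy, nothing is parsed yet
raw = "Kenya,125\nPeru,40\n"


def read_with_schema(text, schema):
    """Apply (name, type) pairs to every row at query time."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


# Two different schemas over the same unchanged bytes
typed = read_with_schema(raw, [("country", str), ("sum", int)])
as_text = read_with_schema(raw, [("c1", str), ("c2", str)])
print(typed)
print(as_text)
```

Parsing on every read is what makes queries slower, while the unparsed storage is what makes loads cheap and allows several schemas over one dataset.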
Hive Query Language (HiveQL)
SQL dialect
Does not support full SQL92 specification
No support for:
HAVING clause in SELECT
Correlated subqueries
Subqueries outside FROM clauses
Updateable or materialized views
Stored procedures
Extensions
MySQL-like extensions
MapReduce extensions
Multi-table insert, MAP, REDUCE, TRANSFORM clauses
Data Types
Simple: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING
Complex: ARRAY, MAP, STRUCT
Built-in Functions
SHOW FUNCTIONS
DESCRIBE FUNCTION
Hive - Tables
Managed (CREATE TABLE)
LOAD: File moved into Hive's data warehouse directory
DROP: Both metadata and data deleted
External (CREATE EXTERNAL TABLE)
LOAD: No files moved
DROP: Only metadata deleted
Use EXTERNAL when:
Sharing data between Hive and other Hadoop applications
You wish to use multiple schemas on the same data
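The difference in DROP behavior can be modeled with a small Python sketch. The Warehouse class, its methods, and the paths are hypothetical; this is a toy model of the semantics listed above, not Hive's implementation.

```python
class Warehouse:
    """Toy model: metadata and data files are tracked separately."""

    def __init__(self):
        self.metadata = {}   # table name -> (location, external flag)
        self.files = set()   # data locations that currently exist

    def create(self, name, location, external=False):
        self.metadata[name] = (location, external)
        self.files.add(location)

    def drop(self, name):
        location, external = self.metadata.pop(name)   # metadata is always removed
        if not external:
            self.files.discard(location)               # managed: data is removed too


wh = Warehouse()
wh.create("managed_aid", "/warehouse/managed_aid")
wh.create("external_aid", "/data/shared/aid", external=True)
wh.drop("managed_aid")
wh.drop("external_aid")
print(wh.metadata, wh.files)
```

After both drops only the external table's data remains, which is why external tables suit data shared with other Hadoop applications.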
Hive - Partitioning
Can make some queries faster
Divide data based on partition column
Use PARTITIONED BY clause when creating table
Use PARTITION clause when loading data
SHOW PARTITIONS will show a table's partitions
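A rough Python sketch of why partitioning can speed up queries: when a query restricts the partition column, only the matching directory needs to be read. The layout, the year column, and the scan helper are hypothetical.

```python
# Hypothetical partitioned layout: one directory per value of the partition column
table = {
    "year=2009": [("Kenya", 100), ("Peru", 40)],
    "year=2010": [("Kenya", 25)],
}


def scan(table, year=None):
    """Return (rows, partitions_read); a year predicate prunes other partitions."""
    wanted = [p for p in table if year is None or p == f"year={year}"]
    rows = [row for p in wanted for row in table[p]]
    return rows, len(wanted)


rows, parts_read = scan(table, year=2010)
all_rows, all_parts = scan(table)
print(rows, parts_read, len(all_rows), all_parts)
```

Queries that do not filter on the partition column still read every partition, which is why only some queries benefit.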
Hive - Bucketing
Can make some queries faster
Supports sampling data
Use CLUSTERED BY clause when creating table
For sorted buckets, use SORTED BY clause when creating table
To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
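Bucketing and bucket-based sampling can be sketched in Python as follows. The sketch assumes a plain modulus on an integer key, which is a simplification of whatever hash a real system applies; the bucket count and key range are hypothetical.

```python
NUM_BUCKETS = 4  # hypothetical, in the spirit of CLUSTERED BY (id) INTO 4 BUCKETS


def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Assumption: for an integer key, the bucket is simply key mod bucket count
    return key % num_buckets


buckets = {b: [] for b in range(NUM_BUCKETS)}
for user_id in range(20):
    buckets[bucket_for(user_id)].append(user_id)

# Sampling one bucket out of four reads about a quarter of the rows
sample = buckets[0]
print(sample)
```

Because each row's bucket is determined by its key, a sample can be taken by reading whole buckets instead of scanning the full table.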
Hive Storage formats
Delimited Text (default)
ROW FORMAT DELIMITED
SerDe (Serializer/Deserializer)
ROW FORMAT SERDE serdename
e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'
Binary SerDe
Row-oriented (Sequence file): STORED AS SEQUENCEFILE
Column-oriented (RCFile): STORED AS RCFILE
Hive User-Defined Functions
Written in Java
Three UDF types:
UDF: Input: single row, output: single row
UDAF: Input: multiple rows, output: single row
UDTF: Input: single row, output: multiple rows
Register UDF using ADD JAR
Create alias using CREATE TEMPORARY FUNCTION
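The three input/output shapes can be illustrated with plain Python functions. Real Hive UDFs are Java classes, so this only sketches the row cardinality of each type; the function names are hypothetical.

```python
def udf_upper(value):
    """UDF shape: one row in, one row out."""
    return value.upper()


def udaf_total(values):
    """UDAF shape: many rows in, one row out."""
    return sum(values)


def udtf_split(value, sep=","):
    """UDTF shape: one row in, many rows out."""
    return value.split(sep)


print(udf_upper("kenya"), udaf_total([100, 25]), udtf_split("a,b,c"))
```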
Agenda
Overview
Pig
Hive
Jaql
Jaql
A JSON Query Language
JSON = JavaScript Object Notation
Running Jaql
Jaql Shell
Interactive: jaqlshell
Batch: jaqlshell -b myscript.jaql
Inline: jaqlshell -e jaqlstatement
Modes
Cluster: jaqlshell -c
Minicluster: jaqlshell
Jaql Query Language
Sources and sinks
e.g. Copy data from a local file (the source) to a new file on HDFS (the sink)
read(file('input.json')) -> write(hdfs('output'))
Core Operators
Filter
Group
Tee
Transform
Join
Sort
Expand
Union
Top
Variables
= operator binds source output to a variable
e.g. $tweets = read(hdfs('twitterfeed'))
Pipes, streams, and consumers
Pipe operator (->) streams data to a consumer
Pipe expects array as input
e.g. $tweets -> filter $.from_src == 'tweetdeck';
$ is an implicit variable referencing the current array value
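The pipe-and-consumer idea can be approximated in Python. The pipe and jaql_filter helpers and the sample records are hypothetical; the lambda argument plays the role of Jaql's implicit $ variable.

```python
# Hypothetical array of records standing in for the contents of $tweets
tweets = [
    {"from_src": "tweetdeck", "text": "a"},
    {"from_src": "web", "text": "b"},
]


def jaql_filter(predicate):
    """Return a consumer that keeps the array elements matching the predicate."""
    return lambda array: [item for item in array if predicate(item)]


def pipe(array, *consumers):
    """Feed an array through each consumer in turn, like chained -> operators."""
    for consumer in consumers:
        array = consumer(array)
    return array


# Roughly: $tweets -> filter $.from_src == 'tweetdeck'
result = pipe(tweets, jaql_filter(lambda d: d["from_src"] == "tweetdeck"))
print(result)
```

Each consumer takes an array and returns an array, which is what allows several of them to be chained in one pipeline.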
Categories of Built-in Functions

system
schema
core
xml
hadoop
regex
io
binary
array
date
index
nil
agg
number
string
function
random
record
Jaql Data Storage
Data store examples
Amazon S3
DB2
HBase
HDFS
HTTP
JDBC
Local FS
Data format examples
JSON
CSV
XML
Thank you!
