
Pig, Hive, and Jaql

IBM Information Management


Cloud Computing Center of Competence
IBM Toronto Lab

Agenda

Overview
Pig
Hive
Jaql

Similarities of Pig, Hive and Jaql

All translate their respective high-level languages to MapReduce jobs
All offer significant reductions in program size over Java
All provide points of extension to cover gaps in functionality
All provide interoperability with other languages
None support random reads/writes or low-latency queries

Comparing Pig, Hive, and Jaql


Characteristic                 | Pig                              | Hive                               | Jaql
Developed by                   | Yahoo!                           | Facebook                           | IBM
Language name                  | Pig Latin                        | HiveQL                             | Jaql
Type of language               | Data flow                        | Declarative (SQL dialect)          | Data flow
Data structures it operates on | Complex, nested                  |                                    | JSON
Schema optional?               | Yes                              | No, but data can have many schemas |
Relational complete?           | Yes                              | Yes                                | Yes
Turing complete?               | Yes when extended with Java UDFs | Yes when extended with Java UDFs   | Yes

Agenda

Overview
Pig
Hive
Jaql

Pig components

Two components
Language (called Pig Latin)
Compiler

Two execution environments
Local (single JVM): pig -x local
Distributed (Hadoop cluster): pig -x mapreduce, or simply pig

Running Pig

Script
pig scriptfile.pig

Grunt (command line)


pig (to launch command line tool)

Embedded
Call in to Pig from Java

Sample code
# pig
grunt> records = LOAD 'econ_assist.csv'
           USING PigStorage(',')
           AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
           GENERATE group, SUM(records.sum);
grunt> DUMP thesum;
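
For readers unfamiliar with GROUP and SUM, the following Python sketch shows roughly what the sample script computes. It is only an illustration of the result, not of how Pig runs the job; the rows and the helper name group_and_sum are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for econ_assist.csv: (country, sum)
records = [
    ("Kenya", 100),
    ("Peru", 40),
    ("Kenya", 25),
]

def group_and_sum(rows):
    """Mimic GROUP ... BY country followed by SUM over each group."""
    totals = defaultdict(int)
    for country, amount in rows:
        totals[country] += amount
    # Sorted only to make the display stable; no ordering is implied by GROUP.
    return sorted(totals.items())

thesum = group_and_sum(records)
print(thesum)
```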

Pig Latin
Statements, operations and commands

A Pig Latin program is a collection of statements.


A statement is an operation or a command

Example of an operation: LOAD 'statement.txt';

Example of a command: ls *.txt

Logical plan/physical plan


As each statement is processed, it is added to the logical plan
When a statement such as 'DUMP relation' is reached, the logical plan is compiled to a physical plan and executed
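
The deferred execution described above can be sketched in Python. This is a toy model under stated assumptions: the class LazyPlan and its methods are hypothetical, and it only records steps and runs them on dump; it does not model Pig's compilation to a physical plan or to MapReduce jobs.

```python
class LazyPlan:
    """Toy stand-in for a logical plan: steps are recorded, not run."""

    def __init__(self):
        self.steps = []      # recorded operations (the "logical plan")
        self.executed = 0    # how many steps have actually run

    def add(self, name, func):
        """Record a step without executing it."""
        self.steps.append((name, func))
        return self

    def dump(self):
        """Run every recorded step in order, as a DUMP would trigger."""
        data = None
        for _name, func in self.steps:
            data = func(data)
            self.executed += 1
        return data

plan = LazyPlan()
plan.add("load", lambda _: [("a", 1), ("b", 5), ("a", 3)])
plan.add("filter", lambda rows: [r for r in rows if r[1] > 1])
ran_before_dump = plan.executed   # nothing has run yet
result = plan.dump()              # execution happens here
```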

Pig Latin statements

UDF Statements
REGISTER, DEFINE

Commands
Hadoop Filesystem (cat, ls, etc.)
Hadoop MapReduce (kill)
Utility (exec, help, quit, run, set)

Diagnostic Operators
DESCRIBE, EXPLAIN, ILLUSTRATE

Pig Latin Relational operators

Loading and storing (LOAD, STORE, DUMP)
Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)
Grouping and joining (JOIN, COGROUP, GROUP, CROSS)
Sorting (ORDER, LIMIT)
Combining and splitting (UNION, SPLIT)

Pig Latin Relations and schemata

Result of a relational operator is a relation
A relation is a set of tuples
Relations can be named using an alias

x = LOAD 'sample.txt' AS (id: int, year: int);    -- x is the alias
DUMP x;
(1,1987)    -- a tuple

Structure of a relation is a schema

DESCRIBE x;
x: {id: int, year: int}    -- the schema

Pig Latin expressions

Part of a statement containing a relational operator

Categories of expressions:
Constant
Field
Projection
Map lookup
Cast
Arithmetic
Conditional
Boolean
Comparison
Functional
Flatten

Pig Latin Data types


Simple types:
int
long
float
double
bytearray
chararray

Complex types:
Tuple: Sequence of fields of any type
Bag: Unordered collection of tuples
Map: Set of key-value pairs. Keys must be chararray.

Pig Latin Function types

Eval
Input: One or more expressions
Output: An expression
Example: MAX

Filter
Input: Bag or map
Output: boolean
Example: IsEmpty

Pig Latin Function types

Load
Input: Data from external storage
Output: A relation
Example: PigStorage

Store
Input: A relation
Output: Data to external storage
Example: PigStorage

Pig Latin User-Defined Functions


Written in Java
Packaged in a JAR file
Register JAR file using the REGISTER statement
Optionally, alias it with DEFINE statement

Agenda

Overview
Pig
Hive
Jaql

Hive - Configuration
Three ways to configure Hive:
hive-site.xml (fs.default.name, mapred.job.tracker, metastore configuration settings)
hive -hiveconf option on the command line
SET command in the Hive shell

Running Hive

Hive Shell

Interactive
hive

Script
hive -f myscript

Inline
hive -e 'SELECT * FROM mytable'

Hive services
hive --service servicename
where servicename can be:

hiveserver
server for Thrift, JDBC, ODBC clients

hwi
web interface

jar
hadoop jar with Hive jars in classpath

metastore
out of process metastore

Sample code
# hive
hive> CREATE TABLE foreign_aid
      (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;

Hive - Metastore

Stores Hive metadata

Configurations
Embedded: in-process metastore, in-process database
Local: in-process metastore, out-of-process database
Remote: out-of-process metastore, out-of-process database

Hive Schema-On-Read

Faster loads into the database (simply copy or move)
Slower queries
Flexibility: multiple schemas for the same data
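
To make the schema-on-read idea concrete, here is a small Python sketch in which the same raw text is interpreted under two different schemas at read time. The data, the function read_with_schema, and the column names are assumptions made for illustration; nothing here reflects Hive's internals.

```python
import csv
import io

# Hypothetical raw file contents; nothing is validated when the data is stored.
RAW = "Kenya,125\nPeru,40\n"

def read_with_schema(raw, schema):
    """Apply (name, converter) pairs to each row at query time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        rows.append(
            {name: conv(value) for (name, conv), value in zip(schema, fields)}
        )
    return rows

# The same bytes, read under two different schemas.
typed = read_with_schema(RAW, [("country", str), ("sum", int)])
as_text = read_with_schema(RAW, [("label", str), ("amount", str)])
```

The conversion cost is paid on every read, which is one way to see why loads are fast and queries are slower.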

Hive Query Language (HiveQL)


SQL dialect
Does not support full SQL92 specification
No support for:
HAVING clause in SELECT
Correlated subqueries
Subqueries outside FROM clauses
Updateable or materialized views
Stored procedures

Hive Query Language (HiveQL)

Extensions
MySQL-like extensions
MapReduce extensions

Multi-table insert, MAP, REDUCE, TRANSFORM clauses

Data Types

Simple
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING

Complex
ARRAY, MAP, STRUCT

Hive Query Language (HiveQL)

Built-in Functions
SHOW FUNCTIONS
DESCRIBE FUNCTION

Hive - Tables
Managed (CREATE TABLE)
LOAD: File moved into Hive's data warehouse directory
DROP: Both metadata and data deleted

External (CREATE EXTERNAL TABLE)
LOAD: No files moved
DROP: Only metadata deleted

Use EXTERNAL when:
Sharing data between Hive and other Hadoop applications
You wish to use multiple schemas on the same data
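
The difference between the two table kinds can be simulated on a local filesystem. The sketch below mirrors the LOAD and DROP behaviour as this slide describes it, using temporary directories; the paths, the table name, and the function demo are hypothetical, and no Hive component is involved.

```python
import shutil
import tempfile
from pathlib import Path

def demo(external):
    """Return (file still at source after LOAD, data exists after DROP)."""
    root = Path(tempfile.mkdtemp())
    source = root / "incoming" / "data.csv"
    source.parent.mkdir()
    source.write_text("Kenya,125\n")
    warehouse = root / "warehouse" / "foreign_aid"  # hypothetical warehouse path

    if external:
        data_file = source                # LOAD: no files moved
    else:
        warehouse.mkdir(parents=True)
        data_file = warehouse / source.name
        shutil.move(str(source), str(data_file))    # LOAD: file moved in

    source_after_load = source.exists()

    # DROP: metadata always goes; data is deleted only for a managed table.
    if not external:
        shutil.rmtree(warehouse)
    data_after_drop = data_file.exists()

    shutil.rmtree(root)                   # clean up the temporary directory
    return source_after_load, data_after_drop

managed_result = demo(external=False)
external_result = demo(external=True)
```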

Hive - Partitioning
Can make some queries faster
Divide data based on partition column
Use PARTITIONED BY clause when creating table
Use PARTITION clause when loading data
SHOW PARTITIONS will show a table's partitions
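
Why partitioning can speed up some queries is easier to see with a small model. In this Python sketch each partition is a separate group of rows keyed by a partition value, and a query that filters on the partition column reads only the matching group. The layout, the column name dt, and the function scan are assumptions for illustration.

```python
# Hypothetical layout: one entry per partition value, like one directory each.
partitions = {
    "dt=2011-01-01": [("Kenya", 100), ("Peru", 40)],
    "dt=2011-01-02": [("Kenya", 25)],
}

def scan(partitions, dt=None):
    """Return (rows, partitions_read); a filter on dt skips other partitions."""
    wanted = [f"dt={dt}"] if dt is not None else list(partitions)
    rows = []
    read = 0
    for name in wanted:
        if name in partitions:
            rows.extend(partitions[name])
            read += 1
    return rows, read

rows, read = scan(partitions, dt="2011-01-02")   # touches one partition
all_rows, all_read = scan(partitions)            # touches every partition
```

Queries that do not filter on the partition column still read everything, which is why only some queries benefit.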

Hive - Bucketing
Can make some queries faster
Supports sampling data
Use CLUSTERED BY clause when creating table
For sorted buckets, use SORTED BY clause when creating table
To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
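
The link between bucketing and sampling can be sketched in Python: rows are assigned to a fixed number of buckets by hashing the clustering column, so reading a single bucket yields an approximate fraction of the data. The hash used here (CRC32) is an assumption chosen for determinism and is not Hive's hash function; the names bucket_of and build_buckets are hypothetical.

```python
import zlib

def bucket_of(value, num_buckets):
    """Stable hash of the clustering column, modulo the bucket count."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def build_buckets(rows, key_index, num_buckets):
    """Distribute rows into num_buckets lists by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_index], num_buckets)].append(row)
    return buckets

rows = [(user_id, f"event-{user_id}") for user_id in range(100)]
buckets = build_buckets(rows, key_index=0, num_buckets=4)
sample = buckets[0]   # one bucket out of four: roughly a quarter of the rows
```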

Hive Storage formats

Delimited Text (default)
ROW FORMAT DELIMITED

SerDe (Serializer/Deserializer)
ROW FORMAT SERDE serdename
e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'

Binary SerDe
Row-oriented (Sequence file)
STORED AS SEQUENCEFILE
Column-oriented (RCFile)
STORED AS RCFILE

Hive User-Defined Functions

Written in Java
Three UDF types:
UDF: input is a single row, output is a single row
UDAF: input is multiple rows, output is a single row
UDTF: input is a single row, output is multiple rows

Register UDF using ADD JAR
Create alias using CREATE TEMPORARY FUNCTION
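
The three function shapes differ only in how many rows go in and come out. The Python functions below illustrate those shapes; they are not Hive UDFs (which, as noted above, are written in Java), and the names upper_udf, total_udaf, and split_udtf are invented for this example.

```python
def upper_udf(value):
    """UDF shape: one row in, one row out."""
    return value.upper()

def total_udaf(values):
    """UDAF shape: many rows in, one row out."""
    return sum(values)

def split_udtf(value, sep=","):
    """UDTF shape: one row in, many rows out."""
    return value.split(sep)

print(upper_udf("peru"))
print(total_udaf([1, 2, 3]))
print(split_udtf("a,b,c"))
```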
Agenda

Overview
Pig
Hive
Jaql

Jaql
A JSON Query Language
JSON = JavaScript Object Notation

Running Jaql

Jaql Shell
Interactive: jaqlshell
Batch: jaqlshell -b myscript.jaql
Inline: jaqlshell -e jaqlstatement

Modes
Cluster: jaqlshell -c
Minicluster: jaqlshell

Jaql
Query Language

Sources and sinks
e.g. Copy data from a local file to a new file on HDFS
read(file('input.json')) -> write(hdfs('output'));
Here read(...) is the source and write(...) is the sink

Core Operators
Filter
Group
Tee
Transform
Join
Sort
Expand
Union
Top

Jaql
Query Language

Variables

= operator binds source output to a variable
e.g. $tweets = read(hdfs('twitterfeed'));

Pipes, streams, and consumers
Pipe operator (->) streams data to a consumer
Pipe expects array as input
e.g. $tweets -> filter $.from_src == 'tweetdeck';
$ is an implicit variable referencing the current array value
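
The pipe-and-consumer model can be approximated in Python to show how an array flows into a filter. This is a rough analogy only: the helpers pipe and jaql_filter are hypothetical, the lambda argument plays the role of Jaql's implicit $ variable, and the tweet records are made up.

```python
# Hypothetical array of records, standing in for the contents of $tweets.
tweets = [
    {"from_src": "tweetdeck", "text": "hello"},
    {"from_src": "web", "text": "hi"},
]

def jaql_filter(predicate):
    """Build a consumer that keeps array elements matching the predicate."""
    def consumer(array):
        return [item for item in array if predicate(item)]
    return consumer

def pipe(array, *consumers):
    """Feed an array through consumers left to right, like chained -> operators."""
    for consumer in consumers:
        array = consumer(array)
    return array

# Rough analogue of: $tweets -> filter $.from_src == 'tweetdeck';
result = pipe(tweets, jaql_filter(lambda t: t["from_src"] == "tweetdeck"))
```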

Jaql
Query Language

Categories of Built-in Functions
system, schema, core, xml, hadoop, regex, io, binary, array, date, index, nil, agg, number, string, function, random, record

Jaql
Data Storage

Data store examples
Amazon S3
DB2
HBase
HDFS
HTTP
JDBC
Local FS

Data format examples
JSON
CSV
XML

Thank you!
