Agenda
Overview
Pig
Hive
Jaql
                                 Pig                   Hive                       Jaql
Developed by                     Yahoo!                Facebook                   IBM
Language name                    Pig Latin             HiveQL                     Jaql
Type of language                 Data flow             Declarative (SQL dialect)  Data flow
Data structures it operates on   Complex, nested                                  JSON
Schema optional?                 Yes                   No                         Yes
Relational complete?             Yes                   Yes                        Yes
Turing complete?                 Yes when extended     Yes when extended          Yes
                                 with Java UDFs        with Java UDFs
Pig components
Two components
Pig Latin (language)
Compiler
Execution environment
Local: pig -x local
Distributed: pig -x mapreduce, or simply pig
Running Pig
Script: pig scriptfile.pig
Embedded: call into Pig from Java
Sample code
# pig
grunt> records = LOAD 'econ_assist.csv'
           USING PigStorage(',')
           AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
           GENERATE group, SUM(records.sum);
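To make the result of the script concrete, here is a minimal Python sketch of what the GROUP and FOREACH ... SUM steps compute. It is only an illustration of the semantics, not how Pig executes the job. The rows are made-up stand-ins for econ_assist.csv, and group_and_sum is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up rows standing in for econ_assist.csv: (country, sum)
records = [
    ("Kenya", 100),
    ("Peru", 40),
    ("Kenya", 25),
]


def group_and_sum(rows):
    """Mimic GROUP records BY country followed by SUM(records.sum)."""
    totals = defaultdict(int)
    for country, amount in rows:
        totals[country] += amount  # add each row to its country's total
    return dict(totals)


thesum = group_and_sum(records)
```

Each key in thesum plays the role of the group field, and each value is the summed sum column for that country.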
Pig Latin
Statements, operations and commands
UDF statements: REGISTER, DEFINE
Commands
Diagnostic operators
alias
(1,1987) tuple
Field
Cast
Boolean
Flatten
Projection
Arithmetic
Comparison
float
double
bytearray
chararray
Complex types:
Tuple
Bag
Map
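The three complex types can be pictured with ordinary Python containers. This is a rough analogy only: the values are made up, and group_by_first_field is a hypothetical helper that shows the (group, bag) shape a GROUP statement produces.

```python
# Tuple: an ordered set of fields, as in the slide's (1,1987)
tup = (1, 1987)

# Bag: a collection of tuples; duplicates are allowed
bag = [(1, 1987), (2, 1990), (1, 1987)]

# Map: a set of key/value pairs with chararray (string) keys
mapping = {"country": "Kenya"}


def group_by_first_field(tuples):
    """Return (group, bag) tuples keyed on the first field of each tuple."""
    groups = {}
    for item in tuples:
        groups.setdefault(item[0], []).append(item)
    return [(key, members) for key, members in sorted(groups.items())]


grouped = group_by_first_field(bag)
```

Every element of grouped is itself a tuple whose second field is a bag, which is the nesting that the "Complex types" list refers to.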
Eval
Input: One or more expressions
Output: An expression
Example: MAX
Filter
Input: Bag or map
Output: boolean
Example: IsEmpty
Load
Input: Data from external storage
Output: A relation
Example: PigStorage
Store
Input: A relation
Output: Data to external storage
Example: PigStorage
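The four function kinds differ mainly in their inputs and outputs. The Python sketch below mirrors those shapes with in-memory streams. All function names are hypothetical, and real Pig functions are not written this way; the point is only what goes in and what comes out.

```python
import io


def load_delimited(stream, delimiter=","):
    """Load function: data from external storage in, a relation out."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in stream if line.strip()]


def store_delimited(relation, stream, delimiter=","):
    """Store function: a relation in, data to external storage out."""
    for row in relation:
        stream.write(delimiter.join(str(field) for field in row) + "\n")


def eval_max(values):
    """Eval function: one or more expressions in, one expression out (like MAX)."""
    return max(values)


def is_empty(bag_or_map):
    """Filter function: a bag or map in, a boolean out (like IsEmpty)."""
    return len(bag_or_map) == 0


source = io.StringIO("a,3\nb,7\n")        # stands in for a file on storage
relation = load_delimited(source)
largest = eval_max(int(value) for _, value in relation)
sink = io.StringIO()
store_delimited(relation, sink)
```

PigStorage appears as the example for both Load and Store because it can read and write delimited text, which is what the two helper functions imitate here.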
Hive - Configuration
Three ways to configure Hive:
hive-site.xml
- fs.default.name
- mapred.job.tracker
- Metastore configuration settings
hive --hiveconf
SET command in the Hive shell
Running Hive
Hive Shell
Interactive
hive
Script
hive -f myscript
Inline
hive -e 'SELECT * FROM mytable'
Hive services
hive --service servicename
where servicename can be:
hiveserver: server for Thrift, JDBC, ODBC clients
hwi: web interface
jar: hadoop jar with Hive jars in classpath
metastore: out-of-process metastore
Sample code
# hive
hive> CREATE TABLE foreign_aid
          (country STRING, sum BIGINT)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY ','
          STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
          OVERWRITE INTO TABLE foreign_aid;
Hive - Metastore
Hive Schema-On-Read
Extensions
MySQL-like extensions
MapReduce extensions
Data Types
Simple
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING
Complex
ARRAY, MAP, STRUCT
Built-in Functions
SHOW FUNCTIONS
DESCRIBE FUNCTION
Hive - Tables
Managed: CREATE TABLE
Hive - Partitioning
Can make queries that filter on the partition column faster
Divides data based on the partition column
Use the PARTITIONED BY clause when creating the table
Use the PARTITION clause when loading data
SHOW PARTITIONS lists a table's partitions
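The speed-up comes from reading only the partitions a query needs. The sketch below is a toy Python model of that idea, not Hive's implementation. PartitionedTable, its methods, the year column and the rows are all hypothetical.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model of a partitioned table: one list of rows per partition value."""

    def __init__(self, partition_column):
        self.partition_column = partition_column
        self.partitions = defaultdict(list)  # partition value -> rows
        self.rows_scanned = 0                # rows read by select() so far

    def load(self, rows, partition_value):
        """Rough analogue of LOAD DATA ... PARTITION (column=value)."""
        self.partitions[partition_value].extend(rows)

    def show_partitions(self):
        """Rough analogue of SHOW PARTITIONS."""
        return sorted(f"{self.partition_column}={value}" for value in self.partitions)

    def select(self, partition_value=None):
        """Read one partition if a value is given, otherwise read all of them."""
        if partition_value is None:
            chosen = list(self.partitions.values())
        else:
            chosen = [self.partitions.get(partition_value, [])]
        result = [row for rows in chosen for row in rows]
        self.rows_scanned += len(result)
        return result


table = PartitionedTable("year")
table.load([("Kenya", 100), ("Peru", 40)], partition_value=2009)
table.load([("Kenya", 25)], partition_value=2010)
rows_2010 = table.select(partition_value=2010)
```

After the last line, table.rows_scanned is 1 rather than 3, because the rows in the 2009 partition were never touched.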
Hive - Bucketing
Can make some queries faster (for example, joins on the bucketed column)
Supports sampling data
Use the CLUSTERED BY clause when creating the table
For sorted buckets, also use the SORTED BY clause when creating the table
To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
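Bucketing assigns each row to a bucket by hashing the clustering column and taking the result modulo the number of buckets. The Python sketch below illustrates that idea under stated assumptions: zlib.crc32 stands in for Hive's own hash function, and all function names and rows are hypothetical.

```python
import zlib


def bucket_for(value, num_buckets):
    """Pick a bucket index: hash of the value modulo the bucket count."""
    # crc32 is an assumption used for a stable hash; Hive uses its own function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def cluster_into_buckets(rows, key_index, num_buckets):
    """Distribute rows into buckets based on the column at key_index."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


def sample_bucket(buckets, bucket_number):
    """Return one bucket; bucket_number is 1-based, as in TABLESAMPLE(BUCKET x OUT OF y)."""
    return buckets[bucket_number - 1]


rows = [("Kenya", 100), ("Peru", 40), ("Kenya", 25), ("Laos", 70)]
buckets = cluster_into_buckets(rows, key_index=0, num_buckets=4)
sample = sample_bucket(buckets, 1)
```

Rows with the same key always land in the same bucket, which is why a sample can be taken by reading a single bucket instead of scanning the whole table.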
SerDe: Serializer/Deserializer
Binary SerDe
Column-oriented (RCFile)
STORED AS RCFILE
Written in Java
Three UDF types:
UDF (user-defined function)
Input: single row, output: single row
UDAF (user-defined aggregate function)
Input: multiple rows, output: single row
UDTF (user-defined table-generating function)
Input: single row, output: multiple rows
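The three types differ only in how many rows go in and come out. Hive UDFs are written in Java, as noted above, so the Python sketch below is just an illustration of the three input/output shapes. The function names are hypothetical.

```python
def to_upper(value):
    """UDF shape: one row in, one row out."""
    return value.upper()


def total(values):
    """UDAF shape: multiple rows in, one row out."""
    result = 0
    for value in values:
        result += value
    return result


def explode(values):
    """UDTF shape: one row (holding a collection) in, multiple rows out."""
    for value in values:
        yield (value,)


upper_row = to_upper("kenya")
sum_row = total([100, 25])
exploded_rows = list(explode(["a", "b"]))
```

Here upper_row and sum_row are each a single value, while exploded_rows holds one output row per element of the input collection.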
Jaql
A JSON Query Language
JSON = JavaScript Object Notation
Running Jaql
Jaql Shell
Interactive
jaqlshell
Batch
jaqlshell -b myscript.jaql
Inline
jaqlshell -e jaqlstatement
Modes
Cluster
jaqlshell -c
Minicluster
jaqlshell
Jaql - Query Language
source
sink
Core Operators
Filter
Group
Tee
Transform
Join
Sort
Expand
Union
Top
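A query reads records from a source, passes them through a chain of operators, and writes the result to a sink. The Python sketch below imitates three of the operators listed above (Filter, Transform, Top) with generators. It is an analogy, not Jaql syntax; the function names and the sample records are hypothetical.

```python
from itertools import islice


def filter_op(records, predicate):
    """Filter: keep only the records for which the predicate holds."""
    return (record for record in records if predicate(record))


def transform_op(records, fn):
    """Transform: map every record to a new record."""
    return (fn(record) for record in records)


def top_op(records, n):
    """Top: keep the first n records."""
    return islice(records, n)


# Source: made-up JSON-like records
source = [
    {"country": "Kenya", "sum": 100},
    {"country": "Peru", "sum": 40},
    {"country": "Laos", "sum": 70},
]

# Chain the operators: source, then filter, then transform, then top
pipeline = top_op(
    transform_op(
        filter_op(source, lambda r: r["sum"] > 50),
        lambda r: {"country": r["country"]},
    ),
    1,
)

# Sink: collect the output records
sink = list(pipeline)
```

Because each operator takes a stream of records and returns a stream of records, operators can be chained in any order between the source and the sink.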
Jaql - Query Language
Variables
Jaql - Query Language
agg
number
string
function
random
record
Jaql - Data Storage
Amazon S3
DB2
HBase
HTTP
JDBC
Local FS
CSV
XML
HDFS
Thank you!