Good Hadoop Hive Sproehnle Bbuzz2010

Hive: SQL for Hadoop
Sarah Sproehnle
Educational Services, Cloudera Inc.
Copyright 2010 Cloudera - Do not distribute
What is Hive?
Subproject of Apache Hadoop
Originally created at Facebook
Provides an easier and faster way to analyze data in
Hadoop (compared to MapReduce in Java, Python, C++,
etc)
The problem with MapReduce

The developer has to worry about a lot of things besides
the analysis/processing logic
Often requires several MapReduce passes to accomplish
a final
The data is schema-less

Would be more convenient to use constructs such as
"filter", "join", "aggregate"
Copyright 2010 Cloudera - Do not
distribute
What does Hive provide?

A parser/optimizer/compiler that translates HiveQL to
MapReduce code
A Metastore (usually a MySQL server) that stores the
"schema" information
Table name, column names, data types
Partition information
Storage location, row format (SerDe), storage format (Input
and OutputFormat)
Despite SQL-like language, Hive is NOT an RDBMS
Language
RDBMS
Hive
SQL-92 standard (maybe)
Subset of SQL-92 plus Hivespecific extension
Update Capabilities INSERT, UPDATE, and DELETE
INSERT OVERWRITE;
no UPDATE or DELETE
Transactions
Yes
No
Latency
Sub-second
Minutes or more
Indexes
Any number of indexes, very

important for performance
No indexes, data is always

scanned (in parallel)
Data size
TBs
PBs
Getting started is easy!

1. Setup Hadoop (NameNode, DataNodes, JobTracker
and TaskTrackers)
2. Create Hive tables
3. Load data into Hive
4. SELECT data
Creating a table in Hive

CREATE TABLE tablename
(col1 INT, col2 STRING)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t'
Describes the format

of the file
STORED AS TEXTFILE;
Load data into the table

LOAD DATA LOCAL INPATH
'/myhd/file_or_dir'
INTO TABLE <tablename>
Copies data from the client's

filesystem to HDFS
LOAD DATA INPATH

'/hdfs/file_or_dir'
INTO TABLE <tablename>
Moves data to
/user/hive/warehouse
File format should match the CREATE TABLE definition!
Schema on read, not write

Data is not checked during the load
Loads are very fast
Parsing errors would be found at query time

Possible to have multiple schemas for the same data
(using EXTERNAL tables)

distribute
Sqoop: SQL to Hadoop
mysql
sqoop!
Hadoop
Custom
MapReduce
programs
reinterpret data
...Via auto-generated
datatype definitions
$ sqoop --connect jdbc:mysql://foo.com/corp \

--table employees \
--hive-import \
--fields-terminated-by '\t' \
--lines-terminated-by '\n'
10
Query the data with SELECT

Similar to SQL:
SELECT..
FROM..
WHERE..
GROUP BY..
ORDER BY..
Inner join, outer join, full outer join
Subqueries (in the FROM clause)
Built-in functions such as round, concat, substr,
max, min, sum, count
11
Hive converts to a series of MapReduce phases
WHERE => map

GROUP BY/ORDER BY => reduce
JOIN => map or reduce depending on optimizer
Example:
SELECT * FROM purchases
WHERE cost > 40
ORDER BY order_date DESC;
Single MapReduce required:

WHERE clause translates to a map
Mapper outputs order_date as key
Single reducer collects sorted rows
12
EXPLAIN - the "map" (slide 1/3)

EXPLAIN SELECT * FROM purchases
WHERE cost > 40
ORDER BY order_date DESC;

STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
purchases
TableScan
alias: purchases
Filter Operator
predicate:
expr: (cost > UDFToDouble(40))
type: boolean
distribute
13
EXPLAIN - the "shuffle and sort" (slide 2/3)

Select Operator
expressions:
expr: custid
type: int
expr: order_date
type: string
expr: cost
type: double
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
key expressions:
expr: _col1
type: string

distribute
14
EXPLAIN - the "reduce" (slide 3/3)

value expressions:
expr: _col0
type: int
expr: _col1
Output of reducers
type: string
expr: _col2
type: double
Reduce Operator Tree:
Extract
Identity
File Output Operator
Reducer
distribute
15
Optimizations
Some operations use direct HDFS access
SELECT * FROM table LIMIT 10;
Number of MapReduce phases is minimized if possible

Map-side join via MAPJOIN hint
SELECT /*+ MAPJOIN(t1) */ t1.col, t2.col
FROM t1 JOIN t2
ON (t1.col = t2.col)
16
Hive extension: multi-table insert

FROM(
SELECT username, accessdate
FROM logs WHERE url LIKE '%.cloudera.com'
) clicks
INSERT OVERWRITE DIRECTORY 'count'
SELECT count(1)
INSERT OVERWRITE DIRECTORY 'list_users'

SELECT DISTINCT clicks.username;

distribute
17
Invoking custom map script

ADD FILE /tmp/map.py;
INSERT OVERWRITE TABLE results_table
SELECT transform(logdata.*)
USING'./map.py' as (output)
Map.py receives the log
FROM
records as tab-separated list
(SELECT *
and returns output
FROM logs) logdata;

distribute
18
Partitioning and bucketing

Divide data into subsets of rows
Benefits:
Better performance
Easier data management (add or delete a portion of a table)
Sampling of data
19
Partitioning
CREATE TABLE logs (url STRING, user STRING)
PARTITIONED BY (d STRING);
LOAD DATA LOCAL INPATH

'/tmp/new_logs.txt'
INTO TABLE logs

PARTITION (d='2010-04-01');
20
Bucketing
CREATE TABLE tablename (columns)
CLUSTERED BY (col) INTO N BUCKETS;
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE target
SELECT * FROM helper;

distribute
21
Sampling
Reading a subset of the data is very efficient if the table
is bucketed
SELECT * FROM tablename
TABLESAMPLE (BUCKET 1 OUT OF 4 ON col)

distribute
22
Summary of Hive
Enables easy analysis of data without a lot of setup
Adds features that Hadoop lacks via the metastore
Quite full-featured with lots of development work
ongoing
23
Thanks!
Sarah Sproehnle
Educational Services, Cloudera Inc.

Good Hadoop Hive Sproehnle Bbuzz2010

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Good Hadoop Hive Sproehnle Bbuzz2010

Uploaded by

Copyright:

Available Formats

Hive: SQL for Hadoop

Copyright 2010 Cloudera - Do not distribute

Copyright 2010 Cloudera - Do not distribute

The problem with MapReduce

The data is schema-less

What does Hive provide?

Copyright 2010 Cloudera - Do not distribute

Despite SQL-like language, Hive is NOT an RDBMS

SQL-92 standard (maybe)

Subset of SQL-92 plus Hivespecific extension

Update Capabilities INSERT, UPDATE, and DELETE

Any number of indexes, very

No indexes, data is always

Copyright 2010 Cloudera - Do not distribute

Getting started is easy!

Copyright 2010 Cloudera - Do not distribute

Creating a table in Hive

Describes the format

Copyright 2010 Cloudera - Do not distribute

Load data into the table

Copies data from the client's

LOAD DATA INPATH

File format should match the CREATE TABLE definition!

Copyright 2010 Cloudera - Do not distribute

Schema on read, not write

Parsing errors would be found at query time

Copyright 2010 Cloudera - Do not

Sqoop: SQL to Hadoop

$ sqoop --connect jdbc:mysql://foo.com/corp \

Query the data with SELECT

Copyright 2010 Cloudera - Do not distribute

Hive converts to a series of MapReduce phases

WHERE => map

Single MapReduce required:

EXPLAIN - the "map" (slide 1/3)

EXPLAIN - the "shuffle and sort" (slide 2/3)

Copyright 2010 Cloudera - Do not

EXPLAIN - the "reduce" (slide 3/3)

Number of MapReduce phases is minimized if possible

Copyright 2010 Cloudera - Do not distribute

Hive extension: multi-table insert

INSERT OVERWRITE DIRECTORY 'list_users'

Copyright 2010 Cloudera - Do not

Invoking custom map script

Copyright 2010 Cloudera - Do not

Partitioning and bucketing

Copyright 2010 Cloudera - Do not distribute

LOAD DATA LOCAL INPATH

INTO TABLE logs

Copyright 2010 Cloudera - Do not distribute

SELECT * FROM helper;

Copyright 2010 Cloudera - Do not

TABLESAMPLE (BUCKET 1 OUT OF 4 ON col)

Copyright 2010 Cloudera - Do not

Copyright 2010 Cloudera - Do not distribute

Copyright 2010 Cloudera - Do not distribute

You might also like