Hive Intro

Apache Hive
An Introduction
Agenda
Overview
A quick look at Hive and its background.
Structure
A peek at the structure of Hive.
Language
How to write DDL and DML statements in Hive.
Hive at Yahoo
Working with Hive on Yahoos grids.
Advanced Features
Some more things you can do with Hive.
More Information
Where to look when you need more details or help.
Overview
Motivation for Hive

Companies are no longer dealing with gigabytes, but rather
terabytes
Large amount of data to analyze

Researchers want to study and understand the data
Business fold wants to slice and dice the data & metrics in
various ways
Every one impatient Give me answers now
Joining across large data sets is quite tricky
Motivation for Hive

Started in January 2007 at Facebook
Query data on Hadoop without having to write complex
MapReduce programs in Java each time

SQL chosen for familiarity and tools-support
An active open-source project since August 2008

Top-level Apache project (hive.apache.org)
Used in many companies; a diverse set of contributors
What Hive Is
A Hadoop-based system for managing and querying
structured data
Hive provides an view of your data as tables with rows

and columns
Uses HDFS for storing data
Provides a SQL-like interface for querying data
Uses MapReduce for executing queries
Scales well to handle massive data-sets
6
Example
SELECT COUNT(1) AS job_count, t.wait_time
FROM
(SELECT ROUND(wait_time/1000) AS wait_time, job_id
FROM starling_jobs
WHERE grid = MB
AND dt >= 2011_07_11
AND dt <= 2011_07_13) t

GROUP BY t.wait_time;
8 Simple steps
Login to grid gateway machine.
Create a hdfs file to store your hive
metadata,
Ex:hadoop fs -mkdir
/user/vmoorthy/warehouse
Go to hive shell by running hive
SET mapred.job.queue.name=unfunded;
job in the unfunded queue
-- to run your
8 Simple steps ()
Create a database specifying the location for meta data
store.
Ex:CREATE DATABASE autos LOCATION
'/user/vmoorthy/warehouse';
USE autos;
-- to work with previously created database
named 'autos
CREATE TABLE used_car(chromeTrimId INT,trimId INT,
usedCarCondition STRING, usedCarMileage INT,
usedCarPrice INT, chromeModelId INT, modelId INT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n' LOCATION
'/user/vmoorthy/usedCarTrim';
-- create a table for the tab separated hdfs file
named usedCarTrim
8 Simple steps ()
Now, you are ready to run select queries on the above
table.
Ex:-
SELECT * FROM used_car WHERE chrometrimid >
3030;
10
Structure
11
Architecture
Hive
JDBC
Command-line
Interface
Web
Interface
ODBC
Thrift
Server
Meta-store
Driver
(Compiler, Optimizer, Executor)
Hadoop
Database
12
Query Execution
Query
Parser
Logical Plan
Generator
Optimizer
Physical Plan
Generator
MapReduce Job(s)
Executor
13
Storage
Table metadata is stored in meta-store
Directories for databases, tables and partitions
Files for table-data

<warehouse-directory>
<database-directory>
<table-directory>
<partition-directory>
<data-file1>
<data-file2>
[]
<data-filen>
14
Language
15
Data Model
Database a namespace for tables and other units of data
(default if none specified)
Table a row-based store for data in a database; each row

having one or more columns
Partition a key-based separation of data in a table for
reducing the amount of data scanned (Optional)
Bucket cluster of data in a partition based on a hashing a
column-value (Optional)
16
Primitive Data-types
Integers TINYINT (1 byte), SMALLINT (2 bytes), INT (4
bytes), BIGINT (8 bytes)
Boolean BOOLEAN (TRUE / FALSE)

Floating-Point FLOAT, DOUBLE
String STRING
Implicit and explicit casting supported
17
Complex Data-types
Arrays a list of elements of the same data-type accessible
using an index. A[n] denotes the element at index
n(starts from zero) in array A

Structs a record with named elements. foo.bar denotes
the field bar in the struct foo
Maps maintains mappings from keys to respective values.
M[foo] denotes the value for foo in the map M
Collections can be nested arbitrarily
18
Operators
Relational =, !=, <, <=, etc. as well as IS
NULL, IS NOT NULL, LIKE, etc. Generate TRUE
or FALSE based on comparison

Arithmetic +, -, *, /, etc. Generate number
based on the result of the arithmetic operation
Logical AND, OR, NOT, etc. Generate TRUE or
FALSE based on combining Boolean expressions
19
Built-in Fuctions
Mathematical round(), floor(), rand(), etc.
String concat(), substr(), regexp_replace(),
etc.
Time to_date(), from_unixtime(), year(),
month(), etc.
Aggregates count(), sum(), min(), max(),
avg()
and quite a lot more
20
Creating a Table
CREATE TABLE employees(name STRING, age INT);
or
CREATE TABLE IF NOT EXISTS employees(name STRING, age
INT);
or
CREATE TABLE employees(name STRING, age INT)
PARTITIONED BY (join_dt STRING);
or
STORED AS SequenceFile;
etc.
21
Loading Data
LOAD DATA INPATH '/foo/bar/snafu.txt'
INTO TABLE employees;
or
LOAD DATA LOCAL INPATH '/homes/wombat/emp_2011-12-01.txt'
INTO TABLE employees
PARTITION (join_dt=2011_12_01);
or
INSERT OVERWRITE TABLE employees
SELECT name, age FROM all_employees
WHERE location = 'Bangalore';
22
Querying Data
SELECT * FROM employees;
or
SELECT * FROM employees LIMIT 10;
or
SELECT name, age FROM employees
WHERE age > 30;
or
SET hive.exec.compress.output=false;
SET hive.cli.print.header=true;
INSERT OVERWRITE LOCAL DIRECTORY /homes/wombat/blr
SELECT * FROM all_employees
WHERE location = Bangalore;
etc.
23
External Tables
Data not managed by Hive
Useful when data is already processed and in a usable state
Manually clean up after dropping tables/partitions
CREATE EXTERNAL TABLE foo(name string, age int)
LOCATION /user/bar/wombat;
24
Altering a Table
ALTER TABLE employees RENAME TO blr_employees;
ALTER TABLE employees

REPLACE COLUMNS (emp_name STRING, emp_age INT);
ALTER TABLE employees ADD COLUMNS (emp_id STRING);
ALTER TABLE all_employees

DROP PARTITION (location=Slackville);
25
Databases
CREATE DATABASE foo;
or
CREATE DATABASE IF NOT EXISTS foo;
or
CREATE DATABASE foo LOCATION /snafu/wombat;
USE foo;
SELECT * FROM bar LIMIT 10;
or
SELECT * FROM foo.bar LIMIT 10;
DROP DATABASE foo;
or
DROP DATABASE IF EXISTS foo;
26
Other Operations
SHOW TABLES;
SHOW PARTITIONS all_employees;
SHOW PARTITIONS all_employees
PARTITION (location=Bangalore);
DESCRIBE employees;
DROP TABLE employees;
or
DROP TABLE IF EXISTS employees;
27
Joins
SELECT e.name, d.dept_name
FROM departments d JOIN all_employees e
ON (e.dept_id = d.dept_id);
or
SELECT e.name, d.dept_name

FROM departments d
LEFT OUTER JOIN all_employees e
ON (e.dept_id = d.dept_id);
28
Ordering of Data
ORDER BY global ordering of results based on the
selected columns
SORT BY local ordering of results on each reducer

based on the selected columns
29
File-formats
TextFile plain-text files; fields delimited with Â by
default
SequenceFile serialized objects, possiblycompressed

RCFile columnar storage of serialized objects,
possibly-compressed
30
TextFile Delimiters
Default field-separator is Â; row-separator is \n
John DoeÂ36\n
Jane DoeÂ33\n
Default list-separator is ^B; value-separator is ^C
John DoeÂdept^Cfinance^Bemp_id^C2357\n
ROW FORMAT DELIMITED

FIELDS TERMINATED by '\t';
31
Buckets
Distribute partition-data into files based on columns
Improves performance for filters with these columns
Works best when data is uniformly distributed
CLUSTERED BY (name) INTO 31 BUCKETS;
32
Compressed Storage
Saves space and generally improves performance
Direct support for reading compressed files
LOAD DATA LOCAL INPATH /foo/bar/emp_data.bz2
INTO TABLE all_employees;
Compressed TextFile cannot usually be split
SequenceFile or RCFile recommended instead
33
Tips
Judicious use of partitions and buckets can drastically
improve the performance of your queries
Put always-used Hive CLI commands in $HOME/.hiverc

(e.g. SET mapred.job.queue.name=unfunded;)
Use EXPLAIN to analyze a query before executing it
Use RCFile with compression to save storage and to
improve performance
34
Hive at Yahoo
35
Specifics
Hive CLI available as /home/y/bin/hive on gateways of
supported grids
Mandatory LOCATION clause in CREATE TABLE

Must specify MapReduce queue for submitted Jobs
(e.g. SET mapred.job.queue.name=unfunded;)

No JDBC / ODBC support
Integrated with HCatalog
36
Advanced Features
37
User-defined Functions
Many very useful built-in functions
SHOW FUNCTIONS;
DESCRIBE FUNCTION foo;
Extensible using user-defined functions

User-defined Function (UDF) for one-to-one mapping
E.g. round(), concat(), unix_timestamp(), etc.
User-defined Aggregate Function (UDAF) for many-to-one

mapping
E.g. sum(), avg(), stddev(), etc.
User-defined Table-generating Function (UDTF) for one-to-
many mapping
E.g. explode(), etc.
38
Custom UDF
package com.yahoo.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.Text;
@Description(
name = "toupper",
value = "_FUNC_(str) - Converts a string to uppercase",
extended = "Example:\n" +
" > SELECT toupper(author_name) FROM authors a;\n" +
" STEPHEN KING"

)
39
Custom UDF ()
public class ToUpper extends UDF {
public Text evaluate(Text s) {

Text to_value = new Text("");
if (s != null) {
try {
to_value.set(s.toString().toUpperCase());
} catch (Exception e) { // Should never happen
to_value = new Text(s);

}
}
return to_value;
}
}
40
UDF Usage
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION TO_UPPER AS
com.yahoo.hive.udf.ToUpper';
SELECT TO_UPPER(src.value) FROM src;
DROP TEMPORARY FUNCTION TO_UPPER;
41
Overloaded UDF
public class UDFExampleAdd extends UDF {
public Integer evaluate(Integer a, Integer b) {
if (a == null || b == null) return null;
return a + b;
}
public Double evaluate(Double a, Double b) {
if (a == null || b == null) return null;

return a + b;
}
42
Overloaded UDF
add jar build/contrib/hive_contrib.jar;
CREATE TEMPORARY FUNCTION example_add AS
'org.apache.hadoop.hive.contrib.udf.UDFExampleAdd';
SELECT example_add(1, 2) FROM src;
SELECT example_add(1.1, 2.2) FROM src;
43
UDAF Example
SELECT page_url, count(1), count(DISTINCT user_id)
FROM mylog;
public class UDAFCount extends UDAF {
public static class Evaluator implements UDAFEvaluator {
private int mCount;
public void init() {mcount = 0;}
public boolean iterate(Object o) {
if (o!=null) mCount++; return true;}
public Integer terminatePartial() {return mCount;}
public boolean merge(Integer o) {mCount += o; return true;}
public Integer terminate() {return mCount;}
}
44
Overloaded UDAF
public class UDAFSum extends UDAF {
public static class IntEvaluator implements UDAFEvaluator
{
private int mSum;

public void init() {mSum = 0;}
public boolean iterate(Integer o)
{mSum += o; return true;}
public Integer terminatePartial() {return mSum;}
public boolean merge(Integer o) {mSum += o; return
true;}
public Integer terminate() {return mSum;}

}
45
Overloaded UDAF
public static class DblEvaluator implements UDAFEvaluator {
private double mSum;
public void init() {mSum = 0;}
public boolean iterate(Double o)

public Double terminatePartial() {return mSum;}
public boolean merge(Double o)

public Double terminate() {return mSum;}
}
}
46
What Hive Is Not

Not suitable for small data-sets
Does not provide real-time results
Does not support row-level updates
Imposes a schema on the data
Does not support transactions
Does not need expensive server-class hardware,
RDBMS licenses or god-like DBAs to scale
47
More Information
48
External References
Hive home-page: hive.apache.org
Hive wiki: cwiki.apache.org/confluence/display/Hive
Hive tutorial: cwiki.apache.org/confluence/display/Hive/Tutorial
Hive language manual:
cwiki.apache.org/confluence/display/Hive/LanguageManual
Mailing-list: user@hive.apache.org
49
Internal References
Hive at Yahoo: wiki.corp.yahoo.com/view/Grid/Hive
Hive FAQ: wiki.corp.yahoo.com/view/Grid/HiveFAQ
Troubleshooting:
wiki.corp.yahoo.com/view/Grid/HiveTroubleShooting
Internal mailing-list: hive-users@yahoo-inc.com

Hive CLI yinst package: hive_cli
Installation instructions:
wiki.corp.yahoo.com/view/Grid/HiveInstallation
50
Questions?
51

Hive Intro

Uploaded by

Copyright:

Available Formats

You might also like

Hive Intro

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Hive Intro

Uploaded by

Copyright:

Available Formats

Apache Hive

Motivation for Hive

Large amount of data to analyze

Joining across large data sets is quite tricky

Motivation for Hive

MapReduce programs in Java each time

An active open-source project since August 2008

Used in many companies; a diverse set of contributors

Hive provides an view of your data as tables with rows

AND dt <= 2011_07_13) t

SELECT * FROM used_car WHERE chrometrimid >

Files for table-data

Table a row-based store for data in a database; each row

Boolean BOOLEAN (TRUE / FALSE)

n(starts from zero) in array A

Collections can be nested arbitrarily

or FALSE based on comparison

and quite a lot more

WHERE location = 'Bangalore';

Manually clean up after dropping tables/partitions

CREATE EXTERNAL TABLE foo(name string, age int)

ALTER TABLE employees

ALTER TABLE employees ADD COLUMNS (emp_id STRING);

ALTER TABLE all_employees

SELECT e.name, d.dept_name

SORT BY local ordering of results on each reducer

SequenceFile serialized objects, possiblycompressed

CREATE TABLE employees(name STRING, age INT)

ROW FORMAT DELIMITED

Compressed TextFile cannot usually be split

SequenceFile or RCFile recommended instead

Put always-used Hive CLI commands in $HOME/.hiverc

Mandatory LOCATION clause in CREATE TABLE

(e.g. SET mapred.job.queue.name=unfunded;)

Extensible using user-defined functions

User-defined Aggregate Function (UDAF) for many-to-one

User-defined Table-generating Function (UDTF) for one-to-

" STEPHEN KING"

public Text evaluate(Text s) {

to_value = new Text(s);

if (a == null || b == null) return null;

private int mSum;

public Integer terminate() {return mSum;}

public boolean iterate(Double o)

public boolean merge(Double o)

What Hive Is Not

RDBMS licenses or god-like DBAs to scale

Internal mailing-list: hive-users@yahoo-inc.com

You might also like