Hive Intro
An Introduction
Agenda
Overview
A quick look at Hive and its background.
Structure
A peek at the structure of Hive.
Language
How to write DDL and DML statements in Hive.
Hive at Yahoo
Working with Hive on Yahoo's grids.
Advanced Features
Some more things you can do with Hive.
More Information
Where to look when you need more details or help.
Overview
What Hive Is
A Hadoop-based system for managing and querying
structured data
Example
SELECT COUNT(1) AS job_count, t.wait_time
FROM
(SELECT ROUND(wait_time/1000) AS wait_time, job_id
FROM starling_jobs
WHERE grid = 'MB'
AND dt >= '2011_07_11'
8 Simple steps
Log in to the grid gateway machine.
Create an HDFS directory to store your Hive metadata.
Ex: hadoop fs -mkdir /user/vmoorthy/warehouse
Start the Hive shell by running hive.
SET mapred.job.queue.name=unfunded;
-- to run your job in the unfunded queue
8 Simple steps (continued)
Create a database, specifying the location for the metadata store.
Ex: CREATE DATABASE autos LOCATION '/user/vmoorthy/warehouse';
USE autos;
-- to work with the previously created database named 'autos'
CREATE TABLE used_car(chromeTrimId INT, trimId INT,
usedCarCondition STRING, usedCarMileage INT,
usedCarPrice INT, chromeModelId INT, modelId INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/user/vmoorthy/usedCarTrim';
-- create a table for the tab-separated HDFS file named usedCarTrim
8 Simple steps (continued)
Now you are ready to run SELECT queries on the above table.
Ex:-
3030;
Structure
Architecture
[Diagram: Hive architecture. Components shown: Command-line Interface, Web Interface, JDBC, ODBC, Thrift Server, Driver (Compiler, Optimizer, Executor), Meta-store, Database, Hadoop.]
Query Execution
Query -> Parser -> Logical Plan Generator -> Optimizer -> Physical Plan Generator -> Executor -> MapReduce Job(s)
Storage
Table metadata is stored in the meta-store
Directories for databases, tables and partitions
<partition-directory>
    <data-file1>
    <data-file2>
    [...]
    <data-filen>
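The directory layout above can be sketched as a small path builder. This is only an illustration: it assumes the usual Hive convention of naming each partition directory `column=value`, and the class name, table directory, and partition value are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the HDFS directory for one partition of a table.
class PartitionPath {
    // tableDir is the table's directory; partitionSpec holds the partition
    // columns and their values, in declaration order.
    static String of(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            // Each partition column adds one directory level named column=value.
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("join_dt", "2011_12_01");
        // Prints /user/wombat/warehouse/employees/join_dt=2011_12_01
        System.out.println(of("/user/wombat/warehouse/employees", spec));
    }
}
```

The data files of that partition would then sit directly inside the returned directory.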
Language
Data Model
Database: a namespace for tables and other units of data ('default' if none specified)
Primitive Data-types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
String: STRING
Implicit and explicit casting supported
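To make the integer widths concrete, the sketch below computes the value range implied by each byte width, assuming signed two's-complement integers. The class and method names are illustrative and are not part of Hive.

```java
// Illustrative only: value ranges implied by the byte widths listed above.
class IntegerRanges {
    // Largest value of a signed two's-complement integer of the given width.
    static long maxValue(int bytes) {
        return bytes == 8 ? Long.MAX_VALUE : (1L << (8 * bytes - 1)) - 1;
    }

    // Smallest value of a signed two's-complement integer of the given width.
    static long minValue(int bytes) {
        return -maxValue(bytes) - 1;
    }

    public static void main(String[] args) {
        String[] names = {"TINYINT", "SMALLINT", "INT", "BIGINT"};
        int[] widths = {1, 2, 4, 8};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-8s %d byte(s): %d .. %d%n",
                    names[i], widths[i], minValue(widths[i]), maxValue(widths[i]));
        }
    }
}
```

A value outside a column's range would need a wider type, which is where casting comes in.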
Complex Data-types
Arrays: a list of elements of the same data-type, accessible using an index. A[n] denotes the element at index n
Operators
Relational: =, !=, <, <=, etc., as well as IS NULL, IS NOT NULL, LIKE, etc. Generate TRUE
Built-in Functions
Mathematical: round(), floor(), rand(), etc.
String: concat(), substr(), regexp_replace(), etc.
Time: to_date(), from_unixtime(), year(), month(), etc.
Aggregates: count(), sum(), min(), max(), avg()
Creating a Table
CREATE TABLE employees(name STRING, age INT);
or
CREATE TABLE IF NOT EXISTS employees(name STRING, age INT);
or
CREATE TABLE employees(name STRING, age INT)
PARTITIONED BY (join_dt STRING);
or
CREATE TABLE employees(name STRING, age INT)
STORED AS SequenceFile;
etc.
Loading Data
LOAD DATA INPATH '/foo/bar/snafu.txt'
INTO TABLE employees;
or
LOAD DATA LOCAL INPATH '/homes/wombat/emp_2011-12-01.txt'
INTO TABLE employees
PARTITION (join_dt='2011_12_01');
or
INSERT OVERWRITE TABLE employees
SELECT name, age FROM all_employees
Querying Data
SELECT * FROM employees;
or
SELECT * FROM employees LIMIT 10;
or
SELECT name, age FROM employees
WHERE age > 30;
or
SET hive.exec.compress.output=false;
SET hive.cli.print.header=true;
INSERT OVERWRITE LOCAL DIRECTORY '/homes/wombat/blr'
SELECT * FROM all_employees
WHERE location = 'Bangalore';
etc.
External Tables
Data not managed by Hive
Useful when data is already processed and in a usable state
LOCATION '/user/bar/wombat';
Altering a Table
ALTER TABLE employees RENAME TO blr_employees;
Databases
CREATE DATABASE foo;
or
CREATE DATABASE IF NOT EXISTS foo;
or
CREATE DATABASE foo LOCATION '/snafu/wombat';
USE foo;
SELECT * FROM bar LIMIT 10;
or
SELECT * FROM foo.bar LIMIT 10;
DROP DATABASE foo;
or
DROP DATABASE IF EXISTS foo;
Other Operations
SHOW TABLES;
SHOW PARTITIONS all_employees;
SHOW PARTITIONS all_employees
PARTITION (location='Bangalore');
DESCRIBE employees;
DROP TABLE employees;
or
DROP TABLE IF EXISTS employees;
Joins
SELECT e.name, d.dept_name
FROM departments d JOIN all_employees e
ON (e.dept_id = d.dept_id);
or
ON (e.dept_id = d.dept_id);
Ordering of Data
ORDER BY: global ordering of results based on the selected columns
File-formats
TextFile: plain-text files; fields delimited with ^A by default
TextFile Delimiters
Default field-separator is ^A; row-separator is \n
John Doe^A36\n
Jane Doe^A33\n
Default list-separator is ^B; value-separator is ^C
John Doe^Adept^Cfinance^Bemp_id^C2357\n
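The default delimiters above can be traced with a small parser. This is a minimal sketch that assumes the second field of the last sample row is a set of key/value pairs; the class and method names are hypothetical and this is not Hive's own deserializer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: splits one TextFile row using the default delimiters described above.
class DefaultDelimiters {
    static final String FIELD = "\u0001"; // ^A separates fields
    static final String ITEM = "\u0002";  // ^B separates list items or map entries
    static final String KEY = "\u0003";   // ^C separates a key from its value

    // Splits a row (without the trailing \n) into its top-level fields.
    static String[] fields(String row) {
        return row.split(FIELD, -1);
    }

    // Parses one field holding key^Cvalue entries separated by ^B.
    static Map<String, String> mapField(String field) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : field.split(ITEM, -1)) {
            String[] kv = entry.split(KEY, 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        String row = "John Doe" + FIELD + "dept" + KEY + "finance"
                + ITEM + "emp_id" + KEY + "2357";
        String[] f = fields(row);
        System.out.println(f[0]);           // John Doe
        System.out.println(mapField(f[1])); // {dept=finance, emp_id=2357}
    }
}
```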
Buckets
Distribute partition-data into files based on columns
Improves performance for filters with these columns
Works best when data is uniformly distributed
CREATE TABLE employees(name STRING, age INT)
CLUSTERED BY (name) INTO 31 BUCKETS;
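The idea behind bucketing can be sketched as a hash of the clustering column taken modulo the bucket count. The sketch assumes `String.hashCode()` as a stand-in hash; Hive's own hash functions are type-specific and may assign different bucket numbers. The class name is hypothetical.

```java
// Simplified sketch of hash bucketing; not Hive's actual hash function.
class BucketSketch {
    // Returns a bucket number in the range [0, numBuckets).
    static int bucketFor(String key, int numBuckets) {
        // Mask the sign bit so the value is non-negative, then take the remainder.
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        String[] names = {"John Doe", "Jane Doe"};
        for (String name : names) {
            // The same name always maps to the same one of the 31 buckets.
            System.out.println(name + " -> bucket " + bucketFor(name, 31));
        }
    }
}
```

Because a given value always lands in the same bucket, a filter on the clustering column only needs to read one bucket file, and heavily skewed values would make some buckets much larger than others.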
Compressed Storage
Saves space and generally improves performance
Direct support for reading compressed files
LOAD DATA LOCAL INPATH '/foo/bar/emp_data.bz2'
INTO TABLE all_employees;
Tips
Judicious use of partitions and buckets can drastically
improve the performance of your queries
Hive at Yahoo
Specifics
Hive CLI available as /home/y/bin/hive on gateways of
supported grids
Advanced Features
User-defined Functions
Many very useful built-in functions
SHOW FUNCTIONS;
DESCRIBE FUNCTION foo;
many mapping
E.g. explode(), etc.
Custom UDF
package com.yahoo.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.Text;
@Description(
name = "toupper",
value = "_FUNC_(str) - Converts a string to uppercase",
extended = "Example:\n" +
" > SELECT toupper(author_name) FROM authors a;\n" +
Custom UDF (continued)
public class ToUpper extends UDF {
UDF Usage
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION TO_UPPER AS
'com.yahoo.hive.udf.ToUpper';
SELECT TO_UPPER(src.value) FROM src;
DROP TEMPORARY FUNCTION TO_UPPER;
Overloaded UDF
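import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFExampleAdd extends UDF {
  // Integer overload: NULL in, NULL out.
  public Integer evaluate(Integer a, Integer b) {
    if (a == null || b == null) return null;
    return a + b;
  }
  // Double overload, assumed to follow the same pattern as the Integer one.
  public Double evaluate(Double a, Double b) {
    if (a == null || b == null) return null;
    return a + b;
  }
}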
Overloaded UDF
add jar build/contrib/hive_contrib.jar;
CREATE TEMPORARY FUNCTION example_add AS
'org.apache.hadoop.hive.contrib.udf.UDFExampleAdd';
SELECT example_add(1, 2) FROM src;
SELECT example_add(1.1, 2.2) FROM src;
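The two calls above pick different `evaluate()` overloads based on argument types. The sketch below mirrors that with plain Java overload resolution and no Hive dependency. The class name is hypothetical, and Hive's own matching of argument types to `evaluate()` methods is not reproduced here.

```java
// Dependency-free stand-in: the same two overloads, without the UDF base class.
class ExampleAddSketch {
    // Chosen for integer arguments; a null argument yields null.
    static Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) return null;
        return a + b;
    }

    // Chosen for floating-point arguments; a null argument yields null.
    static Double evaluate(Double a, Double b) {
        if (a == null || b == null) return null;
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(1, 2));              // Integer overload
        System.out.println(evaluate(1.1, 2.2));          // Double overload
        System.out.println(evaluate((Integer) null, 2)); // null in, null out
    }
}
```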
UDAF Example
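SELECT page_url, count(1), count(DISTINCT user_id)
FROM mylog
GROUP BY page_url;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFCount extends UDAF {
  public static class Evaluator implements UDAFEvaluator {
    private int mCount;
    // Reset the running count.
    public void init() { mCount = 0; }
    // Called once per input row; only non-null values are counted.
    public boolean iterate(Object o) {
      if (o != null) mCount++;
      return true;
    }
    // Partial count handed on for merging.
    public Integer terminatePartial() { return mCount; }
    // Fold a partial count into this evaluator.
    public boolean merge(Integer o) { mCount += o; return true; }
    // Final result.
    public Integer terminate() { return mCount; }
  }
}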
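The order in which the evaluator methods are called can be traced with a dependency-free stand-in. This is a sketch under the assumption that each chunk of rows gets its own evaluator and that the partial counts are then merged into a final one. The class name and the `countAcross` driver are hypothetical and are not part of Hive.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the count evaluator, used only to trace the call sequence.
class CountEvaluatorSketch {
    private int count;

    void init() { count = 0; }
    boolean iterate(Object o) { if (o != null) count++; return true; }
    Integer terminatePartial() { return count; }
    boolean merge(Integer partial) { if (partial != null) count += partial; return true; }
    Integer terminate() { return count; }

    // Counts non-null values spread over several chunks of rows.
    static int countAcross(List<List<Object>> chunks) {
        CountEvaluatorSketch merged = new CountEvaluatorSketch();
        merged.init();
        for (List<Object> chunk : chunks) {
            // One evaluator per chunk: init, then iterate over every row.
            CountEvaluatorSketch partial = new CountEvaluatorSketch();
            partial.init();
            for (Object row : chunk) partial.iterate(row);
            // The partial result is folded into the final evaluator.
            merged.merge(partial.terminatePartial());
        }
        return merged.terminate();
    }

    public static void main(String[] args) {
        List<List<Object>> chunks = Arrays.asList(
                Arrays.<Object>asList("a", null, "b"),
                Arrays.<Object>asList("c"));
        System.out.println(countAcross(chunks)); // prints 3
    }
}
```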
Overloaded UDAF
public class UDAFSum extends UDAF {
public static class IntEvaluator implements UDAFEvaluator {
Overloaded UDAF
public static class DblEvaluator implements UDAFEvaluator {
private double mSum;
public void init() {mSum = 0;}
More Information
External References
Hive home-page: hive.apache.org
Hive wiki: cwiki.apache.org/confluence/display/Hive
Hive tutorial: cwiki.apache.org/confluence/display/Hive/Tutorial
Hive language manual: cwiki.apache.org/confluence/display/Hive/LanguageManual
Mailing-list: user@hive.apache.org
Internal References
Hive at Yahoo: wiki.corp.yahoo.com/view/Grid/Hive
Hive FAQ: wiki.corp.yahoo.com/view/Grid/HiveFAQ
Troubleshooting: wiki.corp.yahoo.com/view/Grid/HiveTroubleShooting
Questions?