Apache Hive
An Introduction
Agenda
Overview
A quick look at Hive and its background.
Structure
A peek at the structure of Hive.
Language
How to write DDL and DML statements in Hive.
Hive at Yahoo
Working with Hive on Yahoo's grids.
Advanced Features
Some more things you can do with Hive.
More Information
Where to look when you need more details or help.
Overview
Motivation for Hive
Companies are no longer dealing with gigabytes, but rather terabytes
Large amounts of data to analyze
Researchers want to study and understand the data
Business folks want to slice and dice the data & metrics in various ways
Everyone is impatient: "Give me answers now"
Joining across large data sets is quite tricky
Motivation for Hive
Started in January 2007 at Facebook
Query data on Hadoop without having to write complex MapReduce programs in Java each time
SQL chosen for familiarity and tools-support
An active open-source project since August 2008
Top-level Apache project (hive.apache.org)
Used in many companies; a diverse set of contributors
What Hive Is
A Hadoop-based system for managing and querying structured data
Hive provides a view of your data as tables with rows and columns
Uses HDFS for storing data
Provides a SQL-like interface for querying data
Uses MapReduce for executing queries
Scales well to handle massive data-sets
Example
SELECT COUNT(1) AS job_count, t.wait_time
FROM (SELECT ROUND(wait_time/1000) AS wait_time, job_id
      FROM starling_jobs
      WHERE grid = 'MB' AND dt >= '2011_07_11' AND dt <= '2011_07_13') t
GROUP BY t.wait_time;
8 Simple steps
1. Log in to a grid gateway machine.
2. Create an HDFS directory to use as your Hive warehouse location, e.g. hadoop fs -mkdir /user/vmoorthy/warehouse
3. Start the Hive shell by running hive
4. SET mapred.job.queue.name=unfunded; -- to run your job in the unfunded queue
8 Simple steps (contd.)
5. Create a database, specifying its storage location, e.g.
   CREATE DATABASE autos LOCATION '/user/vmoorthy/warehouse';
6. USE autos; -- to work with the previously created database named 'autos'
7. Create a table for the tab-separated HDFS file named usedCarTrim:
   CREATE TABLE used_car(chromeTrimId INT, trimId INT, usedCarCondition STRING,
     usedCarMileage INT, usedCarPrice INT, chromeModelId INT, modelId INT)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
   LOCATION '/user/vmoorthy/usedCarTrim';
8 Simple steps (contd.)
8. Now you are ready to run SELECT queries on the above table, e.g.
   SELECT * FROM used_car WHERE chrometrimid > 3030;
Structure
Architecture
Hive components:
Interfaces: Command-line Interface, Web Interface, JDBC and ODBC (via the Thrift Server)
Meta-store: table metadata, backed by a database
Driver: Compiler, Optimizer, Executor
Runs on top of Hadoop
Query Execution
Query → Parser → Logical Plan Generator → Optimizer → Physical Plan Generator → Executor → MapReduce Job(s)
Storage
Table metadata is stored in the meta-store
Directories for databases, tables and partitions; files for table-data
<warehouse-directory>
  <database-directory>
    <table-directory>
      <partition-directory>
        <data-file1>
        <data-file2>
        [...]
        <data-filen>
Language
Data Model
Database: a namespace for tables and other units of data (the 'default' database is used if none is specified)
Table: a row-based store for data in a database; each row having one or more columns
Partition: a key-based separation of data in a table for reducing the amount of data scanned (optional)
Bucket: a cluster of data in a partition based on hashing a column-value (optional)
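A minimal sketch tying these concepts together (database, table and column names here are illustrative, not from the deck):
  CREATE DATABASE analytics;
  USE analytics;
  CREATE TABLE page_views(user_id INT, url STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS;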
Primitive Data-types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Boolean: BOOLEAN (TRUE / FALSE)
Floating-point: FLOAT, DOUBLE
String: STRING
Implicit and explicit casting supported
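For example, explicit casts use CAST; a sketch assuming the employees(name, age) table defined later in the deck:
  SELECT name, CAST(age AS DOUBLE), CAST('42' AS INT) FROM employees;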
Complex Data-types
Arrays: a list of elements of the same data-type, accessible using an index. A[n] denotes the element at index n (starting from zero) in array A
Structs: a record with named elements. foo.bar denotes the field bar in the struct foo
Maps: mappings from keys to respective values. M[foo] denotes the value for key foo in the map M
Collections can be nested arbitrarily
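A small sketch of a table using these types (all names are illustrative):
  CREATE TABLE emp_details(name STRING, skills ARRAY<STRING>,
    attrs MAP<STRING, STRING>, address STRUCT<city:STRING, zip:INT>);
  SELECT skills[0], attrs['dept'], address.city FROM emp_details;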
Operators
Relational: =, !=, <, <=, etc. as well as IS NULL, IS NOT NULL, LIKE, etc. Generate TRUE or FALSE based on comparison
Arithmetic: +, -, *, /, etc. Generate a number based on the result of the arithmetic operation
Logical: AND, OR, NOT, etc. Generate TRUE or FALSE by combining Boolean expressions
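For instance, a single query can combine all three kinds (again using the employees table defined later in the deck):
  SELECT name, age + 1 FROM employees
  WHERE age >= 30 AND name LIKE 'J%' AND age IS NOT NULL;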
Built-in Functions
Mathematical: round(), floor(), rand(), etc.
String: concat(), substr(), regexp_replace(), etc.
Time: to_date(), from_unixtime(), year(), month(), etc.
Aggregates: count(), sum(), min(), max(), avg()
and quite a lot more
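A couple of illustrative queries against the same employees table:
  SELECT substr(name, 1, 3), concat(name, '_emp'), floor(age / 10) * 10 FROM employees;
  SELECT count(1), avg(age), min(age), max(age) FROM employees;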
Creating a Table
CREATE TABLE employees(name STRING, age INT);
or
CREATE TABLE IF NOT EXISTS employees(name STRING, age INT);
or
CREATE TABLE employees(name STRING, age INT) PARTITIONED BY (join_dt STRING);
or
CREATE TABLE employees(name STRING, age INT) STORED AS SequenceFile;
etc.
Loading Data
LOAD DATA INPATH '/foo/bar/snafu.txt' INTO TABLE employees;
or
LOAD DATA LOCAL INPATH '/homes/wombat/emp_2011-12-01.txt' INTO TABLE employees PARTITION (join_dt='2011_12_01');
or
INSERT OVERWRITE TABLE employees SELECT name, age FROM all_employees
WHERE location = 'Bangalore';
Querying Data
SELECT * FROM employees;
or
SELECT * FROM employees LIMIT 10;
or
SELECT name, age FROM employees WHERE age > 30;
or
SET hive.exec.compress.output=false;
SET hive.cli.print.header=true;
INSERT OVERWRITE LOCAL DIRECTORY '/homes/wombat/blr' SELECT * FROM all_employees WHERE location = 'Bangalore';
etc.
External Tables
Data not managed by Hive
Useful when data is already processed and in a usable state
Manually clean up after dropping tables/partitions
CREATE EXTERNAL TABLE foo(name STRING, age INT)
LOCATION '/user/bar/wombat';
Altering a Table
ALTER TABLE employees RENAME TO blr_employees;
ALTER TABLE employees REPLACE COLUMNS (emp_name STRING, emp_age INT);
ALTER TABLE employees ADD COLUMNS (emp_id STRING);
ALTER TABLE all_employees DROP PARTITION (location='Slackville');
Databases
CREATE DATABASE foo;
or
CREATE DATABASE IF NOT EXISTS foo;
or
CREATE DATABASE foo LOCATION '/snafu/wombat';
USE foo;
SELECT * FROM bar LIMIT 10;
or
SELECT * FROM foo.bar LIMIT 10;
DROP DATABASE foo;
or
DROP DATABASE IF EXISTS foo;
Other Operations
SHOW TABLES;
SHOW PARTITIONS all_employees;
SHOW PARTITIONS all_employees PARTITION (location='Bangalore');
DESCRIBE employees;
DROP TABLE employees;
or
DROP TABLE IF EXISTS employees;
Joins
SELECT e.name, d.dept_name
FROM departments d JOIN all_employees e
ON (e.dept_id = d.dept_id);
or
SELECT e.name, d.dept_name
FROM departments d LEFT OUTER JOIN all_employees e
ON (e.dept_id = d.dept_id);
Ordering of Data
ORDER BY: global ordering of results based on the selected columns
SORT BY: local ordering of results on each reducer, based on the selected columns
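For example (the difference only becomes visible when a query runs with more than one reducer):
  SELECT name, age FROM employees ORDER BY age DESC; -- one globally sorted result
  SELECT name, age FROM employees SORT BY age DESC; -- sorted within each reducer's output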
File-formats
TextFile: plain-text files; fields delimited with ^A by default
SequenceFile: serialized objects, possibly compressed
RCFile: columnar storage of serialized objects, possibly compressed
TextFile Delimiters
Default field-separator is ^A; row-separator is \n
  John Doe^A36\n
  Jane Doe^A33\n
Default list-separator is ^B; map key-value separator is ^C
  John Doe^Adept^Cfinance^Bemp_id^C2357\n
CREATE TABLE employees(name STRING, age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED by '\t';
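If the data also contains lists and maps, the non-default separators can be declared as well; a sketch with illustrative column names ('\002' and '\003' are the octal escapes for ^B and ^C):
  CREATE TABLE emp_attrs(name STRING, attrs MAP<STRING, STRING>)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003';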
Buckets
Distribute partition-data into files based on columns
Improves performance for filters with these columns
Works best when data is uniformly distributed
CREATE TABLE employees(name STRING, age INT) CLUSTERED BY (name) INTO 31 BUCKETS;
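Bucketing also enables efficient sampling; a sketch using TABLESAMPLE on the bucketed column:
  SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 31 ON name) e;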
Compressed Storage
Saves space and generally improves performance
Direct support for reading compressed files
LOAD DATA LOCAL INPATH '/foo/bar/emp_data.bz2' INTO TABLE all_employees;
Compressed TextFile cannot usually be split
SequenceFile or RCFile recommended instead
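A sketch of writing compressed, splittable output (table name is illustrative; property names match the Hadoop/Hive versions of that era, so adjust for your cluster):
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  CREATE TABLE emp_rc(name STRING, age INT) STORED AS RCFILE;
  INSERT OVERWRITE TABLE emp_rc SELECT name, age FROM all_employees;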
Tips
Judicious use of partitions and buckets can drastically improve the performance of your queries
Put always-used Hive CLI commands in $HOME/.hiverc (e.g. SET mapred.job.queue.name=unfunded;)
Use EXPLAIN to analyze a query before executing it
Use RCFile with compression to save storage and to improve performance
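For example, prefixing a query with EXPLAIN prints its plan (stages and MapReduce jobs) without running it:
  EXPLAIN SELECT name, age FROM employees WHERE age > 30;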
Hive at Yahoo
Specifics
Hive CLI available as /home/y/bin/hive on gateways of supported grids
Mandatory LOCATION clause in CREATE TABLE
Must specify MapReduce queue for submitted Jobs
(e.g. SET mapred.job.queue.name=unfunded;)
No JDBC / ODBC support
Integrated with HCatalog
Advanced Features
User-defined Functions
Many very useful built-in functions
  SHOW FUNCTIONS;
  DESCRIBE FUNCTION foo;
Extensible using user-defined functions
User-defined Function (UDF): one-to-one mapping, e.g. round(), concat(), unix_timestamp(), etc.
User-defined Aggregate Function (UDAF): many-to-one mapping, e.g. sum(), avg(), stddev(), etc.
User-defined Table-generating Function (UDTF): one-to-many mapping, e.g. explode(), etc.
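For instance, explode() turns an array column into rows; a sketch against the hypothetical emp_details table from the complex-types slide:
  SELECT explode(skills) AS skill FROM emp_details;
  SELECT name, skill FROM emp_details LATERAL VIEW explode(skills) t AS skill; -- keeps other columns alongside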
Custom UDF
package com.yahoo.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.Text;

@Description(
    name = "toupper",
    value = "_FUNC_(str) - Converts a string to uppercase",
    extended = "Example:\n"
             + " > SELECT toupper(author_name) FROM authors a;\n"
             + " STEPHEN KING"
)
Custom UDF (contd.)
public class ToUpper extends UDF {
    public Text evaluate(Text s) {
        Text to_value = new Text("");
        if (s != null) {
            try {
                to_value.set(s.toString().toUpperCase());
            } catch (Exception e) {
                // Should never happen
                to_value = new Text(s);
            }
        }
        return to_value;
    }
}
UDF Usage
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION TO_UPPER AS 'com.yahoo.hive.udf.ToUpper';
SELECT TO_UPPER(src.value) FROM src;
DROP TEMPORARY FUNCTION TO_UPPER;
Overloaded UDF
public class UDFExampleAdd extends UDF {
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) return null;
        return a + b;
    }
    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) return null;
        return a + b;
    }
}
Overloaded UDF
add jar build/contrib/hive_contrib.jar;
CREATE TEMPORARY FUNCTION example_add AS 'org.apache.hadoop.hive.contrib.udf.UDFExampleAdd';
SELECT example_add(1, 2) FROM src;
SELECT example_add(1.1, 2.2) FROM src;
UDAF Example
SELECT page_url, count(1), count(DISTINCT user_id) FROM mylog GROUP BY page_url;
public class UDAFCount extends UDAF {
    public static class Evaluator implements UDAFEvaluator {
        private int mCount;
        public void init() { mCount = 0; }
        public boolean iterate(Object o) { if (o != null) mCount++; return true; }
        public Integer terminatePartial() { return mCount; }
        public boolean merge(Integer o) { mCount += o; return true; }
        public Integer terminate() { return mCount; }
    }
}
Overloaded UDAF
public class UDAFSum extends UDAF {
    public static class IntEvaluator implements UDAFEvaluator {
        private int mSum;
        public void init() { mSum = 0; }
        public boolean iterate(Integer o) { mSum += o; return true; }
        public Integer terminatePartial() { return mSum; }
        public boolean merge(Integer o) { mSum += o; return true; }
        public Integer terminate() { return mSum; }
    }
Overloaded UDAF
    public static class DblEvaluator implements UDAFEvaluator {
        private double mSum;
        public void init() { mSum = 0; }
        public boolean iterate(Double o) { mSum += o; return true; }
        public Double terminatePartial() { return mSum; }
        public boolean merge(Double o) { mSum += o; return true; }
        public Double terminate() { return mSum; }
    }
}
What Hive Is Not
Not suitable for small data-sets
Does not provide real-time results
Does not support row-level updates
Imposes a schema on the data
Does not support transactions
Does not need expensive server-class hardware, RDBMS licenses or god-like DBAs to scale
More Information
External References
Hive home-page: hive.apache.org
Hive wiki: cwiki.apache.org/confluence/display/Hive
Hive tutorial: cwiki.apache.org/confluence/display/Hive/Tutorial
Hive language manual: cwiki.apache.org/confluence/display/Hive/LanguageManual
Mailing-list: user@hive.apache.org
Internal References
Hive at Yahoo: wiki.corp.yahoo.com/view/Grid/Hive
Hive FAQ: wiki.corp.yahoo.com/view/Grid/HiveFAQ
Troubleshooting: wiki.corp.yahoo.com/view/Grid/HiveTroubleShooting
Internal mailing-list: hive-users@yahoo-inc.com
Hive CLI yinst package: hive_cli
Installation instructions: wiki.corp.yahoo.com/view/Grid/HiveInstallation
Questions?