Hive User Meeting March 2010: Cloudera Quick Start

Hive Quick Start

© 2010 Cloudera, Inc.

Background
•  Started at Facebook
•  Data was collected by nightly cron jobs into an Oracle DB
•  "ETL" via hand-coded Python
•  Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that


Hadoop as Enterprise Data Warehouse
•  Scribe and MySQL data loaded into Hadoop HDFS
•  Hadoop MapReduce jobs to process data
•  Missing components:
   –  Command-line interface for "end users"
   –  Ad-hoc query support
      •  … without writing full MapReduce jobs
   –  Schema information

Hive Applications
•  Log processing
•  Text mining
•  Document indexing
•  Customer-facing business intelligence (e.g. Google Analytics)
•  Predictive modeling, hypothesis testing

Hive Architecture
(architecture diagram)

Data Model
•  Tables
   –  Typed columns (int, float, string, date, boolean)
   –  Also, array/map/struct for JSON-like data
•  Partitions
   –  e.g. to range-partition tables by date
•  Buckets
   –  Hash partitions within ranges (useful for sampling, join optimization; see the sketch below)
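
A minimal HiveQL sketch of all three concepts, assuming a hypothetical page_views table (the table, columns, and bucket count are illustrative, not from the original slides):

-- Range-partitioned by date, hash-bucketed by userid for sampling/joins
CREATE TABLE page_views (
  userid INT,
  url STRING,
  props MAP<STRING, STRING>)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Read roughly 1/32 of the rows by sampling a single bucket
SELECT userid, url
FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) pv
WHERE dt = '2009-03-20';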

Column Data Types
CREATE TABLE t (
  s STRING,
  f FLOAT,
  a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>);

SELECT s, f, a[0]['foobar'].p2 FROM t;

Metastore
•  Database: namespace containing a set of tables
•  Holds table/partition definitions (column types, mappings to HDFS directories)
•  Statistics
•  Implemented with DataNucleus ORM; runs on Derby, MySQL, and many other relational databases

Physical Layout
•  Warehouse directory in HDFS
   –  e.g. /user/hive/warehouse
•  Table row data stored in subdirectories of the warehouse
•  Partitions form subdirectories of table directories
•  Actual data stored in flat files
   –  Control char-delimited text, or SequenceFiles
   –  With a custom SerDe, can use arbitrary format
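
As a sanity check (not in the original slides), the Hive CLI's built-in dfs command can list this layout; the table name foo and the partition value below are hypothetical:

hive> dfs -ls /user/hive/warehouse;
hive> dfs -ls /user/hive/warehouse/foo;
hive> dfs -cat /user/hive/warehouse/foo/dt=2009-03-20/*;

The first command shows one subdirectory per table, the second one subdirectory per partition, and the third the flat, ^A-delimited row data.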

Installing Hive
From a Release Tarball:
$ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
$ tar xvzf hive-0.5.0-bin.tar.gz
$ cd hive-0.5.0-bin
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH

Installing Hive
Building from Source:
$ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
$ cd hive
$ ant package
$ cd build/dist
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH

Installing Hive
Other Options:
•  Use a Git mirror:
   –  git://github.com/apache/hive.git
•  Cloudera Hive packages
   –  Redhat and Debian
   –  Packages include backported patches
   –  See archive.cloudera.com

Hive Dependencies
•  Java 1.6
•  Hadoop 0.17-0.20
•  Hive *MUST* be able to find Hadoop:
   –  $HADOOP_HOME=<hadoop-install-dir>
   –  Add $HADOOP_HOME/bin to $PATH

Hive Dependencies
•  Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

Hive Configuration
•  Default configuration in $HIVE_HOME/conf/hive-default.xml
   –  DO NOT TOUCH THIS FILE!
•  (Re)define properties in $HIVE_HOME/conf/hive-site.xml
•  Use $HIVE_CONF_DIR to specify an alternate conf dir location

Hive Configuration
•  You can override Hadoop configuration properties in Hive's configuration, e.g.:
   – mapred.reduce.tasks=1
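
For example, a session-level override from the CLI (a sketch; mapred.reduce.tasks is a standard Hadoop property, and foo(id, msg) is the illustrative table created later in this deck):

-- Force a single reducer for this session, then run a small aggregation
set mapred.reduce.tasks=1;
SELECT msg, count(1) FROM foo GROUP BY msg;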

Logging
•  Hive uses log4j
•  Log4j configuration located in $HIVE_HOME/conf/hive-log4j.properties
•  Logs are stored in /tmp/${user.name}/hive.log

Starting the Hive CLI
•  Start a terminal and run:
   $ hive
•  You should see a prompt like:
   hive>

Hive CLI Commands
•  Set a Hive or Hadoop conf prop:
   – hive> set propkey=value;
•  List all properties and values:
   – hive> set -v;
•  Add a resource to the DCache:
   – hive> add [ARCHIVE|FILE|JAR] filename;

Hive CLI Commands
•  List tables:
   – hive> show tables;
•  Describe a table:
   – hive> describe <tablename>;
•  More information:
   – hive> describe extended <tablename>;

Hive CLI Commands
•  List functions:
   – hive> show functions;
•  More information:
   – hive> describe function <functionname>;
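
For example, using the built-in concat function (a sketch; the employees table and its columns are the ones imported via Sqoop later in this deck):

hive> describe function concat;
hive> SELECT concat(firstname, ' ', lastname) FROM employees LIMIT 10;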

Selecting data
hive> SELECT * FROM <tablename> LIMIT 10;
hive> SELECT * FROM <tablename> WHERE freq > 100 SORT BY freq ASC LIMIT 10;
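
A concrete version of the same queries, assuming a hypothetical word_counts table with columns word STRING and freq INT:

hive> SELECT * FROM word_counts LIMIT 10;
hive> SELECT word, freq FROM word_counts WHERE freq > 100 SORT BY freq ASC LIMIT 10;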

Inc. .Manipulating Tables •  DDL operations –  SHOW TABLES –  CREATE TABLE –  ALTER TABLE –  DROP TABLE © 2010 Cloudera.

Creating Tables in Hive
•  Most straightforward:
   CREATE TABLE foo(id INT, msg STRING);
•  Assumes default table layout
   –  Text files, fields terminated with ^A, lines terminated with \n
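
Once the table exists, rows can be loaded from a local file or from HDFS; a minimal sketch (the file path is hypothetical):

-- Copy a local ^A-delimited text file into the table's warehouse directory
LOAD DATA LOCAL INPATH '/tmp/foo.txt' OVERWRITE INTO TABLE foo;
SELECT count(1) FROM foo;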

Changing Row Format
•  Arbitrary field and record separators are possible, e.g. CSV format:
   CREATE TABLE foo(id INT, msg STRING)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LINES TERMINATED BY '\n';

Partitioning Data
•  One or more partition columns may be specified:
   CREATE TABLE foo (id INT, msg STRING)
   PARTITIONED BY (dt STRING);
•  Creates a subdirectory for each value of the partition column, e.g.:
   /user/hive/warehouse/foo/dt=2009-03-20/
•  Queries with partition columns in the WHERE clause scan only a subset of the data (see the sketch below)
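
A sketch of loading one day of data and running a partition-pruned query (the file path and date are illustrative):

-- Load a day's worth of data into its own partition directory
LOAD DATA LOCAL INPATH '/tmp/foo-2009-03-20.txt'
OVERWRITE INTO TABLE foo PARTITION (dt='2009-03-20');

-- Only the dt=2009-03-20 subdirectory is scanned
SELECT count(1) FROM foo WHERE dt = '2009-03-20';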

Sqoop = SQL-to-Hadoop

Sqoop: Features
•  JDBC-based interface (MySQL, PostgreSQL, Oracle, etc.)
•  Automatic datatype generation
   –  Reads column info from table and generates Java classes
   –  Can be used in further MapReduce processing passes
•  Uses MapReduce to read tables from the database
   –  Can select an individual table (or subset of columns)
   –  Can read all tables in a database
•  Supports most JDBC standard types and null values

Example input
mysql> use corp;
Database changed
mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES  |     | NULL    |                |
| dept_id    | int(11)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+

Loading into HDFS
$ sqoop --connect jdbc:mysql://db.foo.com/corp \
    --table employees
•  Imports the "employees" table into an HDFS directory

Hive Integration
$ sqoop --connect jdbc:mysql://db.foo.com/corp \
    --hive-import --table employees
•  Auto-generates CREATE TABLE / LOAD DATA INPATH statements for Hive
•  After data is imported to HDFS, auto-executes the Hive script
•  Follow-up step: loading into partitions (see the sketch below)
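
One way to handle that follow-up step, sketched in HiveQL (the partitioned table name and the choice of dept_id as the partition column are illustrative, not something Sqoop does automatically):

-- Re-create the imported data as a table partitioned by department
CREATE TABLE employees_by_dept (
  id INT, firstname STRING, lastname STRING,
  jobtitle STRING, start_date STRING)
PARTITIONED BY (dept_id INT);

-- Static-partition insert, one department at a time
INSERT OVERWRITE TABLE employees_by_dept PARTITION (dept_id=3)
SELECT id, firstname, lastname, jobtitle, start_date
FROM employees
WHERE dept_id = 3;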

Hive Project Status
•  Open source, Apache 2.0 license
•  Official subproject of Apache Hadoop
•  Current version is 0.5.0
•  Supports Hadoop 0.17-0.20

Conclusions
•  Supports rapid iteration of ad-hoc queries
•  High-level interface (HiveQL) to low-level infrastructure (Hadoop)
•  Scales to handle much more data than many similar systems

Hive Resources
Documentation
•  wiki.apache.org/hadoop/Hive
Mailing Lists
•  hive-user@hadoop.apache.org
IRC
•  ##hive on Freenode

Carl Steinbach
carl@cloudera.com
