
Big Data Huawei Course

Hive
NOTICE
This document was generated from Huawei study material. Treat the
information in this document as supporting material.

Centro de Inovação EDGE - Big Data Course


Table of Contents

1. Foreword
2. Introduction to Hive
3. Data Warehouse vs. Database
4. Advantages of Hive
5. Disadvantages of Hive
6. Hive Functions and Architecture
7. Data Storage Model of Hive
    7.1. Partition and Bucket
    7.2. Managed Table and External Table
8. Functions of Hive
9. Enhanced Features of Hive
    9.1. Colocation Overview
    9.2. Using Colocation
    9.3. Encrypting Columns
    9.4. Deleting HBase Records in Batches
    9.5. Controlling Traffic
    9.6. Specifying Row Delimiters
10. Basic Hive Operations
    10.1. Hive Basic Operations (1)
    10.2. Hive Basic Operations (2)
    10.3. Hive Basic Operations (3)
    10.4. Hive Basic Operations (4)



Hive – Huawei Course
1. Foreword

• Hadoop was not easy for end users.
• Hive is an easy way to work with data stored in HDFS.
• Basically, you can think of Hive as SQL for Hadoop clusters.

2. Introduction to Hive

• Hive is an open-source data warehouse system on top of HDFS that adds structure to the data.
• It can support PB-level distributed data query and management.
• Hive can run on engines such as Tez (an application framework that allows a complex DAG of tasks for processing data).

3. Data Warehouse vs. Database

• A data warehouse is a system used for reporting and data analysis; it generally refers to the combination of many different databases across an entire enterprise. It usually stores historical data so that all of the relevant data may be used for analysis.
• A database is an organized collection of data. Data from varied sources is collected into a single place: this place is the database. It usually stores current data for querying.
• The main difference between them is that a database is used for online transactional processing, while a data warehouse is designed for online analytical processing, to analyze questions that are critical for your business.



• Consider a scenario where a bank ATM has disbursed cash to a customer but was unable to record the event in the bank's records. If this happened frequently, the bank wouldn't stay in business for long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand in front of the ATM. These records form a database. But if your manager suddenly asks for a report about, say, which customer groups to target for financial products, you cannot do the analysis on that database, because it keeps changing due to the various updates. That's why we need a data warehouse, where we can extract historical data from different sources and analyze specific problems.
• Hive is built on Hadoop for static batch processing. Hadoop has long latency and consumes a large amount of resources during job submission and scheduling.
• Hive does not support fast queries over large data sets. For example, querying a data set of hundreds of MB has minutes of delay. Therefore, Hive is inapplicable to scenarios that require low latency, such as online transaction processing.
• So, in actual applications, Hive can be used for data mining, non-real-time analysis, data aggregation, and data warehousing.

• Hive does not support real-time queries. We can use Hive for network log analysis, text analysis, and also some report analysis.
• In FusionInsight Hadoop, at the bottom are HDFS and HBase, which store the data sets. Above them is YARN, which manages and schedules resources for applications. MapReduce, Spark, and Tez are distributed parallel computing engines.



• On top is Hive, a platform used to develop SQL-like scripts, which are translated into MapReduce operations by default.
• All Hive data is stored in HDFS.



4. Advantages of Hive

• Hive has some advantages. For example, HiveServer is a Hive process that provides external SQL services in cluster mode, and the dual MetaStore, which provides metadata information and query retry after timeout, ensures the high reliability and fault tolerance of Hive.
• Hive uses SQL-like queries, and users can define their own functions and storage formats.

5. Disadvantages of Hive



• Hive is inapplicable to OLTP (Online Transactional Processing): we cannot add, update, or delete column-level data.
• UDF (User-Defined Functions)
• A stored procedure is similar to a UDF. To save time and memory, extensive or complex processing that requires the execution of several SQL statements can be saved into a stored procedure, and all applications call that procedure. The major difference between a stored procedure and a UDF is that a UDF can be used like any other expression within SQL statements, whereas a stored procedure must be invoked using the CALL statement.

6. Hive Functions and Architecture

• Driver: it manages the lifecycle of HQL execution and participates in the entire Hive task. The Driver is composed of three parts that break down the Hive query statements: Compiler, Optimizer, and Executor.
• By default, the Compiler compiles HQL statements into a graph of Map and Reduce tasks. These tasks are interdependent.
• There are logical optimizers and physical optimizers, which optimize the HQL execution plans and the MapReduce tasks, respectively.
• The Executor executes the Map and Reduce tasks based on task dependencies.
• The Driver is the most important part of Hive, the core of Hive.
• There is also the MetaStore, which stores the metadata of tables, columns, and partitions. The Driver can connect to the MetaStore to get any information about the data it needs.



• Hive is a data warehouse infrastructure software that mediates the interaction between users and HDFS.
• The user interfaces that Hive supports are the Hive command line, the Hive WebUI, JDBC, and ODBC.
  • Users can use the command-line interface to connect with Hive directly.
  • The WebUI is Hive's official web interface for interacting with data.
  • Users cannot directly use JDBC or ODBC to interact with the Driver. A Thrift Server is needed, which provides Thrift interfaces as the service behind JDBC and ODBC and integrates Hive with other applications.

• In FusionInsight Hadoop, Hive contains the following components: HiveServer, MetaStore, and WebHCat.
• HiveServer provides Hive database services externally. It translates HQL statements submitted by users into the corresponding YARN tasks or HDFS operations to complete data extraction, conversion, and analysis. Multiple HiveServers can be deployed in a cluster in load-sharing mode.
• MetaStore provides Hive metadata services: it reads, writes, maintains, and modifies the structure and attributes of Hive tables. MetaStore also provides Thrift interfaces for HiveServer, Spark, WebHCat, and other MetaStore clients to access and operate on metadata. Multiple MetaStores can be deployed in a cluster in load-sharing mode.



• WebHCat provides REST interfaces, through which users can run Hive commands and submit MapReduce tasks. Multiple WebHCats can be deployed in a cluster in load-sharing mode.
• As shown in the figure below, developers make HTTP requests to access Hadoop MapReduce, Pig, Hive, and HCatalog DDL from within applications.
• The data and code used by this API are maintained in HDFS.
• HCatalog DDL commands are executed directly when requested.
• MapReduce, Pig, and Hive jobs are placed in a queue by the WebHCat servers and can be monitored for progress or stopped as required.
• Developers specify a location in HDFS into which the Pig, Hive, and MapReduce results should be placed.
• Note that the current version does not support the Pig interface.

7. Data Storage Model of Hive



• In order of granularity, Hive data is organized into databases, tables, partitions, and buckets.
• A database is used to avoid naming conflicts for tables, views, partitions, columns, and so on. It can also be used to enforce security for a user or group of users.
• If no database is specified when a table is created, the default database is used.
• Tables are homogeneous units of data that share the same schema. A table basically corresponds to a directory in HDFS.

7.1. Partition and Bucket



• Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column, such as a date.
• Using partitions can make queries on slices of the data faster. To take an example where partitions are commonly used, imagine log files where each record includes a timestamp. If we partition by date, then records for the same date are stored in the same partition. The advantage of this scheme is that queries restricted to a particular date or set of dates can be answered much more efficiently, because they only need to scan the files in the partitions that the query pertains to.
• Notice that partitioning doesn't preclude more wide-ranging queries: it's still feasible to query the entire data set across many partitions. A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location. A partition corresponds to a subdirectory of the directory where a table resides. Tables or partitions may be subdivided further into buckets, based on the value of a hash function of some column of the table, to give extra structure to the data that may be used for more efficient queries. For example, bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users. A bucket corresponds to a file under the path where a table or a partition resides.
• There are two reasons why you might want to organize your tables or partitions into buckets.
• The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when answering certain queries. In particular, a join of two tables that are bucketed on the same columns, including the join columns, can be efficiently implemented as a map-side join.
• The second reason to bucket a table is to make sampling more efficient. When working with large data sets, it's very convenient to try out queries on a fraction of your data set while you are in the process of developing or refining them.
• Note that it's not necessary for tables to be partitioned or bucketed, but this structure allows the system to prune large quantities of data during query processing, resulting in faster query execution.
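The partition and bucket concepts above can be sketched in HiveQL. Table and column names here are illustrative, not taken from the course material:

```sql
-- Partitioned table: dt and country are not stored in the data files;
-- they become a subdirectory hierarchy under the table directory.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

-- A query restricted to one date scans only that partition's files.
SELECT ts, line FROM logs WHERE dt = '2014-01-01';

-- Bucketed table: rows are hashed on id into 4 files per directory.
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Sampling reads only one bucket instead of the whole table.
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```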

7.2. Managed Table and External Table

• There are two types of tables in Hive: managed tables and external tables.



• When you create a table in Hive, by default Hive manages the data, which means that Hive moves the data into its warehouse directory. Alternatively, you may create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory.
• The difference between the two table types is seen in the CREATE/LOAD and DROP semantics. Let's consider a managed table first. When you create or load data into a managed table, the data is moved into Hive's warehouse directory. If the table is later dropped, the table, including its metadata and its data, is deleted. It bears repeating that since the initial load performs a move operation and the drop performs a delete operation, the data no longer exists anywhere. This is what it means for Hive to manage the data.
• External tables behave differently: you control the creation and deletion of the data. The location of the external data is specified at table creation time. With the EXTERNAL keyword, Hive knows that it's not managing the data, so it doesn't move it to its warehouse directory. Indeed, it doesn't even check whether the external location exists at the time the table is defined.
• This is a useful feature because it means you can create the data lazily after creating the table. When you drop an external table, Hive leaves the data untouched and only deletes the metadata.
• How to choose between the two? In most cases there is not much difference, so it's just a matter of preference.



• In general, if you are doing all your processing with Hive, use managed tables. But if you wish to use Hive and other tools on the same data set, use external tables. A common pattern is to use an external table to access an initial data set stored in HDFS, then use a Hive transform to move the data into a managed Hive table. This works the other way around, too: an external table can be used to export data from Hive for other applications to use.
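A minimal sketch of the two table types and their DROP semantics (paths and names are hypothetical):

```sql
-- Managed table: LOAD moves the file into the warehouse directory,
-- and DROP deletes both the metadata and the data.
CREATE TABLE managed_logs (line STRING);
LOAD DATA INPATH '/tmp/input.txt' INTO TABLE managed_logs;
DROP TABLE managed_logs;   -- data is gone

-- External table: Hive only records the location and never moves
-- the files; DROP deletes the metadata and leaves the files intact.
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/external_logs';
DROP TABLE external_logs;  -- files under /data/external_logs remain
```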

8. Functions of Hive

• Hive provides multiple built-in functions, such as mathematical functions, date functions, and string functions.
• If the built-in functions are not enough, Hive also supports user-defined functions.
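A few of the standard built-in functions, for illustration (the table and column names are hypothetical):

```sql
SELECT round(3.14159, 2),              -- mathematical function
       to_date('2014-01-01 10:00:00'), -- date function
       concat(first_name, last_name),  -- string function
       upper(first_name)               -- string function
FROM employee;
```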

9. Enhanced Features of Hive

9.1. Colocation Overview



• Based on the Hive provided by the open-source community, Hive in FusionInsight HD has many features customized for enterprises, such as colocation table creation, column encryption, and syntax enhancements.
• HDFS Colocation is the data-location control function provided by the Hadoop distributed file system. The HDFS colocation interface is used to store associated data, or data on which associated operations are performed, on the same storage node.
• Hive supports the HDFS colocation function: when Hive tables are created, after the location information is set for the table files, the data files of related tables are stored on the same storage nodes, ensuring convenient and efficient data computing among associated tables.

9.2. Using Colocation



• This is how to implement Hive Colocation. First, use the HDFS interface to create a group ID and a locator ID. Then create the tables with that group ID and locator ID.
• Note that you must use the INSERT statement to import data into this type of table for the HDFS Colocation feature to take effect.
• The file format can only be TEXTFILE or RCFile (Record Columnar File).
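The table-creation step can be sketched as follows, assuming the FusionInsight-specific TBLPROPERTIES keys (groupId, locatorId); the exact property names and the HDFS command that creates the group are product-specific and shown here only as assumptions:

```sql
-- Both tables reference the same group and locator, so their data
-- files land on the same storage nodes.
CREATE TABLE colo_table1 (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('groupId' = 'group01', 'locatorId' = 'locator01');

-- Data must be imported with INSERT for colocation to take effect.
INSERT INTO TABLE colo_table1 SELECT id, name FROM source_table;
```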

9.3. Encrypting Columns

• Hive also supports encryption of one or multiple columns in a table. When creating a Hive table, you can specify the columns to be encrypted and the encryption algorithm.
• When data is inserted into the table using the INSERT statement, the related columns are encrypted.
• Hive column encryption does not support views or the Hive over HBase scenario.
• Hive currently supports two column encryption algorithms: AES and SMS4. For example, the phone column and the address column can be encrypted using the AES algorithm.
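A sketch of column encryption at table creation; the SERDEPROPERTIES keys and the AES rewriter class name follow the FusionInsight documentation as best recalled and should be treated as assumptions:

```sql
CREATE TABLE encrypted_user (
  id      INT,
  name    STRING,
  phone   STRING,
  address STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'column.encode.columns'   = 'phone,address',  -- columns to encrypt
  'column.encode.classname' = 'org.apache.hadoop.hive.serde2.AESRewriter'
)
STORED AS TEXTFILE;
```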

9.4. Deleting HBase Records in Batches

• Due to limitations of the underlying storage system, Hive does not support deleting a single piece of table data. However, FusionInsight HD Hive supports deleting individual pieces of HBase table data using a specific statement: Hive can delete one or multiple pieces of data from an HBase table that match an expression.
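The statement referred to above might look as follows; REMOVE TABLE is a FusionInsight extension, not standard HiveQL, and the exact syntax is an assumption:

```sql
-- Delete the HBase rows that match the WHERE expression.
REMOVE TABLE hbase_table1 WHERE id = 1;
```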

9.5. Controlling Traffic



• Why add this feature? Because too many client requests may cause restarts, FusionInsight Hive supports controlling the number of user connections.

9.6. Specifying Row Delimiters

• In most cases, a carriage return character is used as the row delimiter in Hive tables stored in text files; that is, the carriage return character is used as the row terminator during queries. However, some data files are delimited by special characters other than a carriage return. FusionInsight Hive allows you to use different characters or character combinations to delimit rows of Hive table data. The steps are: set the input format and output format when creating the table, and then specify the delimiter.
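The two steps described above, sketched with an assumed delimiter property name (the exact input format class and property are product-specific):

```sql
-- 1. Create the table with explicit input and output formats.
CREATE TABLE multi_delim (id INT, name STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- 2. Specify the row delimiter (here the character combination '!@!').
SET hive.textinput.record.delimiter = '!@!';
```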

• With all these features, FusionInsight Hive outperforms the community version in reliability, fault tolerance, scalability, and performance.

10. Basic Hive Operations

• HQL is the query language of Hive. It is very similar to SQL and also includes three types of statements: DDL (Data Definition Language), DML (Data Manipulation Language), and DQL (Data Query Language).



10.1. Hive Basic Operations (1)

• Here are a few examples of basic Hive operations. The first is creating a managed table.
• In the example, "employee" is the table name, followed by each field's name, type, and an optional comment.
• The last line specifies the field delimiter as a comma and the storage format as textfile.

• The difference when creating an external table is adding the EXTERNAL keyword and also specifying the storage location.
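Since the original slides are not reproduced here, the following sketch uses hypothetical field names to show the two CREATE statements being described:

```sql
-- Managed table: comma-delimited text file.
CREATE TABLE employee (
  id            INT    COMMENT 'employee id',
  name          STRING COMMENT 'employee name',
  dateincompany STRING COMMENT 'date the employee joined'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: add EXTERNAL and specify the storage location.
CREATE EXTERNAL TABLE employee_ext (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/external/employee_ext';
```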

10.2. Hive Basic Operations (2)



• The first statement changes a column into "money" with STRING type and a comment, placing it AFTER the "dateincompany" column.
• The second adds columns to an existing table, given the table name and the new column names and types.
• The third statement changes the file format to textfile.
• You can delete records of a table given a condition. After using the DROP statement, the table doesn't exist anymore.
• The fifth statement lists the columns and column types of a table.
• The last one shows the statement that was used to create the table.
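The statements being described can be sketched as follows (table and column names are hypothetical, apart from "dateincompany" from the text above):

```sql
-- 1. Change a column's name, type, comment, and position.
ALTER TABLE employee CHANGE salary money STRING
  COMMENT 'salary column' AFTER dateincompany;

-- 2. Add columns to an existing table.
ALTER TABLE employee ADD COLUMNS (dept STRING);

-- 3. Change the file format to textfile.
ALTER TABLE employee SET FILEFORMAT TEXTFILE;

-- 4. Drop the table; it no longer exists afterwards.
DROP TABLE employee;

-- 5. List the columns and column types of a table.
DESCRIBE employee;

-- 6. Show the statement used to create the table.
SHOW CREATE TABLE employee;
```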

10.3. Hive Basic Operations (3)

• Notice that when data is imported into a Hive table, no data validity check is performed; the check is performed only when the data is read.
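Loading data can be sketched as follows, with hypothetical paths; note that no validation happens at load time:

```sql
-- From the local file system (copies the file into the table directory).
LOAD DATA LOCAL INPATH '/tmp/employee.txt' INTO TABLE employee;

-- From HDFS (moves the file); OVERWRITE replaces the existing data.
LOAD DATA INPATH '/data/employee.txt' OVERWRITE INTO TABLE employee;
```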



10.4. Hive Basic Operations (4)

• GROUP BY divides a data set into several smaller data sets based on certain rules, and data processing is then performed on each smaller set.
• Data can be aggregated by group using aggregation functions.
• UNION ALL combines the result sets of two or more SELECT statements; the combined result set can contain duplicate values.
• We can join two tables on a common column.
• When one query's result is used as the condition of another query, it is called a subquery.
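The query patterns above, sketched with hypothetical tables:

```sql
-- GROUP BY with an aggregation function.
SELECT dept, count(*) FROM employee GROUP BY dept;

-- UNION ALL: the combined result set may contain duplicate rows.
SELECT name FROM employee
UNION ALL
SELECT name FROM employee_archive;

-- JOIN two tables on a common column.
SELECT e.name, d.dept_name
FROM employee e
JOIN department d ON e.dept = d.dept_id;

-- Subquery: one query's result is the condition of another.
SELECT name FROM employee
WHERE dept IN (SELECT dept_id FROM department WHERE city = 'Recife');
```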

