Hive
NOTICE
This document was generated from Huawei study material.
Treat the information in this document as supporting
material.
1. Foreword
2. Introduction to Hive
5. Disadvantages of Hive
2. Introduction to Hive
• Hive is an open-source data warehouse system built on top of HDFS that adds structure to the data.
• It supports PB-level distributed data query and management.
• Tez: an application framework that allows a complex DAG (directed acyclic graph) of tasks for processing data.
• A data warehouse is a system used for reporting and data analysis. It generally refers to
the combination of many different databases across an entire enterprise. It usually stores
historical data so that all of the relevant data can be used for analysis.
• A database is an organized collection of data, gathered from varied sources into
a single place. It usually stores current data for querying.
• The main difference is that a database is used for online transactional processing
(OLTP), while a data warehouse is designed for online analytical processing (OLAP) to
answer questions that are critical to the business.
• Hive does not support real-time queries. Typical uses are network log analysis, text
analysis and report analysis.
• In FusionInsight Hadoop, HDFS and HBase sit at the bottom and store the data sets.
Above them, YARN manages and schedules resources for applications. MapReduce,
Spark and Tez are distributed parallel computing engines.
• Hive has several advantages. HiveServer, the Hive process that provides external SQL
services, runs in cluster mode. Dual MetaStore instances, which provide metadata
information, together with query retry after timeout, ensure the high reliability and
fault tolerance of Hive.
• It uses SQL-like queries, and users can define their own functions and storage formats.
5. Disadvantages of Hive
• Driver: manages the lifecycle of HQL execution and participates in the entire Hive
task. The Driver consists of three parts that process Hive query statements:
Compiler, Optimizer and Executor.
• By default, the Compiler compiles HQL statements into a graph of interdependent
Map and Reduce tasks.
• Logical optimizers and physical optimizers optimize HQL execution plans
and MapReduce tasks, respectively.
• The Executor runs the Map and Reduce tasks according to their dependencies.
• The Driver is the core and the most important part of Hive.
• The MetaStore stores the metadata of tables, columns and partitions. The Driver
connects to the MetaStore whenever it needs information about the data.
• There are two types of tables in Hive: managed tables and external tables.
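The Driver flow described in this section (compile a query into interdependent tasks, then execute them in dependency order) can be sketched in a few lines of Python. This is only an illustration of dependency-ordered execution, not Hive's actual Executor; the task names and the graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph that a compiler might emit for one query.
# Each key is a task; its value is the set of tasks it depends on.
task_graph = {
    "map_scan_employee": set(),
    "map_scan_department": set(),
    "reduce_join": {"map_scan_employee", "map_scan_department"},
    "reduce_group_by": {"reduce_join"},
}


def run_tasks(graph):
    """Visit every task after all of its dependencies; return the run order."""
    order = []
    for task in TopologicalSorter(graph).static_order():
        # A real executor would submit the task to the cluster here.
        order.append(task)
    return order


execution_order = run_tasks(task_graph)
print(execution_order)
```

Both scan tasks always appear before the join, and the join before the grouping step, which is the property the Executor relies on.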
8. Functions of Hive
• Hive provides many built-in functions, such as mathematical, date and string
functions.
• If the built-in functions are not enough, Hive also supports user-defined functions.
• Hive also supports encryption of one or more columns in a table. When creating a
Hive table, you can specify the columns to encrypt and the encryption algorithm.
• When data is inserted into the table with an INSERT statement, the specified columns
are encrypted.
• Hive column encryption does not support views or the Hive over HBase scenario.
• Hive currently supports two column encryption algorithms: AES and SMS4. For
example, the phone column and the address column can be encrypted using the AES
algorithm.
• Due to limitations of the underlying storage system, Hive does not support deleting a single
row of table data. FusionInsight HD Hive, however, supports deleting a single row of HBase
table data.
• In most cases, a carriage return character is the row delimiter in Hive tables
stored as text files; that is, a carriage return terminates a row during a query.
However, some data files are delimited by special characters other than a carriage
return. FusionInsight Hive lets you use different characters or character
combinations to delimit rows of Hive text data. To do so, set the input format and
output format when creating the table, and then specify the delimiter.
• With all these features, FusionInsight Hive is better than the community version in
reliability, fault tolerance, scalability and performance.
• HQL is the query language of Hive. It is very similar to SQL and likewise comprises three
types of statements: DDL (Data Definition Language), DML (Data Manipulation Language)
and DQL (Data Query Language).
• A few basic Hive operations follow. The first creates a managed table.
• In the example, "employee" is the table name, followed by each field's name, type and
optional comment.
• The last line of the statement specifies a comma as the delimiter for loaded data and
TEXTFILE as the storage format.
• Creating an external table differs in that you add the EXTERNAL keyword and specify the
storage location.
• Note that no data validity check is performed when data is imported into a Hive table.
The check is performed only when the data is read.
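The check-on-read behaviour noted above can be illustrated with a small Python sketch. It is a simplified model, not Hive code: the three-column schema and the sample rows are hypothetical, and it assumes that a field which does not match its declared type is reported as a missing value (None here, standing in for NULL).

```python
# Raw lines as they would sit in a text file; loading copies them unchecked.
raw_rows = ["1,Alice,5000", "2,Bob,not_a_number", "3,Carol,7200"]


def load(rows):
    """Loading stores rows as-is; nothing is validated at this point."""
    return list(rows)


def to_int(text):
    """Convert a field to int at read time; invalid text becomes None."""
    try:
        return int(text)
    except ValueError:
        return None


def read(stored):
    """Apply the assumed (id INT, name STRING, salary INT) schema on read."""
    result = []
    for line in stored:
        emp_id, name, salary = line.split(",")
        result.append((to_int(emp_id), name, to_int(salary)))
    return result


table = load(raw_rows)  # succeeds even though one row is malformed
rows = read(table)      # the malformed field only shows up here
print(rows)
```

The malformed salary is accepted at load time and only becomes visible when the row is read.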
• GROUP BY divides a data set into several smaller data sets based on certain
rules, and data processing is then performed on each smaller set.
• Aggregate functions can be applied to the grouped data.
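As a rough illustration of grouping with aggregate functions, the sketch below runs a GROUP BY query through Python's built-in sqlite3 module. SQLite stands in for Hive only because it runs anywhere; the employee table and its columns are hypothetical, and the query uses syntax that is common to SQL and HQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Alice", "sales", 5000), ("Bob", "sales", 7000), ("Carol", "it", 6000)],
)

# One output row per department, with a count and an average per group.
grouped = conn.execute(
    "SELECT dept, COUNT(*), AVG(salary) FROM employee "
    "GROUP BY dept ORDER BY dept"
).fetchall()
print(grouped)
```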
• UNION ALL combines the result sets of two or more SELECT statements. The
combined result set can contain duplicate values.
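The duplicate-keeping behaviour of UNION ALL can be seen by comparing it with plain UNION. The sketch below uses Python's sqlite3 module as a stand-in for Hive; the two staff tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff_2023 (name TEXT)")
conn.execute("CREATE TABLE staff_2024 (name TEXT)")
conn.executemany("INSERT INTO staff_2023 VALUES (?)", [("Alice",), ("Bob",)])
conn.executemany("INSERT INTO staff_2024 VALUES (?)", [("Bob",), ("Carol",)])

# UNION ALL keeps every row from both queries, including the repeated "Bob".
union_all = conn.execute(
    "SELECT name FROM staff_2023 UNION ALL SELECT name FROM staff_2024"
).fetchall()

# Plain UNION removes duplicates, shown here only for contrast.
union_distinct = conn.execute(
    "SELECT name FROM staff_2023 UNION SELECT name FROM staff_2024"
).fetchall()

print(len(union_all), len(union_distinct))
```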
• Two tables can be joined on a column they have in common.
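A minimal join sketch, again using Python's sqlite3 module in place of Hive. The employee and department tables and the shared dept_id column are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER)")
conn.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Alice", 1), ("Bob", 2), ("Carol", 3)],
)
conn.executemany(
    "INSERT INTO department VALUES (?, ?)",
    [(1, "sales"), (2, "it")],
)

# Inner join on the shared dept_id column; Carol has no matching
# department row, so she does not appear in the result.
joined = conn.execute(
    "SELECT e.name, d.dept_name FROM employee e "
    "JOIN department d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()
print(joined)
```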
• When one query serves as a condition of another query, it is called a subquery.
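A subquery used as a condition might look like the sketch below, run through Python's sqlite3 module as a stand-in for Hive. The tables and values are hypothetical; the inner SELECT supplies the set of values that the outer WHERE clause tests against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER)")
conn.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Alice", 1), ("Bob", 2), ("Carol", 1)],
)
conn.executemany(
    "INSERT INTO department VALUES (?, ?)",
    [(1, "sales"), (2, "it")],
)

# The inner query finds the sales department id; the outer query
# uses that result as its filter condition.
in_sales = conn.execute(
    "SELECT name FROM employee WHERE dept_id IN "
    "(SELECT dept_id FROM department WHERE dept_name = 'sales') "
    "ORDER BY name"
).fetchall()
print(in_sales)
```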