
Datatypes in Hive:

Every object in Hive (Database, Table, Partition, etc.) is a reference to a location in HDFS. A
database points to a folder in HDFS, a table points to a sub-folder under the database folder, a
partition references a sub-folder under the table folder, and so on down to the leaf nodes,
which are the actual data files.

Every table in Hive is an internal (managed) table by default, unless it is explicitly declared
external when it is created. An internal table holds a strong reference to the underlying file
system, which means that if you drop the table in Hive, the data it references is also deleted.

The external table is the opposite of the internal table. It references the data but is only
loosely coupled to it: when you drop the table in Hive, the data remains intact.

Now the question is: when would you use which one? There are many scenarios. A very basic
one: let's say you have a table in Hive that is the central dump of all the data, and I want
to use it temporarily for some computation, but I belong to another user group that doesn't
have access to the data. Instead of granting me access directly, you can simply ask me to create
an external table over the same data. I can use the data as I need to, and when I am done I will
simply drop my table, leaving your data untouched.
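That workflow can be sketched as follows (the table, column, and path names here are hypothetical):

```sql
-- Create an external table over data that another team owns.
-- Dropping this table later removes only the metadata, not the files.
CREATE EXTERNAL TABLE my_temp_sales (
  sale_id INT,
  amount  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/central_dump/sales';   -- existing HDFS directory

-- ... run the temporary computation ...

DROP TABLE my_temp_sales;  -- the files under /data/central_dump/sales survive
```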
If we load a userinfo file into the Movie table, Hive won't complain at load time, because it applies the schema on read, not on write.

Here we can use TRUNCATE TABLE movie (or reload with OVERWRITE) to discard the wrongly loaded data.


Partition in Managed Table:

Values of the partitioning columns won't be saved in the data files. When we fetch the value of a
partitioned column, Hive reads it from the directory name.
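As a sketch (table and column names assumed), a partitioned managed table lays out one directory per partition value:

```sql
CREATE TABLE movie_rating (
  movie_id INT,
  message  STRING
)
PARTITIONED BY (rating TINYINT);

-- Each partition becomes a sub-directory; the rating value lives only
-- in the directory name, for example:
--   /user/hive/warehouse/movie_rating/rating=5/
--   /user/hive/warehouse/movie_rating/rating=4/
```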
Predicate pushdown: a filter on the partition column lets Hive read only the matching partition directories (partition pruning).
****If we have both static and dynamic partitions, then the static partitions should come first.

The dynamic partition columns should appear in the same order as when we created the table.

**By default, the maximum number of dynamic partitions is 1000 (hive.exec.max.dynamic.partitions).


Too many partitions = too many mappers.
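Those partitioning rules can be illustrated with a hypothetical insert that mixes a static and a dynamic partition (table and column names assumed):

```sql
-- Needed only if no static partition value is given at all:
SET hive.exec.dynamic.partition.mode=nonstrict;

-- country is static, so it comes first; state is dynamic and must appear
-- in the same order as in the PARTITIONED BY clause of the target table.
INSERT INTO TABLE customers PARTITION (country = 'IN', state)
SELECT cust_id, cust_name, state
FROM customers_staging;
```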
ORDER BY – guarantees globally ordered records. It funnels all rows through a single reducer, and we have to minimize the use of reducers.
There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the
order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set
hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be
one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could
take a very long time to finish.

In Hive, ORDER BY guarantees total ordering of data, but for that everything has to be passed to a single
reducer, which is normally performance intensive. Therefore in strict mode, Hive makes it compulsory
to use LIMIT with ORDER BY so that the reducer doesn't get overburdened.
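A sketch of the strict-mode behaviour (the table name is an assumption):

```sql
SET hive.mapred.mode=strict;

-- Fails in strict mode: ORDER BY without LIMIT.
-- SELECT emp_id, emp_salary FROM employees ORDER BY emp_salary DESC;

-- Accepted: the LIMIT bounds the work of the single reducer.
SELECT emp_id, emp_salary
FROM employees
ORDER BY emp_salary DESC
LIMIT 10;
```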

Ordering: totally ordered data.

Outcome: a single output, i.e. fully ordered.

For example:
hive> SELECT emp_id, emp_salary FROM employees ORDER BY emp_salary DESC;

Reducer:

emp_id | emp_salary
10     | 5000
11     | 4000
17     | 3100
16     | 3000
13     | 2600
14     | 2500
20     | 2000
19     | 1800

DISTRIBUTE BY
Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the
same DISTRIBUTE BY column values will go to the same reducer.

SORT BY
Doesn't guarantee globally ordered data; the output of each reducer is sorted. We go for SORT BY
when we don't want globally sorted data, or when some logic in the reducer needs its intermediate
input to be sorted.

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order
will be dependent on the column types. If the column is of numeric type, then the sort order is also in
numeric order. If the column is of string type, then the sort order will be lexicographical order.
Ordering: it orders data at each of the N reducers, but reducers can have overlapping ranges of data.
Outcome: N or more sorted files with overlapping ranges.
Let's understand with the example of the query below:
hive> SELECT emp_id, emp_salary FROM employees SORT BY emp_salary DESC;

Let's assume the number of reducers was set to 2, and the output of reducer 1 is as follows:

Reducer 1:
emp_id | emp_salary
10     | 5000
16     | 3000
13     | 2600
19     | 1800
Example (taken directly from the Hive wiki):

We are distributing by x on the following 5 rows to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1:
x1
x2
x1

Reducer 2:
x4
x3

Note that all rows with the same key x1 are guaranteed to be distributed to the same
reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent
positions.
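To also get the equal keys adjacent within each reducer, DISTRIBUTE BY can be combined with SORT BY, or replaced by the shorthand CLUSTER BY; a sketch using the same t/x names:

```sql
-- Rows with the same x go to the same reducer AND arrive sorted there,
-- so equal keys end up in adjacent positions within each output file.
SELECT x FROM t DISTRIBUTE BY x SORT BY x;

-- Equivalent shorthand when distributing and sorting on the same columns:
SELECT x FROM t CLUSTER BY x;
```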
Here the row hashes to the 3rd bucket out of 64 buckets.
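As a sketch of bucketing (table and column names assumed): with 64 buckets, a row's bucket is chosen by hashing the clustering column modulo 64, and sampling can then read one bucket instead of the whole table.

```sql
CREATE TABLE users_bucketed (
  user_id INT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 64 BUCKETS;

-- A row lands in bucket hash(user_id) % 64; read just the 3rd bucket:
SELECT * FROM users_bucketed TABLESAMPLE (BUCKET 3 OUT OF 64 ON user_id);
```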
Streamable:
In a join of A and B, A is held in memory and B is streamed: Hive buffers all but the last table of a join in memory and streams the last one.
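Hive also lets you choose which table is streamed, via the STREAMTABLE hint; a sketch (table names assumed):

```sql
-- Force a to be streamed instead of the default (last) table in the join.
SELECT /*+ STREAMTABLE(a) */ a.key, a.value, b.value
FROM a JOIN b ON a.key = b.key;
```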
If we apply LEAD on the column Basket, then while processing a given ID, LEAD gives the value from the
next row (Susan here, so we can tell the basket has changed), while LAG gives the value from the
previous row (Mike, 4).
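A minimal windowing sketch (the table, columns, and ordering are assumptions):

```sql
SELECT
  id,
  basket,
  LEAD(basket) OVER (ORDER BY id) AS next_basket,  -- value from the following row
  LAG(basket)  OVER (ORDER BY id) AS prev_basket   -- value from the preceding row
FROM orders;
-- Comparing basket with next_basket / prev_basket shows where the basket changes.
```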
SECOND COURSE:
Order-management activity must reach the system immediately.

Revenue checks, by contrast, can use older data.


These are like volatile tables in Teradata: once the session is over, both the table and its contents are deleted.

A temporary table can be created with the same name as an existing permanent table, but as long as the
temporary table exists, it hides the permanent table from use.
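A sketch (table and column names assumed):

```sql
-- Lives only for this session; dropped automatically when the session ends.
CREATE TEMPORARY TABLE revenue_scratch AS
SELECT prod_id, SUM(amount) AS total
FROM orders
GROUP BY prod_id;

-- If a permanent table named revenue_scratch exists, this temporary table
-- shadows it until the temporary one is dropped or the session ends.
```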
Loading data with LOAD DATA Command:

If we load files from two locations with the same file name into the same table without OVERWRITE, Hive keeps both: the second file is renamed with a _copy_1 suffix instead of replacing the first.
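A sketch of the two loads (the staging paths are assumptions):

```sql
LOAD DATA INPATH '/staging/a/ratings.txt' INTO TABLE ratings;
LOAD DATA INPATH '/staging/b/ratings.txt' INTO TABLE ratings;
-- Without OVERWRITE the second load does not replace the first file.
-- With OVERWRITE it would wipe the table's existing files first:
-- LOAD DATA INPATH '/staging/b/ratings.txt' OVERWRITE INTO TABLE ratings;
```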

Writing to Multiple Tables:
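Hive's multi-table insert scans the source once and writes to several targets; a sketch (table names and the rating thresholds are assumptions):

```sql
-- One pass over ratings_staging feeds two destination tables.
FROM ratings_staging r
INSERT INTO TABLE good_ratings SELECT r.* WHERE r.rating >= 4
INSERT INTO TABLE bad_ratings  SELECT r.* WHERE r.rating <= 2;
```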


Cloudera training VM points:
* show create table ratings_external
* Dropping an internal table deletes the table, its directory and its files.
* Even if the file is not there, SELECT * will still work; it just won't fetch any data.
* PARTITIONED BY (`rating` tinyint)
* We cannot omit PARTITION during LOAD DATA if the table is partitioned: “Error while compiling
statement: FAILED: SemanticException [Error 10062]: Need to specify partition columns because the
destination table is partitioned”
* “Error while compiling statement: FAILED: SemanticException [Error 10096]: Dynamic partition strict
mode requires at least one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict”
* The LOAD statement can be used only for STATIC partitions, not for dynamic ones.
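A sketch of a static-partition load matching those error messages (the staging path is an assumption):

```sql
-- LOAD DATA needs an explicit (static) partition value on a partitioned table:
LOAD DATA INPATH '/staging/ratings_5.txt'
INTO TABLE ratings PARTITION (rating = 5);
```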
This appears to be the tail of the SHOW CREATE TABLE ratings_external output; the opening line was cut off and is reconstructed here (the table name is assumed):

CREATE EXTERNAL TABLE `ratings_external` (
  `posted` timestamp,
  `cust_id` int,
  `prod_id` int,
  `message` string)
PARTITIONED BY (
  `rating` tinyint)
CLUSTERED BY (message) INTO 4 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:8020/dualcore/ratings'
