velocity, and variety of data that is increasing day by day. Using traditional data
management systems, it is difficult to process Big Data. Therefore, the Apache Software
Foundation introduced a framework called Hadoop to solve Big Data management and
processing challenges.
Hadoop
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive is not
•A relational database
•A design for OnLine Transaction Processing (OLTP)
•A language for real-time queries and row-level updates
Features of Hive
•It stores schema in a database and processed data into HDFS.
•It is designed for OLAP.
•It provides an SQL-type language for querying, called HiveQL or HQL.
•It is familiar, fast, scalable, and extensible.
Architecture of Hive
This component diagram contains different units. The following table describes each unit:
User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing a MapReduce program in Java, we can write a query, and it is processed as a MapReduce job.

HDFS or HBASE: HDFS (Hadoop Distributed File System) or HBASE are the data storage techniques used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step 1 (Execute Query): The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

Step 2 (Get Plan): The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.

Step 3 (Get Metadata): The compiler sends a metadata request to the Metastore (any database).

Step 4 (Send Metadata): The Metastore sends the metadata as a response to the compiler.

Step 5 (Send Plan): The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

Step 6 (Execute Plan): The driver sends the execute plan to the execution engine.

Step 7 (Execute Job): Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the Name node, and it assigns this job to the TaskTracker, which resides in the Data node. Here, the query executes the MapReduce job.

Step 7.1 (Metadata Ops): Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.

Step 8 (Fetch Result): The execution engine receives the results from the Data nodes.

Step 9 (Send Results): The execution engine sends those resultant values to the driver.

Step 10 (Send Results): The driver sends the results to the Hive interfaces.
Hive - Partitioning
Hive organizes tables into partitions. It is a way of dividing a table into related parts based
on the values of partitioned columns such as date, city, and department. Using partition, it
is easy to query a portion of the data.
Tables or partitions can be subdivided into buckets, which provide extra structure to the data that
may be used for more efficient querying. Bucketing works based on the value of a hash
function applied to some column of the table.
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj
(i.e., year of joining). Suppose you need to retrieve the details of all employees who joined
in 2012. A query searches the whole table for the required information. However, if you
partition the employee data with the year and store it in a separate file, it reduces the
query processing time. The following example shows how to partition a file and its data:
The following file contains the employeedata table:

/tab1/employeedata/file1
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013

The above data is partitioned into two files using the year:

/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012

/tab1/employeedata/2013/file3
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
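A sketch of such a query, assuming the employee table is partitioned by a string column year and that the HDFS path shown is where the 2012 data lives:

```sql
-- Register a new partition and point it at an existing HDFS directory.
-- Table name, partition column, and path are assumptions for illustration.
ALTER TABLE employee
ADD PARTITION (year='2012')
LOCATION '/tab1/employeedata/2012';
```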
Renaming a Partition
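A minimal sketch of the standard HiveQL form, assuming the same employee table partitioned by a string column year:

```sql
-- Change the value of an existing partition key; the values are assumptions.
ALTER TABLE employee PARTITION (year='2012')
RENAME TO PARTITION (year='2012_old');
```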
Dropping a Partition
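A minimal sketch, again assuming the employee table partitioned by year:

```sql
-- Remove a partition; IF EXISTS avoids an error when it is absent.
-- For a managed table this also deletes the partition's data.
ALTER TABLE employee DROP IF EXISTS PARTITION (year='2012');
```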
This chapter describes how to create and manage views. Views are generated based on
user requirements. You can save any result set data as a view. The usage of a view in Hive
is the same as that of a view in SQL. It is a standard RDBMS concept. We can execute all
DML operations on a view.
Creating a View
You can create a view at the time of executing a SELECT statement. The syntax is as
follows:
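The general HiveQL form, per the standard CREATE VIEW syntax, is:

```sql
CREATE VIEW [IF NOT EXISTS] view_name
[(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment]
AS SELECT ...;
```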
Let us take an example for view. Assume employee table as given below, with the fields Id,
Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details
who earn a salary of more than Rs 30000. We store the result in a view
named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
Dropping a View
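Dropping uses the usual DROP syntax; for the view created above:

```sql
DROP VIEW emp_30000;
```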
Creating an Index
Example
Let us take an example for index. Use the same employee table that we have used earlier
with the fields Id, Name, Salary, Designation, and Dept. Create an index named
index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes are stored using
an index value.
Dropping an Index
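The index created above can be removed with:

```sql
DROP INDEX index_salary ON employee;
```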
c. Hive Partitioning
Hive Partition: without partitioning, Hive reads all the data in the directory and then
applies the query filters to it. Since all the data has to be read, this is slow as well as
expensive.
Also, users frequently need to filter the data on specific column values. To apply
partitioning in Hive, users need to understand the domain of the data on which they are
doing analysis.
With partitioning, all the entries for the various columns of the dataset are segregated
and stored in their respective partitions. Hence, when we write a query to fetch values
from the table, only the required partitions of the table are queried, which reduces the
time taken by the query to yield the result.
d. Bucketing in Hive
Bucketing in Hive: let's suppose a scenario. At times there is a huge dataset available,
yet after partitioning on a particular field or fields, the partitioned file size doesn't
match expectations and remains huge. Still, we want to manage the partitioned results in
smaller parts. To solve this, Hive offers the bucketing concept, which allows the user to
divide table data sets into more manageable parts.
e. Vectorization In Hive
Vectorization In Hive – Hive Optimization Techniques, to improve the
performance of operations we use Vectorized query execution. Here
operations refer to scans, aggregations, filters, and joins. It happens
by performing them in batches of 1024 rows at once instead of single
row each time.
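Vectorized execution is typically switched on with session settings such as the following (availability and defaults vary by Hive version):

```sql
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
```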
f. Cost-Based Optimization in Hive
Cost-Based Optimization (CBO) in Hive optimizes the query plan using table statistics. To
enable it, set the following parameters at the start of the session:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive's "analyze" command to collect various
statistics on the tables for which we want to use CBO.
g. Hive Indexing
Hive Index: one of the best optimization techniques is indexing. Indexing will definitely
help increase your query performance. Basically, indexing creates a separate table,
called an index table, which acts as a reference for the original table.
The major advantage of indexing is that the query does not need to scan all the rows in
the table when we query a table that has an index: it checks the index first and then
goes to the particular column and performs the operation.
Hence, with indexes it is easier for a Hive query to look into the index first and then
perform the needed operations in less time.
This was all about Hive Optimization Techniques Tutorial. Hope you
like our explanation of Hive Performance Tuning.
Conclusion – Hive Optimization
Techniques
Hence, we hope this article "Top 7 Hive Optimization Techniques" helped you
understand how to optimize Hive queries for faster execution, and Hive performance
tuning with these best Hive optimization techniques: Execution Engine, Usage of
Suitable File Format, Hive Partitioning, Bucketing in Hive, Vectorization in Hive,
Cost-Based Optimization in Hive, and Hive Indexing.
a) Partitioning
The Hive command for Partitioning is:
CREATE TABLE table_name (column1 data_type, column2 data_type, ...)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...);
b) Bucketing
The Hive command for Bucketing is:
CREATE TABLE table_name
PARTITIONED BY (partition1 data_type, partition2 data_type, ...)
CLUSTERED BY (column_name1, column_name2, ...)
[SORTED BY (column_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS;
Conclusion
In conclusion to Hive Partitioning vs Bucketing, we can say that both partitions and
buckets distribute a subset of the table's data to a subdirectory. Hence, Hive organizes
tables into partitions, and it subdivides partitions into buckets.
If you like this blog or have any query related to Hive Partitioning vs Bucketing,
please let us know by leaving a comment.
Introduction
Anything and everything related to data in the 21st century has become of prime
relevance. And one of the key skills for any data science aspirant is mastering SQL
functions for effective and efficient data retrieval. SQL is widely used for querying
directly from databases and is, therefore, one of the most commonly used
languages for data analysis tasks. But it comes with its own intricacies and
nuances.
When it comes to SQL functions, there are a plethora of them. You need to know
the right function at the right time to achieve what you are looking for. But the
majority of us, including me, have a tendency to skip this topic or postpone it to
a distant future. And trust me, it is a serious mistake to leave these topics
unexplored in your learning journey.
Therefore, in this article, I will take you through some of the most common SQL
functions that you are bound to use regularly for your data analysis tasks.
If you are interested in learning SQL in a course format, please refer to our course
– Structured Query Language (SQL) for Data Science
Table of Contents
•Introducing the Dataset
•Aggregate functions in SQL
•Count
•Sum
•Average
•Min and Max
•Mathematical functions in SQL
•Absolute
•Ceil and Floor
•Truncate
•Modulo
•String functions in SQL
•Lower and Upper
•Concat
•Trim
•Date and Time functions in SQL
•Date and Time
•Extract
•Date format
•Window functions in SQL
•Rank
•Percent value
•Nth value
•Miscellaneous functions
•Convert
•Isnull
•If
I will show you the practical application of all the functions covered in this article
by working with a dummy dataset. Let’s assume there is a retail chain all over the
country. The following SQL table records who bought items from the retail shop,
on what date they bought the item, the city they are from, and the purchase
amount.
We are going to use this example as we learn the different functions in this article.
Aggregate functions
• Count
One of the most important aggregate functions is the count() function. It returns
the number of records from a column in the table. In our table, we can use
the count() function to get the number of cities where the order came from. We
do that as follows:
You would have noticed two things here. Firstly, count(*) counts rows containing null
values as well, while count(column) ignores nulls. Secondly, duplicate values are counted
multiple times. To deal with this problem, we can pair count() with the DISTINCT keyword,
which counts only the distinct values in the column.
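A sketch of both points, assuming the dataset above is stored in a table called purchases with a city column (both names are assumptions):

```sql
-- COUNT(*) counts every row; COUNT(DISTINCT city) skips NULLs and duplicates.
SELECT COUNT(*) AS total_orders,
       COUNT(DISTINCT city) AS distinct_cities
FROM purchases;
```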
• Sum
The sum can be calculated using the sum() function which works on the column
name.
But what if we want to calculate the total amount for every city?
For that to happen, we can combine this function with the Groupby clause to
group the output by the city. Here is how you can make it happen.
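For instance, assuming the hypothetical purchases table with city and amount columns:

```sql
-- Total purchase amount per city, highest first.
SELECT city, SUM(amount) AS total_amount
FROM purchases
GROUP BY city
ORDER BY total_amount DESC;
```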
This shows us that the company had Indore as the highest income generating city
for us.
• Average
Anyone who has done some data analysis in the past knows that average is a
better metric than just computing the sum of the numerical values.
In our example, we have multiple orders from the same city, therefore, it would
be more prudent to calculate the average amount rather than the total sum.
Finally, aggregate value analysis isn’t complete without computing the min and
max values. These can be simply computed using the min() and max() functions.
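Against the same assumed purchases table, the three together look like:

```sql
-- Average amount per city.
SELECT city, AVG(amount) AS avg_amount
FROM purchases
GROUP BY city;

-- Overall smallest and largest order.
SELECT MIN(amount) AS min_amount,
       MAX(amount) AS max_amount
FROM purchases;
```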
Mathematical functions
Most of the time you would have to deal with numbers in the SQL table for data
analysis. To deal with these numbers, you need mathematical functions. These
might have a trivial definition but when it comes to the analysis, they are the
most prolifically used functions.
• Absolute
abs() is the most common mathematical function. It calculates the absolute value
of a numeric value that you pass as an argument.
To understand where it is helpful, let’s first find out the deviation of the amount
for every record from the average amount from our table.
Now, as you can see we have some negative values here. These can be easily
converted to positives using the abs() function as shown below:
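A sketch of the deviation query with abs() applied, assuming the purchases table:

```sql
-- Absolute deviation of each order's amount from the overall average.
SELECT amount,
       ABS(amount - (SELECT AVG(amount) FROM purchases)) AS abs_deviation
FROM purchases;
```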
When dealing with numeric values, some of them might have decimal values.
How do you deal with those? You can simply convert them to either the next
higher integer using ceil() or the previous lower integer using floor().
In our table, the Amount column has lots of decimal values. We can convert them
to integers using ceil() or the floor() function.
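Both directions at once, against the assumed purchases table:

```sql
-- Round each amount up and down to the nearest integer.
SELECT amount,
       CEIL(amount)  AS rounded_up,
       FLOOR(amount) AS rounded_down
FROM purchases;
```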
• Truncate
Sometimes you would not want to convert a decimal value to an integer but
truncate the number of decimal places in the number. Truncate() function
achieves it. All you have to do is pass the decimal number as the first argument
and the number of places you want to truncate it to as the second argument.
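For example, with the MySQL-style TRUNCATE(number, places) and the assumed purchases table:

```sql
-- Keep one decimal place without rounding.
SELECT amount, TRUNCATE(amount, 1) AS amount_truncated
FROM purchases;
```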
As you can see, I have truncated the values to one decimal place.
• Modulo
The modulo function is a powerful and important function. Modulo returns the
remainder left when the second number divides the first number. It is used by
calling the function mod(x,y) where the result is the remainder left when x is
divided by y.
It has a very important function in the analysis. You can use it to find the odd or
even records from the SQL table. For example, in our example table, I can use
modulo function to find those records which had an odd number of quantities.
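A sketch of the odd-quantity query, where the quantity column name is an assumption:

```sql
-- Records with an odd quantity leave remainder 1 when divided by 2.
SELECT *
FROM purchases
WHERE MOD(quantity, 2) = 1;
```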
String functions
When you are working with SQL tables, you will have to deal with strings all the
time. They are especially important when you want to output the result in a
presentable way.
• Concat
concat() function joins two or more strings into one. All you have to do is provide
as argument the strings you want to concatenate.
As you would have noticed, even if one of the values is Null, the whole output is
returned as a Null value.
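Assuming the purchases table has firstname and lastname columns:

```sql
-- Join first and last names; a NULL in either makes the whole result NULL.
SELECT CONCAT(firstname, ' ', lastname) AS full_name
FROM purchases;
```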
• Trim
Trim is a very important function not just in SQL, but in any language there is. It is
one of the most important string functions. It removes any leading or trailing
whitespace from the string. For example, in our sample table, there are many
trailing and leading whitespaces in the lastname column. We can remove these
using the trim() function.
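A sketch against the same assumed table:

```sql
-- Remove leading and trailing whitespace from lastname.
SELECT TRIM(lastname) AS lastname_clean
FROM purchases;
```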
Date and Time functions
To begin with, there is no doubt about the relevance of date and time features.
But this is only the case if you know how to handle them well! Check out the
following date and time functions to master your analysis skills.
If you have a common column for date and time as I have in the sample table,
then you will need to use the date() and time() functions to extract the respective
values.
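A sketch, where the combined column name purchase_time is an assumption:

```sql
-- Split a combined datetime column into its date and time parts (MySQL).
SELECT DATE(purchase_time) AS order_date,
       TIME(purchase_time) AS order_time
FROM purchases;
```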
• Extract
But sometimes you might want to go a step further and analyze how many of the
orders were placed on a particular day of the week or month, or maybe the time
of the day. For that, you need to use the super convenient extract() function.
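For instance, with MySQL's EXTRACT units and the assumed purchase_time column:

```sql
-- Pull out the month and the hour of day each order was placed.
SELECT EXTRACT(MONTH FROM purchase_time) AS order_month,
       EXTRACT(HOUR FROM purchase_time)  AS order_hour
FROM purchases;
```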
A complete list of all the units that you can extract from the date is as follows:
As you can see, there is a lot of analysis that you can do with
the extract() function!
• Date format
Sometimes the dates in the database will be saved in a different format compared
to how you would want to view them. Therefore, to change the date format, you
can use the date_format() function. The syntax is as follows: date_format(date,
format)
Currently, the dates saved in the sample table are in the year-month-day format.
Using this function, I will output the dates in the day-month name-year format.
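With MySQL format specifiers ('%d' day, '%M' month name, '%Y' four-digit year) and the assumed purchase_time column:

```sql
-- Render year-month-day dates in day-month name-year form.
SELECT DATE_FORMAT(purchase_time, '%d %M %Y') AS formatted_date
FROM purchases;
```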
There are a lot of opportunities to change the format according to your
requirements. You can find all the format at this link.
Window functions
Here, we get the aggregate value of all the rows. Now let's use the window
function version of this aggregate function and see what happens.
As you must have noticed, we still get the aggregate sum values, but they are
segregated by the different city groups. Notice that the output is calculated for
every row.
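A sketch of that windowed sum, against the assumed purchases table:

```sql
-- The aggregate is computed per city but repeated on every row.
SELECT firstname, city, amount,
       SUM(amount) OVER (PARTITION BY city) AS city_total
FROM purchases;
```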
The OVER clause turns a simple aggregate function into a window function. The
syntax is simple and as follows:
window_function_name(<expression>) OVER
( <partition_clause> <order_clause> )
The part before the OVER clause is the aggregate function or a window
function. We will cover a few window functions in the subsequent sections.
The part after the OVER clause can be divided into two parts:
•Partition_clause defines the partition between rows. The window function
operates within each partition. The Partition by clause defines it.
•Order_clause orders the rows within the partition. The Order by clause defines it.
We will explore these in detail when we explore a few more window functions in
the subsequent sections.
• Rank
The simple window function is the rank() function. As the name suggests, it ranks
the rows within a partition group based on a condition.
It has the following syntax: Rank() Over ( Partition by <expression> Order By
<expression>)
Let’s use this function to rank the rows in our table based on the amount of order
within each city.
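Against the assumed purchases table, that looks like:

```sql
-- Highest amount in each city gets rank 1.
SELECT firstname, city, amount,
       RANK() OVER (PARTITION BY city ORDER BY amount DESC) AS city_rank
FROM purchases;
```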
Consequently, the rows have been ranked within their respective partition group
(or city).
• Percent_value
It is an important window function that finds the relative rank of a row within a
group. It determines the percentile value of each row.
Let’s use this function to determine the amount percentile of each customer in
the table.
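In MySQL the function is PERCENT_RANK(); a sketch against the assumed purchases table:

```sql
-- Relative position of each amount, from 0 (lowest) to 1 (highest).
SELECT firstname, amount,
       PERCENT_RANK() OVER (ORDER BY amount) AS amount_percentile
FROM purchases;
```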
• Nth_value
Sometimes you want to find out which row had the highest, lowest, or the nth
highest value. For example, the highest scorer in school, top sales performer, etc.
is in situations like these where you need the nth_value() windows function.
As a result, the function returns nth row value from an ordered set of rows. The
syntax is as follows:
Let’s use this function to find out who was the top buyer in the table.
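The general form is NTH_VALUE(expression, N) OVER ( partition_clause order_clause frame_clause ). A sketch for the top buyer, against the assumed purchases table:

```sql
-- The 1st firstname when ordered by amount descending is the top buyer.
-- The explicit frame makes the window cover the whole ordered set.
SELECT firstname, amount,
       NTH_VALUE(firstname, 1) OVER (
         ORDER BY amount DESC
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS top_buyer
FROM purchases;
```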
Miscellaneous functions
So far we have discussed very specific functions. Now we will explore some
miscellaneous functions that can’t be categorized within a specific functional
group but are of immense value.
• Convert
Sometimes you would want to convert the output value into a specified datatype.
Moreover, you can think of it like casting, where you can change the data type of
the value. Its syntax is simple: convert(value, type)
We can use it to convert the data type of the date column before we print the
value.
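For example, with MySQL's CONVERT(value, type) and the assumed purchase_time column:

```sql
-- Cast the datetime column down to a plain DATE before printing it.
SELECT CONVERT(purchase_time, DATE) AS order_date
FROM purchases;
```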
• Isnull
Generally, if you don't specify a value for an attribute, chances are you
will end up with some null values in the column. But you can easily deal with
them using the isnull() function.
You just have to write the expression within the function. It will return 1 for a null
and 0 otherwise.
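A sketch against the assumed purchases table:

```sql
-- 1 where lastname is NULL, 0 otherwise (MySQL ISNULL).
SELECT firstname, ISNULL(lastname) AS lastname_is_null
FROM purchases;
```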
Looks like we have some null values for lastname attribute in the table!
• If
Finally, the most important function you will ever use in SQL is the if() function. It
lets you define the if-conditionality which you encounter in any programming
language.
Using this function, let’s find out which customer paid more than 1000 amount
for their order.
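With MySQL's IF(condition, true_value, false_value) and the assumed purchases table:

```sql
-- Label orders above 1000 as 'high', the rest as 'low'.
SELECT firstname, amount,
       IF(amount > 1000, 'high', 'low') AS spend_band
FROM purchases;
```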
Moreover, the use of this function is boundless and it is rightly used regularly for
data analysis tasks.
Endnotes
To summarize, we have covered a lot of basic SQL functions that are bound to be
used quite a lot in day to day data analysis tasks. You may further broaden the
application of some of the functions by reading the following article-
•8 SQL Techniques to Perform Data Analysis for Analytics and Data Science
If you are interested in learning SQL in a course format, please refer to our course
– Structured Query Language (SQL) for Data Science
Hope this article helps you bring out more from your dataset. And if you have any
favorite SQL function that you find useful or use quite often, do comment below
and share your experience!
What are Partitions?
Hive partitions are a way to organize tables by dividing them into different parts
based on partition keys.
Partitioning is helpful when a table has one or more partition keys. Partition keys
are the basic elements that determine how the data is stored in the table.
For example: "A client has some e-commerce data that belongs to India operations, in
which each state's (38 states) operations are mentioned as a whole. If we take the
state column as the partition key and perform partitions on that India data as a whole,
we get a number of partitions (38) equal to the number of states (38) present in India,
such that each state's data can be viewed separately in partition tables."
Sample Code Snippet for partitions
1.Creation of table all_states
create table all_states(state string, district string, enrolments string)
row format delimited
fields terminated by ',';
From the above code, we do the following things:
1.Create the table all_states with 3 columns: state, district, and enrolments.
2.Load data into the table all_states.
3.Create the partition table with state as the partition key.
4.Set the partition mode to non-strict (this mode activates dynamic partitioning).
5.Load data into the partition table state_part.
6.Perform the actual processing and formation of partition tables based on state as
the partition key.
7.There are going to be 38 partition outputs in HDFS storage, with file names matching
the state names; we verify this in the final step.
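A hedged sketch of steps 2 through 6, where the input file path and the partition table name state_part are assumptions for illustration:

```sql
-- 2. Load the raw CSV into the staging table (path is an assumption).
LOAD DATA LOCAL INPATH '/home/hduser/state.csv' INTO TABLE all_states;

-- 3. Create the partitioned table with state as the partition key.
CREATE TABLE state_part (district STRING, enrolments STRING)
PARTITIONED BY (state STRING);

-- 4. Activate dynamic partitioning.
SET hive.exec.dynamic.partition.mode = nonstrict;

-- 5 & 6. Populate one partition per state from the staging table.
INSERT OVERWRITE TABLE state_part PARTITION (state)
SELECT district, enrolments, state FROM all_states;
```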
What are Buckets?
Buckets in Hive are used to segregate Hive table data into multiple files or
directories. They are used for efficient querying.
•The data present in partitions can be divided further into buckets.
•The division is performed based on the hash of a particular column that we
select in the table.
•Buckets use some form of hashing algorithm at the back end to read each record
and place it into buckets.
•In Hive, we have to enable buckets by using set hive.enforce.bucketing=true;
Step 1) Creating a bucketed table, as shown below.
•Once the data is loaded, it is automatically placed into 4 buckets.
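A sketch of such a table, with the name sample_bucket, the bucketing column, and the column types assumed from the description below:

```sql
CREATE TABLE sample_bucket (
  first_name STRING,
  job_id     INT,
  department STRING,
  salary     INT,
  country    STRING
)
CLUSTERED BY (first_name) INTO 4 BUCKETS;
```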
Step 2) Loading data into the table sample_bucket
Assuming that the "employees" table has already been created in the Hive system, in
this step we will see the loading of data from the employees table into the
table sample_bucket.
Before we start moving employees data into buckets, make sure that it
consists of column names such as first_name, job_id, department, salary, and country.
Here we are loading data into sample_bucket from the employees table.
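A sketch of the load, assuming the table names above:

```sql
-- Populate the bucketed table from the existing employees table.
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE sample_bucket
SELECT first_name, job_id, department, salary, country
FROM employees;
```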
As a result, the data from the employees table is transferred into the 4 buckets
created in Step 1.
Indexing in Hive: What is View &
Index with Example
By David Taylor
What is a View?
Views are similar to tables, which are generated based on the requirements.
•We can save any result set data as a view in Hive
•Usage is similar to views used in SQL
•All types of DML operations can be performed on a view
Creation of View:
Syntax:
CREATE VIEW <VIEWNAME> AS SELECT ...
Example:
hive> CREATE VIEW Sample_View AS SELECT * FROM employees
WHERE salary > 25000;
What is an Index?
Indexes are pointers to a particular column of a table.
•The user has to manually define the index
Syntax:
CREATE INDEX <INDEX_NAME> ON TABLE <TABLE_NAME>(column names)
Example:
CREATE INDEX sample_index ON TABLE guruhive_internaltable(id);
• Order by query
• Group by query
• Sort by
• Cluster By
• Distribute By
Creating Table in Hive
Before initiating the main topic of this tutorial, we will first create
a table to use as a reference in the following examples.
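The table itself is not shown; here is a hypothetical schema consistent with the queries below (the column set and the comma delimiter are assumptions):

```sql
CREATE TABLE employees_guru (
  Id         INT,
  Name       STRING,
  Age        INT,
  Address    STRING,
  Salary     FLOAT,
  Department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```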
Order by query:
The ORDER BY syntax in HiveQL is similar to the ORDER BY syntax in SQL.
Order by is the clause we use with the SELECT statement in Hive queries to sort
data. The Order by clause uses columns of Hive tables for sorting the particular
column values mentioned with Order by. For whichever column name we define in the
order by clause, the query selects and displays the results ordered ascending or
descending by that column's values.
If the mentioned order by field is a string, then it will display the results
in lexicographical order. At the back end, the data has to be passed on to a
single reducer.
From the above screenshot, we can observe the following:
1.The query is performed on the "employees_guru" table with the ORDER BY clause,
with Department as the defined ORDER BY column name. "Department" is a string, so
it will display results in lexicographical order.
2.This is the actual output of the query. If we observe it properly, we can see
that the results are displayed based on the Department column, such as ADMIN,
Finance, and so on, in order.
Query :
SELECT * FROM employees_guru ORDER BY Department;
Group by query:
The Group by clause uses columns of Hive tables for grouping the particular
column values mentioned with group by. For whichever column name we define in
the "group by" clause, the query selects and displays results grouped by that
column's values.
For example, the query below displays the total count of employees present in
each department. Here we have "Department" as the Group by value.
From the above screenshot, we observe the following:
1.The query is performed on the "employees_guru" table with the GROUP BY clause,
with Department as the defined GROUP BY column name.
2.The output shows the department names and the employee counts in the different
departments. All the employees belonging to a specific department are grouped
together and displayed in the results, so the result is each department name with
the total number of employees present in that department.
Query:
SELECT Department, count(*) FROM employees_guru GROUP BY
Department;
Sort by:
The Sort by clause performs on column names of Hive tables to sort the output.
We can mention DESC for sorting in descending order and ASC for ascending order.
Sort by will sort the rows before feeding them to the reducer. Sort by always
depends on column types: if the column types are numeric, it will sort in numeric
order; if the column types are string, it will sort in lexicographical order.
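A sketch, against the assumed employees_guru table:

```sql
-- Sorts within each reducer; unlike ORDER BY, no single global order is guaranteed.
SELECT Id, Name, Department
FROM employees_guru
SORT BY Id DESC;
```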
Cluster By:
Cluster By is used as an alternative to both the Distribute By and Sort By
clauses in HiveQL.
The Cluster By clause is used on tables present in Hive. Hive uses the columns in
Cluster By to distribute the rows among reducers, and Cluster By columns can go
to multiple reducers.
•It ensures sorting order of the values present across multiple reducers.
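A sketch, again against the assumed employees_guru table:

```sql
-- Equivalent to DISTRIBUTE BY Id plus SORT BY Id: rows with the same Id
-- go to the same reducer and are sorted there.
SELECT Id, Name
FROM employees_guru
CLUSTER BY Id;
```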