
The term ‘Big Data’ refers to collections of large datasets characterized by huge volume, high velocity, and a wide variety of data that grows day by day. Processing Big Data with traditional data management systems is difficult. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.

Hadoop

Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules, MapReduce and the Hadoop Distributed File System (HDFS).
•MapReduce: A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
•HDFS: The Hadoop Distributed File System is the storage layer of the Hadoop framework, used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and
Hive that are used to help Hadoop modules.
•Sqoop: It is used to import and export data between HDFS and RDBMS.
•Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
•Hive: It is a platform used to develop SQL type scripts to do MapReduce
operations.
Note: There are various ways to execute MapReduce operations:
•The traditional approach using Java MapReduce program for structured, semi-
structured, and unstructured data.
•The scripting approach for MapReduce to process structured and semi-structured
data using Pig.
•The Hive Query Language (HiveQL or HQL) for MapReduce to process structured
data using Hive.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It


resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.
Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

•A relational database
•A design for OnLine Transaction Processing (OLTP)
•A language for real-time queries and row-level updates

Features of Hive

•It stores schema in a database and the processed data in HDFS.


•It is designed for OLAP.
•It provides SQL type language for querying called HiveQL or HQL.
•It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name and Operation:

•User Interface: Hive is a data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).

•Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

•HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a HiveQL query for the MapReduce job and process it.

•Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.

•HDFS or HBASE: HDFS (Hadoop Distributed File System) or HBASE are the data storage techniques used to store data in the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

Step No. and Operation:

1. Execute Query: The Hive interface such as the Command Line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.

3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).

4. Send Metadata: The Metastore sends the metadata as a response to the compiler.

5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

6. Execute Plan: The driver sends the execution plan to the execution engine.

7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.

7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.

8. Fetch Result: The execution engine receives the results from the Data nodes.

9. Send Results: The execution engine sends those resultant values to the driver.

10. Send Results: The driver sends the results to the Hive interfaces.

Hive - Partitioning

Hive organizes tables into partitions. It is a way of dividing a table into related parts based
on the values of partitioned columns such as date, city, and department. Using partition, it
is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that
may be used for more efficient querying. Bucketing works based on the value of hash
function of some column of a table.
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj
(i.e., year of joining). Suppose you need to retrieve the details of all employees who joined
in 2012. A query searches the whole table for the required information. However, if you
partition the employee data with the year and store it in a separate file, it reduces the
query processing time. The following example shows how to partition a file and its data:
The following file contains employeedata table.
/tab1/employeedata/file1
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel,SC, 2013
4, Prasanth, SC, 2013

The above data is partitioned into two files using year.


/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012

/tab1/employeedata/2013/file3
3, kaleel,SC, 2013
4, Prasanth, SC, 2013
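
As a minimal sketch (the table name employee_data and the column types are assumptions), this is how such a year-partitioned table could be declared in Hive so that each year's rows land in their own directory:

hive> CREATE TABLE employee_data (id INT, name STRING, dept STRING)
    > PARTITIONED BY (yoj STRING);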

Adding a Partition

We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.

Syntax:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec


[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.

hive> ALTER TABLE employee
    > ADD PARTITION (year='2012')
    > location '/2012/part2012';

Renaming a Partition

The syntax of this command is as follows.

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION


partition_spec;
The following query is used to rename a partition:

hive> ALTER TABLE employee PARTITION (year='1203')
    > RENAME TO PARTITION (yoj='1203');

Dropping a Partition

The following syntax is used to drop a partition:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,


PARTITION partition_spec,...;
The following query is used to drop a partition:

hive> ALTER TABLE employee DROP IF EXISTS
    > PARTITION (year='1203');

This chapter describes how to create and manage views. Views are generated based on
user requirements. You can save any result set data as a view. The usage of a view in Hive
is the same as that of a view in SQL. It is a standard RDBMS concept. We can execute all
DML operations on a view.

Creating a View

You can create a view at the time of executing a SELECT statement. The syntax is as
follows:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT


column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
Example

Let us take an example for view. Assume employee table as given below, with the fields Id,
Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details
who earn a salary of more than Rs 30000. We store the result in a view
named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;

Dropping a View

Use the following syntax to drop a view:


DROP VIEW view_name

The following query drops a view named as emp_30000:


hive> DROP VIEW emp_30000;

Creating an Index

An Index is nothing but a pointer on a particular column of a table. Creating an index


means creating a pointer on a particular column of a table. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example

Let us take an example for index. Use the same employee table that we have used earlier
with the fields Id, Name, Salary, Designation, and Dept. Create an index named
index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

It is a pointer to the salary column. If the column is modified, the changes are stored using
an index value.

Dropping an Index

The following syntax is used to drop an index:


DROP INDEX <index_name> ON <table_name>

The following query drops an index named index_salary:


hive> DROP INDEX index_salary ON employee;
Best Hive Optimization Techniques – Hive Performance
Several Hive query optimization techniques are available to improve Hive performance when running Hive queries.
In this Hive Optimization Techniques article, we will learn how to optimize Hive queries so that they execute faster on our cluster. The main types of Hive optimization techniques for queries are: Execution Engine, Usage of a Suitable File Format, Hive Partitioning, Bucketing in Apache Hive, Vectorization in Hive, Cost-Based Optimization in Hive, and Hive Indexing.
So, let's start the Hive Query Optimization tutorial.

What are Hive Optimization Techniques?

As we know, Hive is a query language similar to SQL built on the Hadoop ecosystem and used to run queries on petabytes of data. There are several Hive optimization techniques to improve its performance, which we can apply when we run our Hive queries.

Types of Query Optimization Techniques in Hive

Following are the Hive optimization techniques for Hive performance tuning. Let's discuss them one by one:
a. Tez Execution Engine in Hive

We can increase the performance of a Hive query by using Tez as the execution engine. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs (DAGs) of general data processing tasks. It can be considered a much more flexible and powerful successor to the MapReduce framework.

In addition, Tez offers an API framework to developers for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads. To be more specific, it allows those data access applications to work with petabytes of data over thousands of nodes.
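
As a sketch, assuming a Hive installation with Tez available, the execution engine can be switched for the current session with the hive.execution.engine property:

hive> set hive.execution.engine=tez;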

b. Usage of Suitable File Format in Hive

Using an appropriate file format on the basis of the data will drastically increase query performance. Basically, the ORC file format is best suited for increasing query performance. ORC stands for Optimized Row Columnar, which means we can store data in a more optimized way than with other file formats.

To be more specific, ORC can reduce the size of the original data by up to 75%, so data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. Basically, it stores row data in groups, called stripes, along with a file footer. Therefore, when Hive processes the data, the ORC format improves the performance.
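
A minimal sketch (hypothetical table name orders_orc) of declaring a table stored in ORC format:

hive> CREATE TABLE orders_orc (id INT, amount DOUBLE)
    > STORED AS ORC;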

c. Hive Partitioning

Without partitioning, Hive reads all the data in the directory and then applies the query filters to it. Since all the data has to be read, this is slow as well as expensive.

Also, users often need to filter the data on specific column values. To apply partitioning in Hive, users need to understand the domain of the data on which they are doing analysis.

Basically, with partitioning, all the entries for the various values of the partition column are segregated and stored in their respective partitions. Hence, while we write a query to fetch values from the table, only the required partitions of the table are queried, which reduces the time taken by the query to yield the result.

d. Bucketing in Hive

Let's suppose a scenario: at times there is a huge dataset available, and even after partitioning on a particular field or fields, the partitioned file size doesn't match expectations and remains huge. Still, we want to manage the partition results in smaller parts. Thus, to solve this issue of partitioning, Hive offers the bucketing concept, which allows the user to divide table data sets into more manageable parts.

Hence, to maintain parts that are more manageable, we can use bucketing. Through it, the user can also set the number of these manageable parts, or buckets.

e. Vectorization in Hive

To improve the performance of operations such as scans, aggregations, filters, and joins, we use vectorized query execution. It works by performing them in batches of 1024 rows at once instead of a single row at a time.

This feature was introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameter settings:

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;

f. Cost-Based Optimization in Hive (CBO)

Before submitting a query for final execution, Hive optimizes each query's logical and physical execution plan. Historically, these optimizations were not based on the cost of the query.

CBO, a more recent addition to Hive, performs further optimizations based on query cost. That results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.

To use CBO, set the following parameters at the beginning of your query:

set hive.cbo.enable=true;

set hive.compute.query.using.stats=true;

set hive.stats.fetch.column.stats=true;

set hive.stats.fetch.partition.stats=true;

Then, prepare the data for CBO by running Hive's ANALYZE command to collect various statistics on the tables for which we want to use CBO.
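
For example, a sketch using the employee table from the earlier sections (table and column names assumed), table and column statistics can be collected with:

hive> ANALYZE TABLE employee COMPUTE STATISTICS;
hive> ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;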
g. Hive Indexing

Indexing is one of the best ways to increase query performance. Basically, indexing on the original table creates a separate table, called an index table, which acts as a reference.

As we know, there can be a large number of rows and columns in a Hive table. Without indexing, queries on only some columns will take a large amount of time, because they are executed against all the data in the table.

The major advantage of indexing is that when we run a query on a table that has an index, there is no need for the query to scan all the rows of the table. It checks the index first and then goes to the particular column and performs the operation.

Hence, maintaining indexes makes it easier for a Hive query to look into the indexes first and then perform the needed operations in less time.
This was all about the Hive Optimization Techniques tutorial. Hope you like our explanation of Hive Performance Tuning.

Conclusion – Hive Optimization Techniques

Hence, we hope this article ''Top 7 Hive Optimization Techniques'' helped you understand how to optimize Hive queries for faster execution with these best Hive optimization techniques: Execution Engine, Usage of Suitable File Format, Hive Partitioning, Bucketing in Hive, Vectorization in Hive, Cost-Based Optimization in Hive, and Hive Indexing.

Hive Partitioning vs Bucketing – Advantages and Disadvantages
In our previous Hive tutorial, we discussed Hive Data Models in detail. In this tutorial, we are going to cover the feature-wise difference between Hive partitioning vs bucketing.
This blog also covers a Hive Partitioning example, a Hive Bucketing example, and the advantages and disadvantages of Hive partitioning and bucketing.
So, let's start Hive Partitioning vs Bucketing.

What is Hive Partitioning and Bucketing?

Apache Hive is an open source data warehouse system used for querying and analyzing large datasets. Data in Apache Hive can be categorized into Tables, Partitions, and Buckets. The table in Hive is logically made up of the data being stored.
Tables are of two types: internal tables and external tables. Refer to this guide to learn what internal and external tables are and the difference between them. Let us now discuss partitioning and bucketing in Hive in detail.

Hive Partitioning vs Bucketing

• Partitioning – Apache Hive organizes tables into partitions for grouping the same type of data together based on a column or partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, we can make queries on slices of the data faster.
• Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash function of a column in the table, to give extra structure to the data that may be used for more efficient queries.

Comparison between Hive Partitioning vs Bucketing

We have taken a brief look at what Hive partitioning and Hive bucketing are. You can refer to our previous blog on Hive Data Models for a detailed study of bucketing and partitioning in Apache Hive.
In this section, we will discuss the difference between Hive partitioning and bucketing on the basis of different features in detail.

i. Partitioning and Bucketing Commands in Hive

a) Partitioning
The Hive command for partitioning is:

CREATE TABLE table_name (column1 data_type, column2 data_type)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...);

b) Bucketing
The Hive command for bucketing is:

CREATE TABLE table_name (column1 data_type, column2 data_type)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...)
CLUSTERED BY (column_name1, column_name2, ...)
[SORTED BY (column_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS;
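
A minimal sketch instantiating both commands (the table and column names are hypothetical):

CREATE TABLE orders (id INT, amount DOUBLE)
PARTITIONED BY (order_year STRING);

CREATE TABLE orders_bucketed (id INT, amount DOUBLE)
PARTITIONED BY (order_year STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;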

ii. Apache Hive Partitioning and Bucketing Example

Hive Data Model
a) Hive Partitioning Example

For example, we have a table employee_details containing the employee information of some company, such as employee_id, name, department, year, etc. Now, suppose we want to perform partitioning on the basis of the department column.
Then the information of all the employees belonging to a particular department will be stored together in that very partition. Physically, a partition in Hive is nothing but a sub-directory in the table directory.
For example, say we have data for three departments in our employee_details table – Technical, Marketing, and Sales. Thus we will have three partitions in total, one for each department, as we can see clearly in the diagram below.
For each department we will have all the data regarding that department residing in a separate sub-directory under the table directory. So, for example, all the employee data for the Technical department will be stored in user/hive/warehouse/employee_details/dept=Technical. The queries regarding Technical employees would then only have to look through the data present in the Technical partition.
Therefore, from the above example, we can conclude that partitioning is very useful. It reduces query latency by scanning only the relevant partitioned data instead of the whole data set.
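A minimal sketch of this table (the column types are assumptions) and of a query that touches only one partition:

CREATE TABLE employee_details (employee_id INT, name STRING, year INT)
PARTITIONED BY (dept STRING);

SELECT name FROM employee_details WHERE dept = 'Technical';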
b) Hive Bucketing Example

From the above diagram, we can see how each partition is bucketed into 2 buckets. Therefore each partition, say Technical, will have two files, each of them storing part of the Technical employees' data.

iii. Advantages and Disadvantages of Hive Partitioning & Bucketing

Let us now discuss the pros and cons of Hive partitioning and bucketing one by one.
a) Pros and Cons of Hive Partitioning
Pros:
• It distributes the execution load horizontally.
• With partitions, queries on low volumes of data execute faster. For example, searching the population of Vatican City returns very fast instead of searching the entire world population.
Cons:
• There is the possibility of creating too many small partitions, i.e., too many directories.
• Partitioning is effective for low-volume data, but some queries, like GROUP BY on a high volume of data, still take a long time to execute. For example, grouping the population of China will take a long time compared to grouping the population of Vatican City.
• There is no need to search the entire table column for a single record.
b) Pros and Cons of Hive Bucketing
Pros:
• It provides faster query responses, like partitioning.
• In bucketing, due to equal volumes of data in each bucket, map-side joins will be quicker.
Cons:
• We can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be managed by the programmer.
So, this was all about Hive Partitioning vs Bucketing. Hope you like
our explanation.

Conclusion
In conclusion to Hive Partitioning vs Bucketing, we can say that both partitions and buckets distribute a subset of the table's data to a subdirectory. Hence, Hive organizes tables into partitions, and it subdivides partitions into buckets.
If you like this blog or have any query related to Hive Partitioning vs Bucketing, please let us know by leaving a comment.
Introduction
Anything and everything related to data in the 21st century has become of prime
relevance. And one of the key skills for any data science aspirant is mastering SQL
functions for effective and efficient data retrieval. SQL is widely used for querying
directly from databases and is, therefore, one of the most commonly used
languages for data analysis tasks. But it comes with its own intricacies and
nuances.

When it comes to SQL functions, there are a plethora of them. You need to know
the right function at the right time to achieve what you are looking for. But the
majority of us, including me, have a tendency to skip this topic or put it off until a
distant future. And trust me, it is a serious mistake to leave these topics
unexplored in your learning journey.
Therefore, in this article, I will take you through some of the most common SQL
functions that you are bound to use regularly for your data analysis tasks.

If you are interested in learning SQL in a course format, please refer to our course
– Structured Query Language (SQL) for Data Science

Table of Contents
•Introducing the Dataset
•Aggregate functions in SQL
•Count
•Sum
•Average
•Min and Max
•Mathematical functions in SQL
•Absolute
•Ceil and Floor
•Truncate
•Modulo
•String functions in SQL
•Lower and Upper
•Concat
•Trim
•Date and Time functions in SQL
•Date and Time
•Extract
•Date format
•Windows functions in SQL
•Rank
•Percent value
•Nth value
•Miscellaneous functions
•Convert
•Isnull
•If

Introducing the Dataset

I will show you the practical application of all the functions covered in this article
by working with a dummy dataset. Let’s assume there is a retail chain all over the
country. The following SQL table records who bought items from the retail shop,
on what date they bought the item, the city they are from, and the purchase
amount.

We are going to use this example as we learn the different functions in this article.

Aggregate functions
• Count

One of the most important aggregate functions is the count() function. It returns
the number of records from a column in the table. In our table, we can use
the count() function to get the number of cities where the orders came from. We
do that as follows:
You would have noticed two things here. Firstly, null values in the column are not
counted. Secondly, duplicate values are counted multiple times. To deal with the
duplicates, we can pair count() with the DISTINCT keyword, which counts only the
distinct values in the column.
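
A sketch of both variants, assuming a hypothetical orders table with a city column standing in for the dataset:

SELECT COUNT(city) FROM orders;
SELECT COUNT(DISTINCT city) FROM orders;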

• Sum

Whenever we are dealing with numeric columns, we are bound to check their total
sum. For example, in our table, the total sum of the Amount column is important
for analyzing the sales that occurred.

The sum can be calculated using the sum() function, which works on the column
name.

But what if we want to calculate the total amount for every city?

For that, we can combine this function with the GROUP BY clause to group the
output by city. Here is how you can make it happen.
This shows us that Indore was the highest income-generating city for the company.
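
A sketch of the per-city total (same hypothetical orders table):

SELECT city, SUM(amount) AS total_amount
FROM orders
GROUP BY city;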

• Average

Anyone who has done some data analysis in the past knows that average is a
better metric than just computing the sum of the numerical values.
In our example, we have multiple orders from the same city, therefore, it would
be more prudent to calculate the average amount rather than the total sum.
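
For example (hypothetical orders table), the average amount per city:

SELECT city, AVG(amount) AS avg_amount
FROM orders
GROUP BY city;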

• Min and Max

Finally, aggregate value analysis isn’t complete without computing the min and
max values. These can be simply computed using the min() and max() functions.
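
A sketch of both (hypothetical orders table):

SELECT MIN(amount) AS min_amount, MAX(amount) AS max_amount
FROM orders;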
Mathematical functions

Most of the time you would have to deal with numbers in the SQL table for data
analysis. To deal with these numbers, you need mathematical functions. These
might have a trivial definition but when it comes to the analysis, they are the
most prolifically used functions.

• Absolute

abs() is the most common mathematical function. It calculates the absolute value
of a numeric value that you pass as an argument.

To understand where it is helpful, let’s first find out the deviation of the amount
for every record from the average amount from our table.
Now, as you can see we have some negative values here. These can be easily
converted to positives using the abs() function as shown below:
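
A sketch of that calculation (hypothetical orders table; the scalar subquery computes the overall average):

SELECT amount,
       ABS(amount - (SELECT AVG(amount) FROM orders)) AS abs_deviation
FROM orders;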

• Ceil and Floor

When dealing with numeric values, some of them might have decimal values.
How do you deal with those? You can simply convert them to either the next
higher integer using ceil() or the previous lower integer using floor().
In our table, the Amount column has lots of decimal values. We can convert them
to integers using ceil() or the floor() function.
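
For example (hypothetical orders table):

SELECT amount, CEIL(amount) AS next_int, FLOOR(amount) AS prev_int
FROM orders;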
• Truncate

Sometimes you would not want to convert a decimal value to an integer but
truncate the number of decimal places in the number. Truncate() function
achieves it. All you have to do is pass the decimal number as the first argument
and the number of places you want to truncate it to as the second argument.

As you can see, I have truncated the values to one decimal place.
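
A sketch using the MySQL-style truncate() (hypothetical orders table):

SELECT amount, TRUNCATE(amount, 1) AS amount_one_dp
FROM orders;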

• Modulo

The modulo function is a powerful and important function. Modulo returns the
remainder left when the second number divides the first number. It is used by
calling the function mod(x,y) where the result is the remainder left when x is
divided by y.

It has a very important function in the analysis. You can use it to find the odd or
even records from the SQL table. For example, in our example table, I can use
modulo function to find those records which had an odd number of quantities.

Or I could find even quantities if I negate the above result by using


the not keyword.
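
A sketch of both queries (hypothetical orders table with a quantity column):

SELECT * FROM orders WHERE MOD(quantity, 2) = 1;
SELECT * FROM orders WHERE NOT MOD(quantity, 2) = 1;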
String functions

When you are working with SQL tables, you will have to deal with strings all the
time. They are especially important when you want to output the result in a
presentable way.

• Lower and Upper

You can convert string values to uppercase or lowercase by using the upper() or lower() functions respectively. In short, this helps in bringing more consistency to the record values.
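
A sketch (hypothetical name column in the orders table):

SELECT UPPER(name) AS name_upper, LOWER(name) AS name_lower
FROM orders;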

• Concat

The concat() function joins two or more strings into one. All you have to do is provide the strings you want to concatenate as arguments.
As you would have noticed, even if one of the values is Null, the whole output is returned as a Null value.
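
For example (hypothetical firstname and lastname columns):

SELECT CONCAT(firstname, ' ', lastname) AS full_name
FROM orders;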

• Trim

Trim is a very important function, not just in SQL but in any language. It is one of the most important string functions. It removes any leading or trailing whitespace from a string. For example, in our sample table there are many trailing and leading whitespaces in the lastname column. We can remove these using the trim() function, which trims any leading or trailing whitespace from the string.
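
A sketch:

SELECT TRIM(lastname) AS lastname_clean
FROM orders;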
Date and time functions

To begin with, there is no doubt about the relevance of date and time features.
But this is only the case if you know how to handle them well! Check out the
following date and time functions to master your analysis skills.

• Date and Time

If you have a common column for date and time as I have in the sample table,
then you will need to use the date() and time() functions to extract the respective
values.
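
A sketch using the MySQL-style date() and time() functions (hypothetical purchase_time column holding both date and time):

SELECT DATE(purchase_time) AS purchase_date, TIME(purchase_time) AS purchase_clock
FROM orders;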

• Extract

But sometimes you might want to go a step further and analyze how many of the
orders were placed on a particular day of the week or month, or maybe the time
of the day. For that, you need to use the super convenient extract() function.

The syntax is simple: extract(unit from date)

The unit can be anything from year, month, to a minute, or second.


You can even extract the week of the year or the quarter of the year.

A complete list of all the units that you can extract from the date is as follows:
As you can see, there is a lot of analysis that you can do with
the extract() function!
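
For example (hypothetical purchase_time column), extracting a couple of units:

SELECT EXTRACT(MONTH FROM purchase_time) AS order_month,
       EXTRACT(HOUR FROM purchase_time) AS order_hour
FROM orders;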

• Date format

Sometimes the dates in the database will be saved in a different format compared
to how you would want to view them. Therefore, to change the date format, you
can use the date_format() function. The syntax is as follows: date_format(date,
format)

Currently, the dates saved in the sample table are in the year-month-day format.
Using this function, I will output the dates in the day-month name-year format.
There are a lot of opportunities to change the format according to your
requirements. You can find all the format at this link.
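
A sketch using MySQL-style format specifiers (hypothetical purchase_time column) to get the day-month name-year output described above:

SELECT DATE_FORMAT(purchase_time, '%d %M %Y') AS formatted_date
FROM orders;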

Windows functions

Window functions are important functions but can be tricky to understand. Therefore, we first start by understanding the basic window function.

Window function

A window function performs a calculation similar to an aggregate function, but with a slight twist. While regular aggregate functions group the rows into a single output value, a window function does not do that. The window function works on a subset of rows but does not reduce the number of rows; the rows retain their individual identity. To understand it better, let's compare it with a simple aggregate function, sum().

Here, we get the aggregate value of all the rows. Now let’s use the windows
function for this aggregate function and see what happens.
As you must have noticed, we still get the aggregate sum values but they are
segregated by the different city groups. Notice that we calculate the output for
every row.

The OVER clause turns the simple aggregate function into a windows function.
The syntax is simple and as follows:

window_function_name(<expression>) OVER
( <partition_clause> <order_clause>)

The part before the OVER clause is the aggregate function or a windows
function. We will cover a few window functions in the subsequent sections.

The part after the OVER clause can be divided into two parts:
•Partition_clause defines the partition between rows. The window function
operates within each partition. The Partition by clause defines it.
•Order_clause orders the rows within the partition. The Order by clause defines it.

We will explore these in detail when we explore a few more window functions in
the subsequent sections.
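
A sketch of the windowed sum described above (hypothetical orders table):

SELECT name, city, amount,
       SUM(amount) OVER (PARTITION BY city) AS city_total
FROM orders;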

• Rank

The simple window function is the rank() function. As the name suggests, it ranks
the rows within a partition group based on a condition.
It has the following syntax: Rank() Over ( Partition by <expression> Order By
<expression>)

Let’s use this function to rank the rows in our table based on the amount of order
within each city.

Consequently, the rows have been ranked within their respective partition group
(or city).
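
A sketch of that ranking (hypothetical orders table):

SELECT name, city, amount,
       RANK() OVER (PARTITION BY city ORDER BY amount DESC) AS amount_rank
FROM orders;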

• Percent_value

It is an important window function that finds the relative rank of a row within a
group. It determines the percentile value of each row.

Its syntax is as follows: Percent_rank() Over(Partition by <expression> Order by <expression>)

The partition clause is optional, though.

Let's use this function to determine the amount percentile of each customer in
the table.
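
A sketch, here without the optional partition clause:

SELECT name, amount,
       PERCENT_RANK() OVER (ORDER BY amount) AS amount_percentile
FROM orders;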
• Nth_value

Sometimes you want to find out which row had the highest, lowest, or the nth
highest value. For example, the highest scorer in school, the top sales performer, etc.
It is in situations like these where you need the nth_value() window function.

The function returns the nth row value from an ordered set of rows. The
syntax is as follows:

nth_value(expression, n) over (partition by <expression> order by <expression>)

Let's use this function to find out who was the top buyer in the table.
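
A sketch (the explicit frame clause makes the whole ordered set visible, so the first value is well defined):

SELECT name, amount,
       NTH_VALUE(name, 1) OVER (ORDER BY amount DESC
           RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS top_buyer
FROM orders;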
Miscellaneous functions

So far we have discussed very specific functions. Now we will explore some
miscellaneous functions that can’t be categorized within a specific functional
group but are of immense value.

• Convert

Sometimes you would want to convert the output value into a specified datatype.
Moreover, you can think of it like casting, where you can change the data type of
the value. Its syntax is simple: convert(value, type)

We can use it to convert the data type of the date column before we print the
value.
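
For example (MySQL-style convert(); hypothetical purchase_time column):

SELECT CONVERT(purchase_time, DATE) AS purchase_date
FROM orders;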
• Isnull

Generally, if you don't specify a value for an attribute, chances are you will end
up with some null values in the column. But you can easily deal with them using
the isnull() function.

You just have to write the expression within the function. It will return 1 for a null
and 0 otherwise.
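
A sketch:

SELECT lastname, ISNULL(lastname) AS is_missing
FROM orders;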

Looks like we have some null values for lastname attribute in the table!
• If

Finally, the most important function you will ever use in SQL is the if() function. It
lets you define the if-conditionality which you encounter in any programming
language.

It has a simple syntax: if(expression, value_if_true, value_if_false)

Using this function, let’s find out which customer paid more than 1000 amount
for their order.
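
A sketch of that query:

SELECT name, amount,
       IF(amount > 1000, 'High value', 'Regular') AS order_type
FROM orders;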

Moreover, the use of this function is boundless and it is rightly used regularly for
data analysis tasks.

Endnotes

To summarize, we have covered a lot of basic SQL functions that are bound to be
used quite a lot in day to day data analysis tasks. You may further broaden the
application of some of the functions by reading the following article-
•8 SQL Techniques to Perform Data Analysis for Analytics and Data Science

If you are interested in learning SQL in a course format, please refer to our course
– Structured Query Language (SQL) for Data Science
Hope this article helps you bring out more from your dataset. And if you have any
favorite SQL function that you find useful or use quite often, do comment below
and share your experience!

Hive Partitions & Buckets with Example

Tables, partitions, and buckets are the parts of Hive data modeling.

What are Partitions?

Hive Partitioning is a way to organize tables by dividing them into different parts based on partition keys.
Partitioning is helpful when the table has one or more partition keys. Partition keys are basic elements for determining how the data is stored in the table.
For example: a client has some e-commerce data that belongs to India operations, in which each state's (38 states) operations are recorded as a whole. If we take the state column as the partition key and perform partitions on that India data as a whole, we get a number of partitions (38 partitions) equal to the number of states (38) present in the data. Each state's data can then be viewed separately in the partition tables.
Sample Code Snippet for Partitions

1. Creation of table allstates

create table allstates(state string, District string, Enrolments string)
row format delimited
fields terminated by ',';

2. Loading data into the created table allstates

Load data local inpath '/home/hduser/Desktop/AllStates.csv' into table allstates;

3. Creation of partition table

create table state_part(District string, Enrolments string) PARTITIONED BY(state string);

4. For partitioning we have to set this property

set hive.exec.dynamic.partition.mode=nonstrict;

5. Loading data into the partition table

INSERT OVERWRITE TABLE state_part PARTITION(state)
SELECT district, enrolments, state FROM allstates;

6. Actual processing and formation of partition tables based on state as the partition key

7. There are going to be 38 partition outputs in HDFS storage, with the file names as the state names. We will check this in this step.

The following screenshots show the execution of the above-mentioned code.

From the above code, we do the following things:
1. Creation of the table allstates with 3 columns: state, district, and enrolments
2. Loading data into the table allstates
3. Creation of the partition table with state as the partition key
4. Setting the partition mode to non-strict (this mode activates dynamic partitioning)
5. Loading data into the partition table state_part
6. Actual processing and formation of partition tables based on state as the partition key
7. There are going to be 38 partition outputs in HDFS storage, with the file names as the state names. In this step, we see the 38 partition outputs in HDFS.

What are Buckets?

Buckets in Hive are used to segregate Hive table data into multiple files or directories. They are used for efficient querying.
•The data present in partitions can be divided further into buckets.
•The division is performed based on the hash of particular columns that we select in the table.
•Buckets use some form of hashing algorithm at the back end to read each record and place it into a bucket.
•In Hive, we have to enable bucketing by using set hive.enforce.bucketing=true;
Step 1) Creating Bucket as shown below.

From the above screenshot:

•We are creating sample_bucket with column names such as first_name, job_id, department, salary, and country.
•We are creating 4 buckets here.
•Once the data gets loaded, it is automatically placed into the 4 buckets.
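
A minimal sketch of what that creation statement could look like (the column types and the bucketing column are assumptions, since the original shows them only in a screenshot):

CREATE TABLE sample_bucket (first_name STRING, job_id INT, department STRING, salary DOUBLE, country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS;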
Step 2) Loading data into table sample_bucket
Assuming that the "employees" table is already created in the Hive system, in this
step we will see the loading of data from the employees table into the sample_bucket table.
Before we start moving the employees data into buckets, make sure that it
consists of column names such as first_name, job_id, department, salary, and country.
Here we are loading data into sample_bucket from the employees table.
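
A sketch of that load (assumes the column order of the sample_bucket table above):

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE sample_bucket
SELECT first_name, job_id, department, salary, country FROM employees;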

Step 3) Displaying the 4 buckets created in Step 1

From the above screenshot, we can see that the data from the employees table
has been transferred into the 4 buckets created in Step 1.
Indexing in Hive: What is View & Index with Example

What is a View?

Views are similar to tables, and are generated based on requirements.
• We can save any result set data as a view in Hive.
• Usage is similar to views in SQL.
• All types of DML operations can be performed on a view.

Creation of View:

Syntax:
Create VIEW <VIEWNAME> AS SELECT

Example:
hive> CREATE VIEW Sample_View AS SELECT * FROM employees
      WHERE salary > 25000;

In this example, we are creating view Sample_View where it will


display all the row values with salary field greater than 25000.

What is an Index?

Indexes are pointers to a particular column of a table.
• The user has to manually define the index.
• Wherever we create an index, it means that we are creating a pointer to a particular column of the table.
• Any changes made to the column present in tables are stored using the index value created on the column name.

Syntax:
Create INDEX <INDEX_NAME> ON TABLE < TABLE_NAME(column 
names)>

Example:
Create INDEX sample_Index ON TABLE 
guruhive_internaltable(id)

Here we are creating index on table guruhive_internaltable for column


name id.
Hive Queries: Order By, Group By, Distribute By, Cluster By Examples

Hive provides a SQL-type querying language for ETL purposes on top of the Hadoop file system. Hive Query Language (HiveQL) provides a SQL-type environment in Hive to work with tables, databases, and queries.
We can have different types of clauses associated with Hive to perform different types of data manipulation and querying. For better connectivity with different nodes outside the environment, Hive provides JDBC connectivity as well.
Hive queries provide the following features:
• Data modeling such as Creation of databases, tables, etc.

• ETL functionalities such as Extraction, Transformation, and


Loading data into tables
• Joins to merge different data tables

• User specific custom scripts for ease of code

• Faster querying tool on top of Hadoop

In this article, you will learn-


• Order by query

• Group by query

• Sort by

• Cluster By

• Distribute By
Creating Table in Hive
Before starting with our main topic for this tutorial, first we will create a table to use as a reference for the following tutorial.

Here in this tutorial, we are going to create a table "employees_guru" with 6 columns.

From the above screenshot:

1. We are creating the table "employees_guru" with 6 columns: Id, Name, Age, Address, Salary, and Department, which belong to the employees present in the organization "guru."
2. Here in this step we are loading data into the employees_guru table. The data that we are going to load will be placed under the Employees.txt file.
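
A minimal sketch of those two steps (the column types and the file path are assumptions, since the original shows them only in screenshots):

CREATE TABLE employees_guru (Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/hduser/Employees.txt' INTO TABLE employees_guru;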

Order by query:
The ORDER BY syntax in HiveQL is similar to the syntax of ORDER BY in the SQL language.
Order by is the clause we use with the SELECT statement in Hive queries to sort data. The Order by clause uses columns of Hive tables for sorting particular column values mentioned with Order by. For whatever column name we define the Order by clause on, the query selects and displays results in ascending or descending order of that particular column's values.
If the mentioned Order by field is a string, then it will display the results in lexicographical order. At the back end, the data has to be passed on to a single reducer.
From the above screenshot, we can observe the following:
1. It is the query performed on the "employees_guru" table with the ORDER BY clause, with Department as the defined ORDER BY column name. "Department" is a string, so it will display results based on lexicographical order.
2. This is the actual output for the query. If we observe it properly, we can see that the results are displayed based on the Department column, such as ADMIN, Finance, and so on, in order.
Query:
SELECT * FROM employees_guru ORDER BY Department;

Group by query:
The Group by clause uses columns of Hive tables for grouping particular column values mentioned with the Group by. For whatever column name we define a Group by clause on, the query selects and displays results grouped by that particular column's values.
For example, in the below screenshot it is going to display the total count of employees present in each department. Here we have "Department" as the Group by value.
From the above screenshot, we observe the following:
1. It is the query performed on the "employees_guru" table with the GROUP BY clause, with Department as the defined GROUP BY column name.
2. The output shown here is the department name and the employee count in the different departments. Here all the employees belonging to a specific department are grouped and displayed in the results. So the result is the department name with the total number of employees present in each department.
Query:
SELECT Department, count(*) FROM employees_guru GROUP BY Department;
Sort by:
The Sort by clause performs on column names of Hive tables to sort the output. We can mention DESC for sorting in descending order and ASC for ascending order.
Sort by sorts the rows before feeding them to the reducer. Sort by always depends on column types: if the column type is numeric it will sort in numeric order; if the column type is string it will sort in lexicographical order.
From the above screenshot, we can observe the following:
1. It is the query performed on the table "employees_guru" with the SORT BY clause, with "Id" as the defined SORT BY column name. We used the keyword DESC.
2. So the output displayed will be in descending order of "Id".
Query:
SELECT * from employees_guru SORT BY Id DESC;

Cluster By:
Cluster By is used as an alternative for both the Distribute By and Sort By clauses in HiveQL.
The Cluster By clause is used on tables present in Hive. Hive uses the columns in Cluster By to distribute the rows among reducers, so Cluster By columns can go to multiple reducers.
• It ensures the sorting order of values present in multiple reducers.
For example, the Cluster By clause is mentioned on the Id column of the employees_guru table. The output when executing this query will distribute results to multiple reducers at the back end. But as a front end it is an alternative clause for both Sort By and Distribute By.
This is actually the back-end process when we perform a query with Sort By, Group By, and Cluster By in terms of the MapReduce framework. So if we want to store results into multiple reducers, we go with Cluster By.
From the above screenshot, we get the following observations:
1. It is the query that performs the CLUSTER BY clause on the Id field value. Here it is going to get a sort on the Id values.
2. It displays the Id and Names present in guru_employees, sort ordered by Id.
Query:
SELECT Id, Name from employees_guru CLUSTER BY Id;
Distribute By:
The Distribute By clause is used on tables present in Hive. Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By column values will go to the same reducer.

• It ensures each of the N reducers gets non-overlapping ranges of the column.
• It doesn't sort the output of each reducer.
From the above screenshot, we can observe the following:
1. The DISTRIBUTE BY clause is performed on the Id of the "employees_guru" table.
2. The output shows Id and Name. At the back end, rows with the same Id will go to the same reducer.
Query:
SELECT Id, Name from employees_guru DISTRIBUTE BY Id;
