
DSCI 5350 - Big Data Analytics

Lecture 4 – Impala and Hive

Kashif Saeed
Lecture Outline

• RDBMS vs. Hadoop
• What is Impala
• What is Hive
• How do Impala and Hive compare
• Querying using Impala and Hive
• Industry usage

RDBMS vs. Hadoop

• RDBMS
  Fast query response time
  Support for transactions
  Capability to modify existing records
  Capability to serve thousands of simultaneous clients
  Requires 'schema on write' – the data structure has to be defined before data is written
• Hadoop is NOT an RDBMS
  Generally slower query response times
  No support for transactions
  Cannot modify existing records; can only append or overwrite
  No referential integrity enforcement
  Requires 'schema on read' – data can be written before data structures are defined

Hadoop is used for building a data lake, hence it does not have to support transaction processing or the ability to modify records.
Introduction to Impala and Hive

• Apache Hive is a high-level abstraction layer on top of MapReduce
  Uses a SQL dialect called HiveQL
  Originally developed at Facebook in 2007; now an open source Apache project
• Cloudera Impala uses a high-performance dedicated SQL engine
  Uses Impala SQL
  Inspired by Google's Dremel project
  Developed at Cloudera in 2012; now an open source Apache project

Difference between Impala and Hive

• Hive has more features
  Complex data types like arrays and maps are supported
  Highly extensible
  Uses MapReduce behind the scenes
• Impala is faster
  Specialized SQL engine offers 5-50x better performance
  Does not use MapReduce; instead uses its own engine built on MPP (massively parallel processing)
  Ideal for interactive queries and data analytics
  Comes integrated with Cloudera CDH

Apache has another open source product called Drill, which was also inspired by the Google Dremel project. Drill is not part of the Cloudera distribution, nor is it supported by Cloudera.
How Impala/Hive Load & Store Data

• Queries operate on tables just like in an RDBMS
  A table is an HDFS subdirectory containing one or more files
  Default path: /user/hive/warehouse/<table-name>
  Table data is stored in that directory
• The metadata is stored in the Metastore
  The Metastore is a service which, by default, uses an embedded Apache Derby database
  In production Hadoop environments, companies use either MySQL or Oracle for the Metastore
  The VM used for the class uses MySQL for the Metastore
Both Impala and Hive share the same Metastore and data files.
Impala and Hive Commands

Creating a Database

• Adds the database definition to Metastore
• Creates storage directory in HDFS

• Conditional creation of database (for scripting)
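
A minimal sketch of these commands (the database name is illustrative):

  CREATE DATABASE dualcore;

  -- Conditional creation, for scripting:
  CREATE DATABASE IF NOT EXISTS dualcore;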

Removing a Database

• The DROP command is used to remove a database
• Only works if the database does not contain any tables
• Use CASCADE to remove a database that has tables in it
  CASCADE only works in Hive; it does not work in Impala
  It will remove the table data from HDFS
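
A minimal sketch (database name illustrative):

  DROP DATABASE dualcore;

  -- Hive only: drop even if the database still contains tables
  DROP DATABASE dualcore CASCADE;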

Data Types

• Each column is assigned a specific data type at the time of table creation
• Hive supports some complex data types like Arrays, Maps, Structs, etc.
• Below are some common data types:
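
A representative set (types supported by both Hive and Impala):

  TINYINT, SMALLINT, INT, BIGINT   integer types of increasing size
  FLOAT, DOUBLE                    floating-point types
  STRING                           character data of arbitrary length
  BOOLEAN                          true/false values
  TIMESTAMP                        date and time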

Creating a Table

• Below is the syntax for creating a table

• Creates a subdirectory in the warehouse directory in HDFS


• Default Database:
/user/hive/warehouse/tablename
• Named Database:
/user/hive/warehouse/dbname/tablename
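
A sketch of the general form (table name, columns, and delimiter are illustrative):

  CREATE TABLE tablename (
    colname1 DATATYPE,
    colname2 DATATYPE
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;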

Creating a Table - continued

• ROW FORMAT DELIMITED
  Tells Hive to expect one row per line in the file
  Uses the newline character to determine where each row ends
• FIELDS TERMINATED BY
  Tells Hive how to determine the columns in the file
  Tab-delimited files would be specified as FIELDS TERMINATED BY '\t'
• STORED AS
  Tells Hive the file format
  STORED AS TEXTFILE is the default
  Other formats will be discussed later in this course

Example: Table Creation

• The following command creates a new table 'jobs'
• Data is stored as text with four comma-separated fields per line
• An example record follows the command
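
A sketch of the command (column names are illustrative):

  CREATE TABLE jobs (
    id INT,
    title STRING,
    salary INT,
    posted STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';

An example record in this format (hypothetical values):

  1,Data Engineer,95000,2019-06-01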

• Notice that STORED AS TEXTFILE is omitted in this example because it is the default
• This table is created in the database currently in use
• To ensure creation of the table in a specific database, you can either run the USE databasename; command first OR fully qualify the table name at the time of creation as databasename.tablename

Creating Tables based on Existing Table

• Use LIKE to create a table based on an existing table
  It will add the table metadata to the Metastore
  It will also create a new directory for the table in the data warehouse
• This uses the column names and definitions of the existing table; the new table contains no data
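
A sketch (table names are illustrative):

  CREATE TABLE jobs_2019 LIKE jobs;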

Creating Tables based on SELECT

• Used for storing the results of a query in a table
• Often called 'Create Table As Select' (CTAS)
• Column names and definitions are derived from the existing table
• The new table contains the result of the SELECT statement
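
A sketch (table names and predicate are illustrative):

  CREATE TABLE high_salary_jobs AS
  SELECT * FROM jobs
  WHERE salary > 100000;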

Controlling Table Data Location

• By default, table data is stored in the warehouse directory
• The warehouse directory may be accessible to several users
• Use LOCATION to specify the directory where table data should reside instead of the default location
• Use the ALTER TABLE … SET LOCATION command to change the location of an existing table
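
A sketch (table name and paths are illustrative):

  CREATE TABLE sales (id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/dualcore/sales';

  ALTER TABLE sales SET LOCATION '/dualcore/sales_archive';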

EXTERNAL Tables

• Dropping a table removes its definition from the Metastore and its data from HDFS
• Removing the data can be problematic, especially if done by mistake
• EXTERNAL tables preserve the data in HDFS upon deletion
• This is an extremely important feature in environments where multiple people and different scripting languages access the tables
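
A sketch of creating an external table (name, columns, and path are illustrative):

  CREATE EXTERNAL TABLE jobs (
    id INT,
    title STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/dualcore/jobs';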

To change this after the table is created:

  ALTER TABLE abc
  SET TBLPROPERTIES('EXTERNAL'='TRUE');

Exploring Tables

• SHOW TABLES - lists all tables in the current database

• DESCRIBE - lists the fields in the specified table

• DESCRIBE FORMATTED - shows the table properties

• SHOW CREATE TABLE - displays the SQL command used to create the table
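
Sketches of these commands (table name illustrative):

  SHOW TABLES;
  DESCRIBE jobs;
  DESCRIBE FORMATTED jobs;
  SHOW CREATE TABLE jobs;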

Loading Data into Tables

• Unlike an RDBMS, Impala and Hive do not validate data on insert
  Files are simply moved into place
  Loading data into tables is therefore very fast
  Errors in file format will be discovered when you perform queries
  Missing or invalid data will be represented as NULL

Loading Data from HDFS files

• You can add files to the table directory using the hdfs dfs command
• In the example below, data is loaded into the sales table
• Alternatively, use the LOAD DATA INPATH command
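
Sketches of both approaches (file paths are illustrative):

  $ hdfs dfs -put /home/training/sales.txt /user/hive/warehouse/sales/

  LOAD DATA INPATH '/dualcore/sales.txt' INTO TABLE sales;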

Overwriting Data from Files

• Add the OVERWRITE keyword to delete all records before import
  Removes all files within the table's directory
  Then moves the new file into the directory
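
A sketch (file path illustrative):

  LOAD DATA INPATH '/dualcore/sales_new.txt' OVERWRITE INTO TABLE sales;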

Appending to a Table

• You can populate a table through a query using INSERT INTO
• This appends records to the table
• You can specify a WHERE clause to control which records are appended
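
A sketch (table names and predicate are illustrative):

  INSERT INTO TABLE sales
  SELECT * FROM sales_staging
  WHERE amount > 0;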

Loading data from a Relational Database

• Sqoop can import data directly into Hive and Impala
• Add the --hive-import option to your Sqoop command
  Creates a table in the Hive Metastore
  Imports data from the RDBMS into the table's directory in HDFS
  --hive-import creates a table accessible in both Hive and Impala
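
A sketch of such a command (connection details and table name are illustrative):

  $ sqoop import \
      --connect jdbc:mysql://dbhost/dualcore \
      --username training --password training \
      --table customers \
      --hive-import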

Hands-on

• Hive – Activity 1
• Hive – Activity 2

Why do we need Hive or Impala?

• Brings data analysis to analysts instead of programmers
  No software development/programming required
  Leverages knowledge of SQL instead of programming
• More productive than writing MapReduce code
  5 lines of HiveQL/Impala SQL might be equivalent to 100 lines of MapReduce
• Works with the majority of BI tools
  BI tools establish connectivity with Hadoop at the Hive or Impala level
  This is because most BI tools work better with tables than with files

Interacting with Hive & Impala

• Hive and Impala support many interfaces
  Command line shell
  o Impala: Impala Shell
  o Hive: Beeline
  Hue Web UI
  o Hive Query Editor
  o Impala Query Editor
  o Metastore Manager
  JDBC/ODBC
  o Mostly used by BI tools
  o Most BI tools support both Hive and Impala via ODBC/JDBC

Using the Impala Shell

• Execute the impala-shell command to start the shell

• Use -i hostname:port to connect to a different server
  - Note that 21000 is the default port on which Impala listens
  - You can also use the CONNECT hostname command to connect to a different server
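
A sketch (hostname is illustrative):

  $ impala-shell
  $ impala-shell -i myserver.example.com:21000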
• Terminate statements using a semicolon
• Hit [Enter] to execute a query or a command
• Use the quit command to exit the shell
  You can also use exit to quit the shell, but quit is the preferred command
• Use impala-shell --help for a full list of options

• For a complete list of all Impala shell commands, visit https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_shell_commands.html

• Hive and Impala commands are pretty similar
