
DSCI 5350 - Big Data Analytics

Lecture 4 – Impala and Hive

Kashif Saeed
Lecture Outline

• RDBMS vs. Hadoop
• What is Impala
• What is Hive
• How do Impala and Hive compare
• Querying using Impala and Hive
• Industry usage

RDBMS vs. Hadoop

• RDBMS
  Fast query response time
  Support for transactions
  Capability to modify existing records
  Capability to serve thousands of simultaneous clients
  Requires 'schema on write' – the data structure has to be defined before data is written
• Hadoop is NOT an RDBMS
  Generally slower query response times
  No support for transactions
  Cannot modify existing records; can only append or overwrite
  No referential integrity enforcement
  Requires 'schema on read' – data can be written before data structures are defined

Hadoop is used for building a data lake, hence it does not have to support transaction processing or the ability to modify records.
Introduction to Impala and Hive

• Apache Hive is a high-level abstraction layer on top of MapReduce
  Uses a SQL dialect called HiveQL
  Originally developed at Facebook in 2007; now an open source Apache project
• Cloudera Impala uses a high-performance dedicated SQL engine
  Uses Impala SQL
  Inspired by Google's Dremel project
  Developed at Cloudera in 2012; now an open source Apache project

Difference between Impala and Hive

• Hive has more features
  Complex data types like arrays and maps are supported
  Highly extensible
  Uses MapReduce behind the scenes
• Impala is faster
  Specialized SQL engine offers 5-50x better performance
  Does not use MapReduce; instead uses its own engine built on MPP (massively parallel processing)
  Ideal for interactive queries and data analytics
  Comes integrated with Cloudera CDH

Apache has another open source product called Drill, which was also inspired by the Google Dremel project. Drill is not part of the Cloudera distribution, nor is it supported by Cloudera.
How Impala/Hive Load & Store Data

• Queries operate on tables just like in an RDBMS
  A table is an HDFS subdirectory containing one or more files
  Default path: /user/hive/warehouse/<table-name>
  Table data is stored in that directory
• The metadata is stored in the Metastore
  The Metastore is a service which, by default, uses an embedded Apache Derby database
  In production Hadoop environments, companies use either MySQL or Oracle for the Metastore
  The VM used for the class uses MySQL for the Metastore
Both Impala and Hive share the same Metastore and data files.
Impala and Hive Commands

Creating a Database

• Adds the database definition to Metastore
• Creates storage directory in HDFS

• Conditional creation of database (for scripting)
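
A minimal sketch of these commands (the database name is illustrative):

  CREATE DATABASE dualcore;

  -- Conditional creation, for scripting:
  CREATE DATABASE IF NOT EXISTS dualcore;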

Removing a Database

• The DROP command is used to remove a database
• Only works if the database does not contain any tables
• Use CASCADE to remove a database that has tables in it
  CASCADE only works in Hive; it does not work in Impala
  It will remove the table data from HDFS
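
A minimal sketch (database name illustrative):

  DROP DATABASE dualcore;

  -- Hive only: drop even if the database still contains tables
  DROP DATABASE dualcore CASCADE;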

Data Types

• Each column is assigned a specific data type at the time of table creation
• Hive supports some complex data types like Arrays, Maps, Structs, etc.
• Below are some common data types:
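
A representative set (types supported by both Hive and Impala):

  TINYINT, SMALLINT, INT, BIGINT   integer types of increasing size
  FLOAT, DOUBLE                    floating-point types
  STRING                           character data of arbitrary length
  BOOLEAN                          true/false values
  TIMESTAMP                        date and time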

Creating a Table

• Below is the syntax for creating a table

• Creates a subdirectory in the warehouse directory in HDFS


• Default Database:
/user/hive/warehouse/tablename
• Named Database:
/user/hive/warehouse/dbname/tablename
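
A sketch of the general form (table name, columns, and delimiter are illustrative):

  CREATE TABLE tablename (
    colname1 DATATYPE,
    colname2 DATATYPE
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;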

Creating a Table - continued

• ROW FORMAT DELIMITED
  Tells Hive to expect one row per line in the file
  Uses the newline character to determine where each row ends
• FIELDS TERMINATED BY
  Tells Hive how to determine the columns in the file
  Tab-delimited files would be specified as FIELDS TERMINATED BY '\t'
• STORED AS
  Tells Hive the file format
  STORED AS TEXTFILE is the default
  Other formats will be discussed later in this course

Example: Table Creation

• The following command creates a new table 'jobs'
• Data is stored as text with four comma-separated fields per line
• An example record follows the command
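
A sketch of the command (column names are illustrative):

  CREATE TABLE jobs (
    id INT,
    title STRING,
    salary INT,
    posted STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';

An example record in this format (hypothetical values):

  1,Data Engineer,95000,2019-06-01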

• Notice that STORED AS TEXTFILE is omitted in this example because it is the default
• This table is created in the database currently in use
• To ensure creation of the table in a specific database, you can either run the USE databasename; command first OR fully qualify the table name at the time of creation as databasename.tablename

Creating Tables based on Existing Table

• Use LIKE to create a table based on an existing table
  It will add the table metadata to the Metastore
  It will also create a new directory for the table in the data warehouse
• This uses the column names and definitions of the existing table; the new table contains no data
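
A sketch (table names are illustrative):

  CREATE TABLE jobs_2019 LIKE jobs;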

Creating Tables based on SELECT

• Used for storing the results of a query in a table
• Often called 'Create Table As Select' (CTAS)
• Column names and definitions are derived from the existing table
• The new table contains the result of the SELECT statement
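
A sketch (table names and predicate are illustrative):

  CREATE TABLE high_salary_jobs AS
  SELECT * FROM jobs
  WHERE salary > 100000;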

Controlling Table Data Location

• By default, table data is stored in the warehouse directory
• The warehouse directory may be accessible to several users
• Use LOCATION to specify the directory where table data should reside instead of the default location
• Use the ALTER TABLE … SET LOCATION command to change the location of an existing table
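
A sketch (table name and paths are illustrative):

  CREATE TABLE sales (id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/dualcore/sales';

  ALTER TABLE sales SET LOCATION '/dualcore/sales_archive';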

EXTERNAL Tables

• Dropping a table removes its definition from the Metastore and its data from HDFS
• Removing the data can be problematic, especially if done by mistake
• EXTERNAL tables preserve the data in HDFS upon deletion
• This is an extremely important feature in environments where multiple people and different scripting languages access the tables
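
A sketch of creating an external table (name, columns, and path are illustrative):

  CREATE EXTERNAL TABLE jobs (
    id INT,
    title STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/dualcore/jobs';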

To change this after the table is created:

  ALTER TABLE abc
  SET TBLPROPERTIES('EXTERNAL'='TRUE');

Exploring Tables

• SHOW TABLES - lists all tables in the current database

• DESCRIBE - lists the fields in the specified table

• DESCRIBE FORMATTED - shows the table properties

• SHOW CREATE TABLE - displays the SQL command used to create the table
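
Sketches of these commands (table name illustrative):

  SHOW TABLES;
  DESCRIBE jobs;
  DESCRIBE FORMATTED jobs;
  SHOW CREATE TABLE jobs;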

Loading Data into Tables

• Unlike an RDBMS, Impala and Hive do not validate data on insert
  Files are simply moved into place
  Loading data into tables is therefore very fast
  Errors in file format will be discovered when you perform queries
  Missing or invalid data will be represented as NULL

Loading Data from HDFS files

• You can add files to the table directory using the hdfs dfs command
• In the example below, data is loaded into the sales table
• Alternatively, use the LOAD DATA INPATH command
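
Sketches of both approaches (file paths are illustrative):

  $ hdfs dfs -put /home/training/sales.txt /user/hive/warehouse/sales/

  LOAD DATA INPATH '/dualcore/sales.txt' INTO TABLE sales;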

Overwriting Data from Files

• Add the OVERWRITE keyword to delete all records before import
  Removes all files within the table's directory
  Then moves the new file into the directory
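
A sketch (file path illustrative):

  LOAD DATA INPATH '/dualcore/sales_new.txt' OVERWRITE INTO TABLE sales;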

Appending to a Table

• You can populate a table through a query using INSERT INTO
• This appends records to the table
• You can specify a WHERE clause to control which records are appended
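
A sketch (table names and predicate are illustrative):

  INSERT INTO TABLE sales
  SELECT * FROM sales_staging
  WHERE amount > 0;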

Loading data from a Relational Database

• Sqoop can import data directly into Hive and Impala
• Add the --hive-import option to your Sqoop command
  Creates a table in the Hive Metastore
  Imports data from the RDBMS into the table's directory in HDFS
  --hive-import creates a table accessible in both Hive and Impala
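
A sketch of such a command (connection details and table name are illustrative):

  $ sqoop import \
      --connect jdbc:mysql://dbhost/dualcore \
      --username training --password training \
      --table customers \
      --hive-import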

Hands-on

• Hive – Activity 1
• Hive – Activity 2

Why do we need Hive or Impala?

• Brings data analysis to analysts instead of programmers
  No software development/programming required
  Leverages knowledge of SQL instead of programming
• More productive than writing MapReduce code
  5 lines of HiveQL/Impala SQL might be equivalent to 100 lines of MapReduce
• Works with the majority of BI tools
  BI tools establish connectivity with Hadoop at the Hive or Impala level
  This is because most BI tools work better with tables than with files

Interacting with Hive & Impala

• Hive and Impala support many interfaces
  Command line shell
  o Impala: Impala Shell
  o Hive: Beeline
  Hue Web UI
  o Hive Query Editor
  o Impala Query Editor
  o Metastore Manager
  JDBC/ODBC
  o Mostly used by BI tools
  o Most BI tools support both Hive and Impala via ODBC/JDBC

Using the Impala Shell

• Execute the impala-shell command to start the shell

• Use -i hostname:port to connect to a different server
  - Note that 21000 is the default port on which Impala listens
  - You can also use the CONNECT hostname command to connect to a different server
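
A sketch (hostname is illustrative):

  $ impala-shell
  $ impala-shell -i myserver.example.com:21000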
• Terminate statements using a semicolon
• Hit [Enter] to execute a query or a command
• Use the quit command to exit the shell
  You can also use exit to quit the shell, but quit is the preferred command
• Use impala-shell --help for a full list of options

• For a complete list of all Impala shell commands, visit https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_shell_commands.html

• Hive and Impala commands are pretty similar
