
Best Practices

Native Hadoop tool Sqoop

Version: 2.00
Date: XXX

Copyright 2013 by Teradata. All Rights Reserved.

History / Issues / Approvals

Revision History

Date            Version  Description                                       Author
March 3, 2013   1.0      Initial Draft                                     Sarang Patil
March 12, 2013  1.5      Removed other tools and added Sqoop of Teradata   Sarang Patil
April 2, 2013   2.0      Added interface to Teradata ASTER                 Sarang Patil
April 5, 2013   2.2      For Internal DI CoE Review                        Sarang Patil
April 9, 2013   2.3      Review Comments included                          Sarang Patil

Table 1: Revision History

Issues Documentation
The following Issues were defined during Design/Preparation.

Raised By                       Issue                                                  Date Needed  Resolution/Answer                               Date Completed  Resolved By
Sada Shiro / Deepak Mangani l   We need to include some examples                       -            Section added.                                  -               -
Chris Ward                      Needs to understand the internal of Sqoop connector    -            Section added; No UDA available at this point   -               -

Table 2: Issues Documentation

Approvals
The undersigned acknowledge they have reviewed the high-level architecture design and agree with its contents.
Name          Role                         Email Approval                                         Approved Version
Steve Fox     Sr. Architect DI CoE         RE List of potential BP reviewers.msg                  Version 2.2
Alex Tuabman  Data Integration Consultant  Approve Best Practice - Native Hadoop Tool Sqoop.msg   Version 2.2

Table 3: Document Signoff

Teradata Confidential and Proprietary


Table of Contents
Executive summary
Apache Sqoop - Overview
    Ease of Use
    Ease of Extension
    Security
Apache Sqoop help tool
Best practices for Sqoop Installation
    Server installation
    Installing Dependencies
    Configuring Server
    Server Life Cycle
    Client installation
    Debugging information
Best practices for importing data to Hadoop
    Importing data to HDFS
    Importing Data into Hive
    Importing Data into HBase
Best practices for exporting data from Hadoop
Best practices NoSQL database
Best practices operational
    Operational Do's
    Operational Don'ts
Sqoop Examples
    hdfs to Teradata/Teradata Aster
        Moving entire table to hdfs
        Moving entire table to hive
        Moving entire table to Hbase
    Teradata/Teradata Aster to hdfs
        Moving entire table from hdfs to Teradata
        Moving entire table from hdfs to Teradata ASTER
Sqoop informational links



Summary


Table of Figures
Figure 1: Sqoop Architecture
Figure 2: Sqoop Import Job
Figure 3: Sqoop Export Job


Best Practices Native Hadoop Tool Sqoop

Executive summary
Currently there are three proven, standard methods to interface Hadoop with Teradata and with
Teradata Aster:
1. Using Flat file interfaces
a. Available for Teradata
b. Available for Teradata Aster
2. Using SQL-H Interface.
a. Will be available for Teradata in Q3 of 2013
b. Available for Teradata Aster
3. Using Apache tool Sqoop
a. Available for Teradata
b. Available for Teradata Aster
In a big data environment it is not recommended to write and read the data multiple times, so the flat
file interface should be used only when none of the other options is available. The best practices
for the flat file interfaces are documented in Best Practices for Teradata Tools and Utilities.
Detailed best practices for SQL-H are documented in a separate document, Best Practices for Aster Data
Integration. Currently (Q2 2013) the SQL-H interface is available for the Teradata Aster platform. The
SQL-H interface for Teradata will be available in Q3 of 2013.
The scope of this document is detailed best practices for the native Hadoop tool Sqoop. The current
version of Sqoop is Sqoop 2.


Apache Sqoop - Overview


Using Hadoop for analytics and data processing requires loading data into clusters and processing it in
conjunction with other data that often resides in production databases across the enterprise. Loading
bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on
large clusters, can be a challenging task. Users must consider details such as ensuring data consistency,
the consumption of production system resources, and data preparation for provisioning the downstream
pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data
residing on external systems from within MapReduce applications complicates those applications and
exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is a top-level project of the Apache Software
Foundation, having graduated from the Apache Incubator in March 2012. More information on this project
can be found at http://sqoop.apache.org.
Sqoop allows easy import and export of data from structured data stores such as relational databases,
enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external
system onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to
schedule and automate import and export tasks. Sqoop uses a connector-based architecture which
supports plugins that provide connectivity to new external systems.
What happens underneath the covers when you run Sqoop is very straightforward. The dataset being
transferred is sliced up into different partitions and a map-only job is launched with individual mappers
responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe
manner since Sqoop uses the database metadata to infer the data types.
In the rest of this document we will walk through examples that show the various ways you can use Sqoop.
The goal is to give an overview of Sqoop operation without going into much detail or advanced
functionality.
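The slicing described above can be made concrete with a small sketch. For an integer split column, the Sqoop 1 user guide describes dividing the range between the column's minimum and maximum value evenly among the mappers; in Sqoop 1 the column and the number of mappers are chosen with the --split-by and --num-mappers options. The function name split_ranges is hypothetical, and the arithmetic only approximates the documented behaviour: Sqoop's real splitters also handle dates, text and other types, which this sketch does not attempt.

```shell
#!/bin/sh
# Sketch: approximate how an integer split column is divided among mappers.
# split_ranges MIN MAX MAPPERS prints one "lo hi" pair per mapper; each mapper
# would read the rows with lo <= key < hi (the last hi is MAX + 1).
split_ranges() {
    min=$1
    max=$2
    mappers=$3
    span=$(( max - min ))
    i=0
    while [ "$i" -lt "$mappers" ]; do
        lo=$(( min + span * i / mappers ))
        if [ "$i" -eq $(( mappers - 1 )) ]; then
            hi=$(( max + 1 ))          # last slice includes the maximum key
        else
            hi=$(( min + span * (i + 1) / mappers ))
        fi
        echo "$lo $hi"
        i=$(( i + 1 ))
    done
}

# Keys from 0 to 1000 spread over four mappers.
split_ranges 0 1000 4
```

With a minimum key of 0, a maximum of 1000 and four mappers, this prints the slices 0 250, 250 500, 500 750 and 750 1001. A heavily skewed split column would give the mappers very unequal amounts of work, which is why the choice of split column matters.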


Figure 1: Sqoop Architecture

Ease of Use
Whereas Sqoop 1 requires client-side installation and configuration, Sqoop 2 will be installed and
configured server-side. This means that connectors will be configured in one place, managed by the
Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database
connectivity will only be needed on the server. Sqoop 2 will be a web-based service: front-ended by a
Command Line Interface (CLI) and browser and back-ended by a metadata repository. Moreover, Sqoop
2's service level integration with Hive and HBase will be on the server-side. Oozie will manage Sqoop
tasks through the REST API. This decouples Sqoop internals from Oozie, i.e. if you install a new Sqoop
connector then you won't need to install it in Oozie also.

Ease of Extension
In Sqoop 2, connectors will no longer be restricted to the JDBC model, but can rather define their own
vocabulary, e.g. Couchbase no longer needs to specify a table name, only to overload it as a backfill or
dump operation.


Common functionality will be abstracted out of connectors, holding them responsible only for data
transport. The reduce phase will implement common functionality, ensuring that connectors benefit
from future development of functionality.
Sqoop 2's interactive web-based UI will walk users through import/export setup, eliminating redundant
steps and omitting incorrect options. Connectors will be added in one place, with the connectors
exposing necessary options to the Sqoop framework. Thus, users will only need to provide information
relevant to their use-case.
With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more
predictable. In the same way, the user will not need to be aware of the functionality of all connectors.
As a result, connectors no longer need to provide downstream functionality, transformations, and
integration with other systems. Hence, the connector developer no longer has the burden of
understanding all the features that Sqoop supports.

Security
Currently, Sqoop operates as the user that runs the 'sqoop' command. The security principal used by a
Sqoop job is determined by what credentials the users have when they launch Sqoop. Going forward,
Sqoop 2 will operate as a server based application with support for securing access to external systems
via role-based access to Connection objects. For additional security, Sqoop 2 will no longer allow code
generation, will no longer require direct access to Hive and HBase, and will not open up access to all
clients to execute jobs.
Sqoop 2 will introduce Connections as First-Class Objects. Connections, which will encompass
credentials, will be created once and then used many times for various import/export jobs. Connections
will be created by the Admin and used by the Operator, thus preventing credential abuse by the end
user. Furthermore, Connections can be restricted based on operation (import/export). By limiting the
total number of physical Connections open at one time, and with an option to disable Connections,
resources can be managed.


Apache Sqoop help tool


Sqoop ships with a help tool. To display a list of all available tools,
type the following command:
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.


Best practices for Sqoop Installation


Sqoop ships as one binary package; however, it is composed of two separate parts: client and server.
You need to install the server on a single node in your cluster. This node will then serve as an entry
point for all connecting Sqoop clients. The server acts as a MapReduce client, and therefore Hadoop must
be installed and configured on the machine hosting the Sqoop server. Clients can be installed on any
number of machines. The client does not act as a MapReduce client, and thus you do not need to install
Hadoop on nodes that will act only as a Sqoop client.

Server installation
Copy the Sqoop artifact to the machine where you want to run the Sqoop server. This machine must have
Hadoop installed and configured. You don't need to run any Hadoop-related services there; however, the
machine must be able to act as a Hadoop client. You should be able to list HDFS, for example:

hadoop dfs -ls

The Sqoop server supports multiple Hadoop versions. However, as Hadoop major versions are not
compatible with each other, Sqoop has multiple binary artifacts, one for each supported major version
of Hadoop. You need to make sure that you're using the appropriate binary artifact for your specific
Hadoop version. To install the Sqoop server, decompress the appropriate distribution artifact in a
location of your choice and change your working directory to this folder.

# Decompress Sqoop distribution tarball
tar -xvf sqoop-<version>-bin-hadoop<hadoop-version>.tar.gz
# Move the decompressed directory (not the tarball) to any location
mv sqoop-<version>-bin-hadoop<hadoop-version> /usr/lib/sqoop
# Change working directory
cd /usr/lib/sqoop

Installing Dependencies
You need to install the Hadoop libraries into the Sqoop server war file. Sqoop provides the convenience
script addtowar.sh to do so. If you have installed Hadoop in the usual location in /usr/lib and the
hadoop executable is in your path, you can use the automatic Hadoop installation procedure:


./bin/addtowar.sh -hadoop-auto

If you have Hadoop installed in a different location, you will need to manually specify the Hadoop
version and the path to the Hadoop libraries. Use the -hadoop-version parameter to specify the
Hadoop major version; versions 1.x and 2.x are currently supported. The path to the Hadoop libraries is
specified with the -hadoop-path parameter. If your Hadoop libraries are in multiple folders, specify
all of them separated by a colon (:).
Example of manual installation:

./bin/addtowar.sh -hadoop-version 2.0 -hadoop-path /usr/lib/hadoop-common:/usr/lib/hadoop-hdfs:/usr/lib/hadoop-yarn

Lastly, you might need to install JDBC drivers that are not bundled with Sqoop because of incompatible
licenses. You can add any Java jar file to the Sqoop server using the addtowar.sh script with the -jars
parameter. As with the Hadoop path, you can enter multiple jars separated by a colon (:).
Example of installing the MySQL JDBC driver into the Sqoop server:

./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-*-bin.jar

Configuring Server
Before starting the server you should revise the configuration to match your specific environment.
Server configuration files are stored in the server/config directory of the distributed artifact,
alongside the other Tomcat configuration files.
The file sqoop_bootstrap.properties specifies which configuration provider should be used for
loading the configuration for the rest of the Sqoop server. The default value
PropertiesConfigurationProvider should be sufficient.
The second configuration file, sqoop.properties, contains the remaining configuration properties that
can affect the Sqoop server. The file is well documented, so check whether all configuration properties
fit your environment. The defaults, or very little tweaking, should be sufficient in most common cases.

Server Life Cycle


After installation and configuration you can start the Sqoop server with the following command:


./bin/sqoop.sh server start

Similarly, you can stop the server using the following command:

./bin/sqoop.sh server stop

Client installation
The client does not need extra installation or configuration steps. Just copy the Sqoop distribution
artifact to the target machine and unpack it in the desired location. You can start the client with the
following command:

./bin/sqoop.sh client

Debugging information
The logs of the Tomcat server are located under the server/logs directory in the Sqoop2 distribution
directory.
The logs of the Sqoop2 server and of the Derby repository are written to sqoop.log and derbyrepo.log
respectively (by default, unless changed in the configuration described above), under the (LOGS)
directory in the Sqoop2 distribution directory.


Best practices for importing data to Hadoop


The following section describes the options for importing data from an RDBMS into Hadoop HDFS, as well
as into higher-level constructs like Hive and HBase.

Importing data to HDFS


The following command imports all data from a table (for example, a table called ORDERS) in a Teradata
database:
$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>>

import: This is the sub-command that instructs Sqoop to initiate an import.

--connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.

--table <table name>: This parameter specifies the table which will be imported.

The import is done in two steps, as depicted in Figure 2 below. In the first step, Sqoop introspects the
database to gather the necessary metadata for the data being imported.
The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the
actual data transfer using the metadata captured in the previous step.
The imported data is saved in a directory on HDFS based on the table being imported. As is the case with
most aspects of Sqoop operation, the user can specify any alternative directory where the files should
be populated.
By default these files contain comma delimited fields, with new lines separating different records. You
can easily override the format in which data is copied over by explicitly specifying the field separator and
record terminator characters.
Sqoop also supports different data formats for importing data. For example, you can easily import data
in Avro data format by simply specifying the option --as-avrodatafile with the import command.
There are many other options that Sqoop provides which can be used to further tune the import
operation to suit your specific requirements.
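As a sketch of the options mentioned above, the following helper only assembles and prints a Sqoop 1 import command; it does not run Sqoop. The options --target-dir, --fields-terminated-by, --lines-terminated-by and --as-avrodatafile are standard Sqoop 1 import options. The function name import_cmd, the pipe delimiter and the target path are illustrative assumptions, and whether a particular Teradata connector honors every option should be checked against that connector's documentation.

```shell
#!/bin/sh
# Sketch: build (but do not execute) a Sqoop 1 import command line.
# Host, credentials and paths are placeholders, not real values.
import_cmd() {
    table=$1
    target_dir=$2        # HDFS directory that replaces the default location
    format=$3            # "text" or "avro"
    cmd="sqoop import --connect jdbc:teradata://12.13.24.54/ --table $table"
    cmd="$cmd --username <USERNAME> --password <PASSWORD> --target-dir $target_dir"
    if [ "$format" = "avro" ]; then
        # Avro data files carry their own schema, so no delimiters are needed.
        cmd="$cmd --as-avrodatafile"
    else
        # Override the default comma and newline delimiters of text output.
        cmd="$cmd --fields-terminated-by '|' --lines-terminated-by '\n'"
    fi
    printf '%s\n' "$cmd"
}

import_cmd ORDERS /user/stagedata/ORDERS text
import_cmd ORDERS /user/stagedata/ORDERS_avro avro
```

Printing the command first makes it easy to review the options before submitting a job to the cluster.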


Figure 2: Sqoop Import Job

Importing Data into Hive


In most cases, importing data into Hive is the same as running the import task and then using Hive to
create and load a certain table or partition. Doing this manually requires that you know the correct type
mapping between the data and other details like the serialization format and delimiters.
Sqoop takes care of populating the Hive meta-store with the appropriate metadata for the table and
also invokes the necessary commands to load the table or partition as the case may be. All of this is
done by simply specifying the option --hive-import with the import command.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --hive-import


When you run a Hive import, Sqoop converts the data from the native datatypes within the external
datastore into the corresponding types within Hive.
Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported has new
line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.
Once the import is complete, you can see and operate on the table just like any other table in Hive.
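In Sqoop 1, the removal of Hive delimiter characters mentioned above is requested with the --hive-drop-import-delims option, which the Sqoop user guide describes as dropping \n, \r and \01 from string fields. The sketch below does not call Sqoop; it mimics that clean-up on a single sample value with tr so the effect is easy to see. The function name drop_hive_delims and the sample text are hypothetical.

```shell
#!/bin/sh
# Sketch: mimic on one value what --hive-drop-import-delims does to string
# fields, i.e. remove newline, carriage return and \001 (Ctrl-A) characters.
drop_hive_delims() {
    printf '%s' "$1" | tr -d '\n\r\001'
}

# A free-text column value that contains an embedded newline and a Ctrl-A.
raw=$(printf 'line one\nline two\001end')
clean=$(drop_hive_delims "$raw")
printf '%s\n' "$clean"
```

Without this clean-up, the embedded newline would be read by Hive as the start of a new row and the Ctrl-A as a field boundary.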

Importing Data into HBase


You can use Sqoop to populate data in a particular column family within an HBase table. Much like the
Hive import, this is done by specifying additional options that relate to the HBase table and
column family being populated. All data imported into HBase is converted to its string representation
and inserted as UTF-8 bytes.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --hbase-create-table --hbase-table MYTABLE --column-family Teradata
In this command the various options specified are as follows:

--hbase-create-table: This option instructs Sqoop to create the HBase table.


--hbase-table: This option specifies the table name to use.
--column-family: This option specifies the column family name to use.
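The statement that all data is stored in HBase as the UTF-8 bytes of its string representation has a practical consequence: numbers and dates arrive as text, not as binary values. The following sketch, which does not involve Sqoop or HBase, prints the bytes of such a string in hexadecimal. The function name to_hex_bytes is hypothetical.

```shell
#!/bin/sh
# Sketch: show, in hex, the bytes of a value's string representation.
to_hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

to_hex_bytes 42            # prints 3432: the characters '4' and '2'
to_hex_bytes 2013-04-09    # a date is stored as its ten text characters
```

An application reading the HBase cells therefore has to parse the text back into numbers or dates, and byte-wise ordering of such values follows string order rather than numeric order.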

For export (Figure 3), some connectors support staging tables that help isolate production tables from
possible corruption if a job fails for any reason. The map tasks first populate the staging table, and
its contents are merged into the target table once all of the data has been delivered.


Figure 3: Sqoop Export Job


Best practices for exporting data from Hadoop


In some cases data processed by Hadoop pipelines may be needed in production systems to help run
additional critical business functions. Sqoop can be used to export such data into external data stores as
necessary.
Continuing our example from above: if data generated by the pipeline on Hadoop corresponds to the
ORDERS table in a database somewhere, you could populate it using the following command:
$ sqoop export --connect jdbc:teradata://12.13.24.54/ \
    --table ORDERS --username test --password **** \
    --export-dir /user/stagedata/20130201/ORDERS

export: This is the sub-command that instructs Sqoop to initiate an export.


--connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.
--table <table name>: This parameter specifies the table which will be populated.
--export-dir <directory path>: This is the directory from which data will be exported.

Export is done in two steps as depicted in Figure 3. The first step is to introspect the database for
metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into
splits and then uses individual map tasks to push the splits to the database. Each map task performs this
transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
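Because each map task commits its share of the data over many transactions, a failed job can leave the target table partially loaded. Sqoop 1 provides the --staging-table and --clear-staging-table export options for connectors that support staging. The helper below is a sketch that only prints such a command; the function name export_cmd and the staging table name ORDERS_STG are hypothetical, and support for staging in a given Teradata or Teradata Aster connector is an assumption to verify.

```shell
#!/bin/sh
# Sketch: build (but do not execute) a Sqoop 1 export command line.
# Host and credentials are placeholders, not real values.
export_cmd() {
    table=$1
    export_dir=$2        # HDFS directory holding the data to export
    staging=$3           # optional staging table; leave empty to skip staging
    cmd="sqoop export --connect jdbc:teradata://12.13.24.54/ --table $table"
    cmd="$cmd --username <USERNAME> --password <PASSWORD> --export-dir $export_dir"
    if [ -n "$staging" ]; then
        # Load into the staging table first and empty it before the job starts.
        cmd="$cmd --staging-table $staging --clear-staging-table"
    fi
    printf '%s\n' "$cmd"
}

export_cmd ORDERS /user/stagedata/20130201/ORDERS ORDERS_STG
```

The staging table must already exist and have the same structure as the target table.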


Best practices NoSQL database


Using specialized connectors, Sqoop can connect with external systems that have optimized import and
export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop's
extension framework and can be added to any existing Sqoop installation. Once a connector is installed,
Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the
connector.
By default Sqoop includes connectors for various popular databases such as Teradata, Teradata Aster,
MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and
PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch
tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be
used to connect to any database that is accessible via JDBC.
Apart from the built-in connectors, many companies have developed their own connectors that can be
plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to
NoSQL datastores.
Sqoop2 can transfer large datasets between Hadoop and external datastores such as relational
databases. Beyond this, Sqoop offers many advanced features such as different data formats,
compression, and working with queries instead of tables.


Best practices operational


Operational Do's

- If you need to move big data, make it small first, and then move the small data.
- Prepare the data model in advance to ensure that queries touch the least amount of data.
- Always create an empty export table.
- Do use the --escaped-by option during import and --input-escaped-by during export.
- Do use --fields-terminated-by during import and --input-fields-terminated-by during export.
- Do specify the direct mode option (--direct) if you use the direct connector.
- Develop some kind of incremental import when sqoop-ing in large tables.
  o If you do not, your Sqoop jobs will take longer and longer as the data grows.
- Compress data in HDFS.
  o You will save space on HDFS, as your replication factor makes multiple copies of your data.
  o You will benefit in processing, as your Map/Reduce jobs have less data to read and Hadoop
    becomes less I/O bound.
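The incremental import and compression recommendations can be combined in one command. Sqoop 1 supports --incremental append together with --check-column and --last-value, and --compress for compressed output. The sketch below only prints the command and keeps the last imported key in a small state file; the file name, the ORDER_ID column and the function name incremental_import_cmd are hypothetical, and support for incremental mode in a specific Teradata connector is an assumption to verify.

```shell
#!/bin/sh
# Sketch: build an incremental, compressed Sqoop 1 import command.
# STATE_FILE holds the highest key value imported by the previous run.
STATE_FILE=${STATE_FILE:-./orders.last_value}

incremental_import_cmd() {
    table=$1
    check_column=$2      # monotonically increasing key column
    last_value=0         # first run: import everything above 0
    if [ -f "$STATE_FILE" ]; then
        last_value=$(cat "$STATE_FILE")
    fi
    cmd="sqoop import --connect jdbc:teradata://12.13.24.54/ --table $table"
    cmd="$cmd --username <USERNAME> --password <PASSWORD>"
    cmd="$cmd --incremental append --check-column $check_column"
    cmd="$cmd --last-value $last_value --compress"
    printf '%s\n' "$cmd"
}

incremental_import_cmd ORDERS ORDER_ID
```

After a successful run, the new maximum key would be written back to the state file so that the next run starts from it. Sqoop 1 saved jobs (the sqoop job tool) can track the last value automatically and may be preferable where available.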

Operational Don'ts

- Don't use the same table for both import and export.
- Don't specify the query if you use the direct connector.
- Don't have too many partitions of the same file stored in HDFS.
  o This translates into time-consuming map tasks; use partitioning if possible.
  o 1,000 partitions will perform better than 10,000 partitions.


Technical implementation of Sqoop JDBC


The following section describes how data is transferred using the JDBC connection, including the
technical implementation of data pipes in and out of Teradata as well as HDFS.
To be updated once we have this information.


Sqoop sample use case


The following section describes use cases and examples of how to transfer data. To be updated
once we have this information.

Exporting data to hdfs


Exporting entire table to hdfs
Exporting table to hive using SQL statement
Exporting table to hive using SQL join statement
Exporting entire table to Hbase

Importing data from hdfs


Importing entire table from hdfs to Teradata
Importing entire table from hdfs to Teradata ASTER
Importing table to hive using SQL statement to Teradata
Importing table to hive using SQL join statement Teradata ASTER


Sqoop informational links


Subject Area                            Links to Sqoop Project
Sqoop2 download                         Download
Sqoop2 documentation                    Documentation
API documentation                       Sqoop2 API documentation
Sqoop troubleshooting guide             Sqoop Troubleshooting Tips
Teradata Sqoop connector                Teradata Sqoop Connector
Teradata Aster Sqoop connector          Teradata Aster Sqoop connector
Frequently asked questions              FAQ
Sqoop2 project status                   Sqoop2 Project Status
Sqoop2 command line interface details   Command Line Client
Issues related to Sqoop                 Issue Tracker (JIRA)


Summary
Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details by having
a web-application run Sqoop, which allows Sqoop to be installed once and used from anywhere.
In addition, having a REST API for operation and management will help Sqoop integrate better with
external systems such as Oozie.
Also, introducing a reduce phase allows connectors to focus only on connectivity and ensures that
Sqoop functionality is uniformly available for all connectors. This facilitates ease of development of
connectors.
