
SQOOP

Instructor: Oussama Derbel


Introduction

■ What is Hadoop?

– Apache Hadoop is an open-source software framework used to develop data processing applications that run in a distributed computing environment.

– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.

■ Commodity computers are cheap and widely available; they make it possible to achieve greater computational power at low cost.


Introduction

■ Core of Hadoop

– HDFS (Hadoop Distributed File System): the storage part
– MapReduce: the processing part


Introduction

■ Apache Hadoop consists of two sub-projects

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs can process enormous amounts of data in parallel on large clusters of computation nodes.

2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.


Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
Introduction

How do we import/export data between an RDBMS and HDFS?


What is SQOOP?

■ SQOOP stands for "SQL to Hadoop".

■ Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

■ It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import (a minimal incremental-import sketch follows this list).

■ Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server/DB2, and vice versa.
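
For the incremental loads mentioned above, Sqoop offers the --incremental, --check-column, and --last-value options. A minimal sketch (the connection string, credentials, and last value are illustrative, reusing the retail_db examples that appear later in this deck):

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-incremental \
--incremental append \
--check-column customer_id \
--last-value 1000

Only rows whose customer_id is greater than the given --last-value are imported; at the end of the run Sqoop reports the new last value to use on the next run.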


What is SQOOP?

■ SQOOP location between RDBMS and Hadoop


How SQOOP Works

■ Step 1: Sqoop sends a request to the relational database, which returns the metadata for the table (metadata here means the data describing the table in the relational DB).

■ Step 2: From the returned metadata, Sqoop generates Java classes (this is why Java must be configured before Sqoop can work; internally, Sqoop uses the JDBC API to access the data).

■ Step 3: Sqoop (which is itself written in Java) compiles the generated classes and packages them into a JAR file (the standard Java packaging format), which is then used to work with the table's records.
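
The class-generation step above can also be run on its own with Sqoop's codegen tool, which is useful for inspecting the Java class Sqoop produces for a table. A minimal sketch (connection details are the same illustrative ones used in the import examples later in this deck):

sqoop codegen \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers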
SQOOP CONNECTORS

■ MySQL
■ Netezza
■ Oracle JDBC
■ PostgreSQL
■ Teradata
■ SQL Server R2
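
Before running imports through one of these connectors, connectivity can be checked with Sqoop's list-databases and list-tables tools. A minimal sketch against a local MySQL instance (host, database, and credentials are illustrative):

sqoop list-databases \
--connect jdbc:mysql://localhost \
--username root --password cloudera

sqoop list-tables \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera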
SQOOP Import/Export Data

[Diagram] Sqoop sits between the Hadoop cluster and the structured data servers (MySQL, Oracle, SQL Server). In the cluster, the Name Node stores meta information and the Data Nodes store the data; Sqoop imports data from the RDBMS into HDFS and exports data from HDFS back to the RDBMS.

SQOOP Import

[Diagram] Sqoop import: structured data is read from the MySQL server and written into HDFS, where the Name Node stores meta information and the Data Nodes store the imported data.

What Sqoop Does

• Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:

• Allows data imports from external datastores and enterprise data warehouses into Hadoop

• Parallelizes data transfer for fast performance and optimal system utilization

• Copies data quickly from external systems to Hadoop

• Makes data analysis more efficient

• Mitigates excessive loads on external systems
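
Because the transfer works in both directions, exporting back to the RDBMS mirrors the import commands shown next. A minimal sketch (it assumes a customers_export table already exists in MySQL and that data sits under the HDFS path shown; both names are illustrative):

sqoop export \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers_export \
--export-dir /user/cloudera/customers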


SQOOP Import data

Simple Sqoop import command:

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers
SQOOP Import data

Specifying mappers in the Sqoop import command:

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
-m 2
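
Each mapper writes its own output file, so the command above produces two part files. A quick way to confirm this (assuming the import ran as the cloudera user, so the default destination is /user/cloudera/customers):

hdfs dfs -ls /user/cloudera/customers
# expect two files, part-m-00000 and part-m-00001, one per mapper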
Managing destination directory

Defining the warehouse directory

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--warehouse-dir /user/cloudera/new-warehouse
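
With --warehouse-dir, Sqoop creates a subdirectory named after the table under the given path, so the files from the command above land in /user/cloudera/new-warehouse/customers. A quick check (assuming the import completed):

hdfs dfs -ls /user/cloudera/new-warehouse/customers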
Managing destination directory

Defining the target directory

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-new
Managing destination directory

Deleting the target directory if it already exists

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-new \
--delete-target-dir
Working With File Format

Importing as Avro files

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-avro \
--as-avrodatafile
Working With File Format

Importing as Parquet files

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-parquet \
--as-parquetfile
Working With File Format

Importing as SequenceFiles

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-sequence \
--as-sequencefile
Conditional/Selective Imports

Conditional imports

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-name-m \
--where "customer_fname='Mary'"
Conditional/Selective Imports

Selective column imports

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-selected \
--columns "customer_fname,customer_lname,customer_city"
Conditional/Selective Imports

Using a free-form query

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--target-dir /user/cloudera/customer-queries \
--query 'SELECT * FROM customers WHERE customer_id > 100 AND $CONDITIONS' \
--split-by customer_id
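
A note on $CONDITIONS: Sqoop substitutes a range predicate for this token in each mapper's copy of the query, which is why --split-by is required when a free-form query runs with more than one mapper. As a sketch (the target directory name is illustrative), the split column can be dropped if the import runs with a single mapper:

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--target-dir /user/cloudera/customer-queries-single \
--query 'SELECT * FROM customers WHERE customer_id > 100 AND $CONDITIONS' \
-m 1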
Thank you
