
SQOOP

Instructor: Oussama Derbel


Introduction

■ What is Hadoop?

– Apache Hadoop is an open-source software framework used to develop data processing applications that run in a distributed computing environment.

– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.

■ Commodity computers are cheap and widely available; they make it possible to achieve greater computational power at low cost.


Introduction

■ Core of Hadoop

– HDFS (Hadoop Distributed File System): the storage part
– MapReduce: the processing part


Introduction

■ Apache Hadoop consists of two sub-projects

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs can process enormous amounts of data in parallel on large clusters of computation nodes.

2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.


Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
Introduction

How do we import/export data between an RDBMS and HDFS?


What is SQOOP?

■ SQOOP stands for "SQL to Hadoop".

■ Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

■ It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import (a minimal incremental-import sketch follows this list).

■ Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server/DB2, and vice versa.
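
For the incremental loads mentioned above, Sqoop offers the --incremental, --check-column, and --last-value options. A minimal sketch (the connection string, credentials, and last value are illustrative, reusing the retail_db examples that appear later in this deck):

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-incremental \
--incremental append \
--check-column customer_id \
--last-value 1000

Only rows whose customer_id is greater than the given --last-value are imported; at the end of the run Sqoop reports the new last value to use on the next run.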


What is SQOOP?

■ SQOOP location between RDBMS and Hadoop


How SQOOP Works

■ Step 1: Sqoop sends a request to the relational database, which returns the metadata for the table (metadata here means the data describing the table in the relational DB).

■ Step 2: From the returned metadata, Sqoop generates Java classes (this is why Java must be configured before Sqoop can work; internally, Sqoop uses the JDBC API to access the data).

■ Step 3: Sqoop (which is itself written in Java) compiles the generated classes and packages them into a JAR file (the standard Java packaging format), which is then used to work with the table's records.
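
The class-generation step above can also be run on its own with Sqoop's codegen tool, which is useful for inspecting the Java class Sqoop produces for a table. A minimal sketch (connection details are the same illustrative ones used in the import examples later in this deck):

sqoop codegen \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers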
SQOOP CONNECTORS

■ MySQL
■ Netezza
■ Oracle JDBC
■ PostgreSQL
■ Teradata
■ SQL Server R2
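
Before running imports through one of these connectors, connectivity can be checked with Sqoop's list-databases and list-tables tools. A minimal sketch against a local MySQL instance (host, database, and credentials are illustrative):

sqoop list-databases \
--connect jdbc:mysql://localhost \
--username root --password cloudera

sqoop list-tables \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera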
SQOOP Import/Export Data

[Diagram] Sqoop sits between the Hadoop cluster and the structured data servers (MySQL, Oracle, SQL Server). In the cluster, the Name Node stores meta information and the Data Nodes store the data; Sqoop imports data from the RDBMS into HDFS and exports data from HDFS back to the RDBMS.

SQOOP Import

[Diagram] Sqoop import: structured data is read from the MySQL server and written into HDFS, where the Name Node stores meta information and the Data Nodes store the imported data.

What Sqoop Does

• Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:

• Allows data imports from external datastores and enterprise data warehouses into Hadoop

• Parallelizes data transfer for fast performance and optimal system utilization

• Copies data quickly from external systems to Hadoop

• Makes data analysis more efficient

• Mitigates excessive loads on external systems
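
Because the transfer works in both directions, exporting back to the RDBMS mirrors the import commands shown next. A minimal sketch (it assumes a customers_export table already exists in MySQL and that data sits under the HDFS path shown; both names are illustrative):

sqoop export \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers_export \
--export-dir /user/cloudera/customers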


SQOOP Import data

Simple Sqoop import command:

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers
SQOOP Import data

Specifying mappers in the Sqoop import command:

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
-m 2
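
Each mapper writes its own output file, so the command above produces two part files. A quick way to confirm this (assuming the import ran as the cloudera user, so the default destination is /user/cloudera/customers):

hdfs dfs -ls /user/cloudera/customers
# expect two files, part-m-00000 and part-m-00001, one per mapper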
Managing destination directory

Defining the warehouse directory

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--warehouse-dir /user/cloudera/new-warehouse
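
With --warehouse-dir, Sqoop creates a subdirectory named after the table under the given path, so the files from the command above land in /user/cloudera/new-warehouse/customers. A quick check (assuming the import completed):

hdfs dfs -ls /user/cloudera/new-warehouse/customers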
Managing destination directory

Defining the target directory

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-new
Managing destination directory

Deleting the target directory if it already exists

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-new \
--delete-target-dir
Working With File Format

Importing as Avro files

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-avro \
--as-avrodatafile
Working With File Format

Importing as Parquet files

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-parquet \
--as-parquetfile
Working With File Format

Importing as SequenceFiles

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-sequence \
--as-sequencefile
Conditional/Selective Imports

Conditional imports

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-name-m \
--where "customer_fname='Mary'"
Conditional/Selective Imports

Selective column imports

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /user/cloudera/customer-selected \
--columns "customer_fname,customer_lname,customer_city"
Conditional/Selective Imports

Using a free-form query

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--target-dir /user/cloudera/customer-queries \
--query 'SELECT * FROM customers WHERE customer_id > 100 AND $CONDITIONS' \
--split-by customer_id
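
A note on $CONDITIONS: Sqoop substitutes a range predicate for this token in each mapper's copy of the query, which is why --split-by is required when a free-form query runs with more than one mapper. As a sketch (the target directory name is illustrative), the split column can be dropped if the import runs with a single mapper:

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root --password cloudera \
--target-dir /user/cloudera/customer-queries-single \
--query 'SELECT * FROM customers WHERE customer_id > 100 AND $CONDITIONS' \
-m 1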
Thank you
