
LESSON-16

Sqoop
Class will start with a refresher of the previous class through Q&A. (30 min)

Today's topics: (Sqoop)

1. Database / dataset / data warehouse (10)
2. Define Sqoop, architecture, default mapper/reducer/database (20)
3. All about Sqoop commands (60)
4. How Sqoop works (in detail) (30)
5. Sqoop features (20)
6. Kerberos (10)

Database / dataset / data warehouse

A database is a collection of data (usually from a single source) stored in a format which can be easily accessed,
like sakila in our practice.
In order to manage our databases we use a software application called a DBMS (Database Management
System), like MySQL or HBase.
We connect to the DBMS and run queries to modify or perform any other action over our data. The DBMS executes
our command and gets back to us with the result.
There are two types of DBMS: relational and non-relational (also called NoSQL). For example, MySQL is an RDBMS
and HBase is a NoSQL DBMS.
In an RDBMS, tables are linked with one another through relationships, like Customers, Products, and Orders. (They
cover related subjects and are linked through a common or similar column.)
We use Structured Query Language (SQL) for storing, manipulating, and retrieving data in an RDBMS.
NoSQL (non-relational) databases do not understand SQL commands.
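
For example, a couple of simple SQL statements against the sakila practice database (a minimal sketch, assuming the standard sakila schema):

USE sakila;                          -- switch to the sakila database
SELECT title, rental_rate            -- retrieve data
FROM film
WHERE rental_rate > 2.99
ORDER BY title
LIMIT 5;
UPDATE film SET rental_rate = 3.99   -- manipulate data
WHERE film_id = 1;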
A dataset is a collection of numbers or values that relate to a particular subject. For example, the test scores
of each student in a particular class form a dataset. A table can be termed a structured dataset when it
represents similar data on a particular subject.
A data warehouse (DW) is a collection of corporate information and data derived from operational systems and
external data sources (therefore from multiple sources). A data warehouse is designed to support data analysis
and reporting.

Sqoop, architecture, default mapper and default database

What is Sqoop (SQL-to-HADOOP)

Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
It is used to import data from relational databases such as MySQL, Oracle, Teradata, and
Netezza into Hadoop (HDFS, Hive, HBase), and to export from the Hadoop file system back to
relational databases.
Sqoop Architecture

Default database, mapper and reducer

MySQL is the default database for sqoop. There are 4 mappers by
default, and there is no reducer in sqoop. Sqoop imports data in
parallel and saves it in parts, so a reduce action is not required.
Since sqoop imports the data in parallel, the import takes less time.
All about sqoop commands
Tools/commands
import            # import an individual table from an RDBMS to HDFS
import-all-tables # import all tables from a database
export            # export data from HDFS back to an RDBMS
eval              # evaluate an SQL query and show the result
--options-file    # allows writing the sqoop command/script in an external file
Miscellaneous tools/commands
list-tables       # list out all the tables in a database
list-databases    # list out all the databases on the server
(Usage examples for these tools follow below.)
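
A few usage sketches for these tools (the connection string, username, and password follow the examples used later in this lesson; <db> and the options-file path are placeholders):

$ sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root \
--password cloudera

$ sqoop list-tables \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera

$ sqoop eval \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--query "SELECT COUNT(*) FROM emp_add"

# The same arguments can be kept in a file and passed with --options-file:
$ sqoop --options-file /path/to/import-opts.txt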
About syntax
Generic args (common to all tools, e.g. the connection settings)
Import args (specific to the import tool)
Tools/commands generally carry no prefix, and a sqoop command ends with no symbol.
Subcommands/long arguments are prefixed with --
Single-letter options (like -m) are prefixed with -
Line break (continuation): \
Generalized syntax: sqoop import (generic-args) (import-args)

Every command starts with sqoop (we have to type sqoop every time).
Some points
Metastore: sqoop has a metastore which can be configured.
Import file type: text file or binary (sequence) file.
Default path: by default, the import goes from MySQL to HDFS.
Mappers: we can configure the number of mappers in the sqoop import command.
Target directory: we can also mention a target directory, i.e. where we want the data imported.
Since MySQL is the default database for sqoop, we can create a table in MySQL for importing data.
Sqoop is open-source software from the Apache Software Foundation, written in Java.
(A short sketch combining these options follows.)
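
A minimal sketch combining these points; --as-sequencefile (binary import instead of the default text), --num-mappers, and --target-dir are standard sqoop import arguments, while the table and directory names are placeholders:

$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--as-sequencefile \
--num-mappers 2 \
--target-dir /tmp/sqoop/emp_add_seq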

Commands
Common commands, mostly discussed so far, like:
CREATE DATABASE <database name>;  # to create a database
USE <database>;                   # to be under the intended database
SELECT DATABASE();                # show the database where we are
CREATE TABLE <table name> (...);  # to create a table
While creating the table schema we have to use VARCHAR instead of string (variable character).
Variable characters may again be declared with different lengths, like VARCHAR(3) or VARCHAR(5).
We can limit the length of a text/variable character field to, say, 20 characters with VARCHAR(20).
We can restrict (constrain) a field so it must contain data with NOT NULL (NN).
We can set a column as the primary key by adding PRIMARY KEY (PK).
And we also have AUTO_INCREMENT (AI). (A complete CREATE TABLE example follows.)
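
Putting these together, a minimal MySQL sketch (the column names are hypothetical, chosen to match the emp_add table used in the import examples below):

CREATE TABLE emp_add (
  emp_id INT NOT NULL AUTO_INCREMENT,  -- NN and AI
  street VARCHAR(20) NOT NULL,         -- text limited to 20 characters
  city   VARCHAR(20),
  PRIMARY KEY (emp_id)                 -- PK
);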
Difference between the hadoop fs and hdfs dfs commands: "hadoop fs" works with any file system
Hadoop supports, while "hdfs dfs" works only with HDFS; that is why the "hadoop fs -ls" command and
the "hdfs dfs -ls" command show the same output when the underlying file system is HDFS.
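
Both forms can be compared side by side (the path is a placeholder):

$ hadoop fs -ls /user/cloudera
$ hdfs dfs -ls /user/cloudera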

Command for sqoop import

$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
-m 1 \
--target-dir /queryresult
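
After the import completes, the result can be verified from HDFS (a usage sketch; the path matches the --target-dir above, and part-m-00000 is the standard name of the single output part produced with -m 1):

$ hdfs dfs -ls /queryresult
$ hdfs dfs -cat /queryresult/part-m-00000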

Command for sqoop export

$ sqoop export \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--export-dir <dir>

Note: the target table must already exist in the target database, and the export directory must contain the data to be exported.
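
A minimal preparation sketch (the mysql client call and the emp_add_export table are illustrative; the schema follows the hypothetical CREATE TABLE example above):

$ mysql -u root -pcloudera <db> -e "CREATE TABLE IF NOT EXISTS emp_add_export LIKE emp_add;"

$ sqoop export \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add_export \
--export-dir /queryresult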

Command for Sqoop job


We can create a "sqoop job" and execute it later (define the command once, and execute it whenever required).
For example, the command below creates "myjob" for import (note the required space between -- and the tool name):

sqoop job \
--create myjob \
-- import \
--connect jdbc:mysql://localhost:3306/<db name> \
--username root \
--password cloudera \
--table employee \
-m 1

And the command below executes it whenever required:

sqoop job \
--exec myjob

Similarly, we can view the list of jobs and delete a specific job with the following commands:

sqoop job \
--list

sqoop job \
--delete myjob

Incremental append (to import newly created data and append it to the end of the previously imported data)

$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
-m 1 \
--target-dir /tmp/sqoop/cloudera/t89 \
--incremental append \
--check-column emp_id \
--last-value 50

 We have to remember or note the last 3 lines from the previous output (--incremental, --check-column, --last-value) to build the next command.
 Limitation: an incremental (numeric) id column is needed, which sqoop requires in order to split the data for mapping.
 The above limitation can be overcome by adding one column for time stamping.
 Or, during import, we can select -m 1 as the mapper count so the whole table is imported in one
part.
 If we want to create a sqoop job with the above command, we shall put --last-value 0, so every
time sqoop will remember the last value and act accordingly (a sketch follows).
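
A minimal sketch of such a saved incremental job (the job name incjob is hypothetical; note again the space between -- and import):

sqoop job \
--create incjob \
-- import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
-m 1 \
--target-dir /tmp/sqoop/cloudera/t89 \
--incremental append \
--check-column emp_id \
--last-value 0

sqoop job --exec incjob   # each run imports only the rows added since the stored last value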

How sqoop works

Step-by-step explanation (a link on this topic is also provided below).


1. Sqoop gets the metadata from the RDBMS (mainly the schema).
2. Based on the metadata information, sqoop internally creates a .java file.
3. It generates a .class file from the .java file (compiling the Java file).
4. By default, the names of these auto-generated files are taken from the metadata of the table.
5. Finally, sqoop creates a .jar file by packaging the class files (an archive/zip package).
 Sqoop finds out the primary key column (internally).
 Sqoop runs an SQL command to fetch the data of the table which is to be imported.
 Sqoop internally applies the MIN and MAX functions (SQL commands) over the PK column to get its
range, in order to provide split information to the mappers (see the sketch after this list).
 This is what helps map the data across the 4 default mappers.
6. And finally the import action is started by sqoop, storing the data in parallel; no reduce action
takes place.
7. The .jar file is saved on the file system, and its location can be checked. (ref link)
8. The process from 1 to 5 is called codegen, which "generates code to interact with database
records."
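
The splitting step can be illustrated with the kind of SQL sqoop issues internally (a simplified sketch; suppose the PK range turns out to be 1 to 100 and 4 mappers are used):

SELECT MIN(emp_id), MAX(emp_id) FROM emp_add;   -- bounding values query, run once

-- each of the 4 mappers then fetches roughly one quarter of the range:
SELECT * FROM emp_add WHERE emp_id >= 1  AND emp_id < 26;
SELECT * FROM emp_add WHERE emp_id >= 26 AND emp_id < 51;
SELECT * FROM emp_add WHERE emp_id >= 51 AND emp_id < 76;
SELECT * FROM emp_add WHERE emp_id >= 76 AND emp_id <= 100;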
Sqoop features
 Full load (can load a full table with a single command)
 Incremental load (can transfer only the updated or newly added data in the table)
 Parallel import and export (uses YARN and MapReduce, and maintains fault tolerance)
 Compression (can compress data)
 Kerberos security integration (uses a secured protocol)
 Data loading directly into Hive (the target can be set in the command; a sketch follows)
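
A minimal sketch of a compressed import directly into Hive; --hive-import, --hive-table, and --compress are standard sqoop arguments, and the table names are placeholders:

$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--hive-import \
--hive-table emp_add_hive \
--compress \
-m 1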
Kerberos
Kerberos (the name originates from Cerberus, the three-headed dog that guards the gate in Greek mythology)
is an authentication protocol which uses "tickets" (secret keys between client and server) to
allow nodes to identify genuine clients.
The different components in the architecture:
1. KDC (Key Distribution Center)
a) Authentication Service (AS), which issues the Ticket Granting Ticket (TGT)
b) Ticket Granting Service (TGS)
2. Server (containing the data)
3. Client (or the user requesting data access)

Sequential steps in short

1. The client requests an authentication ticket (TGT) from the KDC (Authentication Service).
2. The KDC verifies the credentials and sends back an encrypted TGT.
3. The client sends the TGT to the TGS along with the Service Principal Name (SPN) of the service
the client wants to access.
4. The KDC verifies the TGT of the user and sends a valid session key for the service to the
client.
5. The client forwards the session key to the server and gets access.
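
On a Kerberized Hadoop cluster, the practical effect is that a ticket must be obtained before running sqoop (a minimal sketch; the principal user@EXAMPLE.COM is a placeholder):

$ kinit user@EXAMPLE.COM   # request a TGT from the KDC (prompts for the password)
$ klist                    # list the cached tickets to confirm
$ sqoop import ...         # sqoop can now authenticate via the cached ticket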
Links

How sqoop works
https://www.youtube.com/watch?v=8NzcZzCrOcU

Short course by Tutorials Point
https://www.tutorialspoint.com/sqoop/index.htm

Sqoop import (different options) and export
https://www.youtube.com/watch?v=r1NLCComQ9Q

Incremental append by creating a job
https://www.youtube.com/watch?v=2iwas0ONLA0
