Sqoop
The class will start with a recap of the previous class through Q&A. (30)
Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
It is used to import data from relational databases such as MySQL, Oracle, Teradata, and
Netezza into Hadoop (HDFS, Hive, HBase), and to export data from the Hadoop file system to relational
databases.
Sqoop Architecture
Every command starts with sqoop (we have to type sqoop each time).
Some points
Metastore: Sqoop has a metastore, which can be configured.
Import file type: text file and binary (sequence) file.
By default, Sqoop imports from MySQL to HDFS.
Mappers: we can configure the number of mappers in the sqoop import command.
Target directory: we can also mention a target directory, i.e. where we want to import the data.
Since MySQL is the default database for Sqoop, we can create a table in MySQL for importing data.
Sqoop is open-source software from the Apache Software Foundation and is written in Java.
Commands
Common commands, mostly discussed so far, like:
CREATE DATABASE <database name>; # to create a database
USE <database>; # to move under the intended database
SELECT DATABASE(); # to show the database we are currently in
CREATE TABLE <table name> (<columns>); # to create a table
While creating the table schema we have to use VARCHAR instead of STRING (variable character).
Variable-character types again may be of different lengths, like tiny/small (3/5).
We can limit the length of a text/variable-character field: VARCHAR(20) allows 20 characters only.
We can restrict (constraint) a field so that it must contain data, with NOT NULL (NN).
We can set a column as the primary key by adding PRIMARY KEY (PK).
And we also have AUTO_INCREMENT (AI).
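Putting these points together, a table definition in MySQL might look like this (a sketch; the table and column names are made up for illustration):

```sql
-- hypothetical employee table showing the constraints above
CREATE TABLE employee (
  emp_id   INT NOT NULL AUTO_INCREMENT, -- NN + AI
  emp_name VARCHAR(20) NOT NULL,        -- limited to 20 characters, must contain data
  emp_city VARCHAR(20),                 -- may be NULL
  PRIMARY KEY (emp_id)                  -- PK
);
```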
Difference between the hadoop fs and hdfs dfs commands: the “hadoop fs -ls” command and
the “hdfs dfs -ls” command show the same output, but hadoop fs can address any file system
Hadoop supports (local, HDFS, etc.), while hdfs dfs is specific to HDFS.
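For example (the /user/cloudera path is just an assumed home directory):

```shell
hadoop fs -ls /user/cloudera   # generic file-system shell; works with any FS Hadoop supports
hdfs dfs -ls /user/cloudera    # HDFS-only shell; same listing when the default FS is HDFS
```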
Command for sqoop import
$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--m 1 \
--target-dir /queryresult
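Once the import finishes, the result can be checked from HDFS (assuming the /queryresult target directory used above; with one mapper there is a single part file):

```shell
hdfs dfs -ls /queryresult                 # lists the files produced by the import
hdfs dfs -cat /queryresult/part-m-00000   # prints the imported rows as text
```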
Command for sqoop export
$ sqoop export \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--export-dir <dir>
Command to create a saved sqoop job (note the space between -- and import):
sqoop job \
--create myjob \
-- import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table employee \
--m 1
To run the saved job:
sqoop job \
--exec myjob
Similarly, we can view the list of jobs and delete a specific job with the following commands:
sqoop job \
--list
sqoop job \
--delete myjob
Incremental append (to import newly created rows and append them after the previously imported data)
$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--m 1 \
--target-dir /tmp/sqoop/cloudera/t89 \
--incremental append \
--check-column emp_id \
--last-value 50
We have to remember or note the last three lines of the previous run's output (the incremental mode, check column, and last value) to build the next command.
Limitation: incremental append relies on an incremental id column, which Sqoop requires to split the data for mapping.
The above limitation can be overcome by adding one column for time stamping.
Alternatively, during import we can select -m 1 as the mapper count so that the whole file is imported in one
part.
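The timestamp approach uses Sqoop's lastmodified incremental mode instead of append; a sketch, assuming the table has an updated_at timestamp column (the column name and last value here are made up):

```shell
$ sqoop import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--m 1 \
--target-dir /tmp/sqoop/cloudera/t89 \
--incremental lastmodified \
--check-column updated_at \
--last-value "2020-01-01 00:00:00" \
--merge-key emp_id
```

With lastmodified mode, Sqoop re-imports rows whose timestamp is newer than the last value and uses the merge key to replace updated rows instead of duplicating them.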
If we want to create a sqoop job with the above command, we shall put --last-value 0, so every
time the job runs, Sqoop will remember the last value imported and act accordingly.
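Putting the two ideas together, a saved incremental job might look like this (a sketch following the commands above; after each run Sqoop stores the updated last value in its metastore):

```shell
sqoop job \
--create incjob \
-- import \
--connect jdbc:mysql://localhost:3306/<db> \
--username root \
--password cloudera \
--table emp_add \
--m 1 \
--target-dir /tmp/sqoop/cloudera/t89 \
--incremental append \
--check-column emp_id \
--last-value 0
```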
https://www.tutorialspoint.com/sqoop/index.htm
https://www.youtube.com/watch?v=r1NLCComQ9Q
https://www.youtube.com/watch?v=2iwas0ONLA0