■ What is Hadoop?
– Apache Hadoop is an open-source software framework used to develop data processing applications.
– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.
■ Commodity computers are cheap and widely available; they are mainly useful for achieving greater computational power at low cost.
■ Core of Hadoop
– HDFS (Hadoop Distributed File System): the storage part, which distributes data across the cluster's nodes.
– MapReduce: the processing part, which distributes work across the cluster's computation nodes.
■ Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
■ It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import.
■ Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server.
■ Step 1: Sqoop sends a request to the relational database to return metadata information about the table (metadata here is the data about the table in the relational DB).
■ Step 2: From the received metadata it generates Java classes (this is why Java must be configured before Sqoop can work; Sqoop internally uses the JDBC API to generate the data-access code).
■ Step 3: Sqoop (being written in Java) compiles the generated classes so it can reproduce the table structure, then packages the compiled classes into a jar file (the Java packaging standard).
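The metadata lookup in Step 1 can be illustrated with a plain database query. The sketch below uses Python's built-in sqlite3 module as a stand-in for the relational database; it is only an analogy to the metadata call Sqoop makes over JDBC, and the `customers` columns are taken from the examples later in this deck.

```python
import sqlite3

# In-memory stand-in for the relational database Sqoop would connect to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_fname TEXT)")

# Step 1 analogue: ask the database to describe the table (column names),
# much as Sqoop retrieves table metadata through the JDBC API.
cursor = conn.execute("SELECT * FROM customers LIMIT 0")
columns = [desc[0] for desc in cursor.description]
print(columns)  # ['customer_id', 'customer_fname']
```

Sqoop then uses exactly this kind of column information to generate the Java record classes in Step 2.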
SQOOP CONNECTORS
■ MySQL
■ Netezza
■ Oracle JDBC
■ PostgreSQL
■ Teradata
■ SQL Server R2
SQOOP Import/Export Data
[Diagram: structured data in a MySQL server is imported, via the Name Node, onto HDFS Data Nodes that store the data.]
• Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores
• Allows data imports from external datastores and enterprise data warehouses into Hadoop
• Parallelizes data transfer for fast performance and optimal system utilization
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers
SQOOP Import data
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
-m 2
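With `-m 2`, Sqoop runs two parallel map tasks, each importing one slice of the table: it takes the minimum and maximum of the split column (the primary key by default) and divides that range among the mappers. The sketch below is a simplified illustration of that range-splitting idea, not Sqoop's actual implementation.

```python
def compute_splits(lo, hi, num_mappers):
    """Divide the [lo, hi] key range into one slice per mapper,
    roughly how Sqoop partitions a numeric split column."""
    size = (hi - lo + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        start = lo + round(i * size)
        end = lo + round((i + 1) * size) - 1
        splits.append((start, end))
    splits[-1] = (splits[-1][0], hi)  # last slice absorbs any remainder
    return splits

# customer_id values ranging from 1 to 100, imported by 2 mappers:
print(compute_splits(1, 100, 2))  # [(1, 50), (51, 100)]
```

Each mapper then issues its own range-bounded SELECT, so the two imports proceed in parallel without overlapping rows.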
Managing destination directory
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--warehouse-dir /user/cloudera/new-warehouse
Managing destination directory
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-new
Managing destination directory
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-new \
--delete-target-dir
Working With File Format
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-avro \
--as-avrodatafile
Working With File Format
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-parquet \
--as-parquetfile
Working With File Format
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-sequence \
--as-sequencefile
Conditional/Selective Imports
Conditional Imports
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-name-m \
--where "customer_fname='Mary'"
Conditional/Selective Imports
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--table customers \
--target-dir /user/cloudera/customer-selected \
--columns "customer_fname,customer_lname,customer_city"
Conditional/Selective Imports
Using query
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--target-dir /user/cloudera/customer-queries \
--query "select * from customers where \$CONDITIONS" \
--split-by "customer_id"
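The `$CONDITIONS` placeholder is where Sqoop injects each mapper's slice of the `--split-by` range, so every mapper runs the same query over a disjoint key range. A small sketch of that substitution (the exact predicates Sqoop generates may differ in form, and the query text here is illustrative):

```python
# Hypothetical free-form query using Sqoop's $CONDITIONS placeholder.
query = "select * from customers where $CONDITIONS"

# Per-mapper slices of the --split-by column, as in a 2-mapper import.
ranges = [(1, 50), (51, 100)]

# Each mapper substitutes its own range predicate for $CONDITIONS,
# so the mappers read disjoint slices of the same result set.
mapper_queries = [
    query.replace("$CONDITIONS", f"customer_id >= {lo} AND customer_id <= {hi}")
    for lo, hi in ranges
]
for q in mapper_queries:
    print(q)
```

This is why `--split-by` is required with a free-form `--query` when more than one mapper is used: without a split column, Sqoop cannot form the per-mapper conditions.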
Thank you