PIG + SQOOP
PIG Latin Execution
PIG Modes
• Local
• MapReduce (default)
• Tez_local
• Tez (on Hortonworks clusters)
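As a sketch, the execution mode is chosen with the -x flag when launching Pig (the script name here is hypothetical):

```shell
pig -x local mytweets.pig      # local mode: runs against the local filesystem
pig -x mapreduce mytweets.pig  # MapReduce mode (the default)
pig -x tez_local mytweets.pig  # Tez in local mode
pig -x tez mytweets.pig        # Tez on the cluster
```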
Commonly Used Bag Functions
• FILTER
Only the tuples that match a given criterion are retained in the target bag
• FOREACH
Iterate through each tuple in the source bag
• ORDER
Order tuples by a given field
• DESCRIBE
Describe the schema of a particular bag
• ILLUSTRATE
Shows the schema along with example data illustrating the lineage of the particular bag
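A minimal sketch tying these operators together (the file name and field names are assumptions, not from the slides):

```pig
-- users.txt is a hypothetical tab-separated file: name, country, age
A = LOAD 'users.txt' AS (name:chararray, country:chararray, age:int);
B = FILTER A BY country == 'Germany';  -- keep only matching tuples
C = FOREACH B GENERATE name, age;      -- iterate, projecting fields per tuple
D = ORDER C BY age DESC;               -- order tuples by a given field
DESCRIBE D;                            -- print the schema of D
ILLUSTRATE D;                          -- schema plus example lineage data
```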
Commonly Used Functions
• Evaluation Functions
– AVG, COUNT, MAX, MIN, SIZE, SUM, TOKENIZE
• String Functions
– STARTSWITH, ENDSWITH, LOWER, UPPER, LTRIM, RTRIM,
TRIM, REGEX_EXTRACT
• Datetime Functions
– CurrentTime, DaysBetween, GetDay, GetHour, GetMinute, ToDate
• Mathematical Functions
– ABS, CEIL, EXP, FLOOR, LOG, RANDOM, ROUND
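A hedged word-count sketch that exercises one function from each group (input file and field names are hypothetical):

```pig
-- words.txt is a hypothetical file containing one line of text per record
A = LOAD 'words.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- eval function
C = FOREACH B GENERATE LOWER(TRIM(word)) AS word;        -- string functions
D = GROUP C BY word;
E = FOREACH D GENERATE group, COUNT(C) AS n;             -- eval function
F = FOREACH E GENERATE group, ROUND((double)n / 2.0);    -- math function
G = FOREACH F GENERATE *, CurrentTime();                 -- datetime function
DUMP G;
```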
Shell Commands
https://pig.apache.org/docs/r0.15.0/cmds.html#run
Data Structures
Relations (bags) such as B, C, and D are built up statement by statement:

C = FILTER A BY country=='Germany';

Pig Script
B = LOAD 'test_data/tweets.txt' USING TextLoader();
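Putting the slide fragments together, a complete runnable script might look like this; the FILTER statement presumes a relation A with a country field, so A's load statement and schema below are assumptions:

```pig
-- TextLoader reads each line of the file as a single chararray tuple
B = LOAD 'test_data/tweets.txt' USING TextLoader() AS (line:chararray);

-- hypothetical relation A with a 'country' field, as the FILTER slide assumes
A = LOAD 'test_data/users.txt' AS (name:chararray, country:chararray);
C = FILTER A BY country == 'Germany';
DUMP C;  -- print the filtered tuples
```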
Sqoop
Apache Sqoop
• What is it?
• How does it work?
• Interfaces
• Examples
• Architecture
What is it?
• Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can
use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle
or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce,
and then export the data back into an RDBMS.
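As a sketch of that round trip, a typical import from MySQL into HDFS and the matching export back might look like this (host, database, table names, credentials, and HDFS paths are all hypothetical):

```shell
# Import a MySQL table into HDFS (connection details are hypothetical)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders

# Export processed data from HDFS back into an RDBMS table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders_summary \
  --export-dir /data/orders_summary
```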
Sqoop – How does it work?