Introduction to Apache Hadoop & Pig

Milind Bhandarkar (milindb@yahoo-inc.com) Y!IM: gridsolutions

Wednesday, 22 September 2010

Agenda
• Apache Hadoop • Map-Reduce • Distributed File System • Writing Scalable Applications • Apache Pig • Q&A

Hadoop: Behind Every Click At Yahoo!

Hadoop At Yahoo! (Some Statistics)
• 40,000+ machines in 20+ clusters • Largest cluster is 4,000 machines • 6 Petabytes of data (compressed, unreplicated) • 1000+ users • 200,000+ jobs/day

-.&-4") +""# '"# +"# "# .--C#2345346# B/./0# "# *""&# Wednesday.!"# $"# %"# *'"# )$1#2345346# +%"#78#29-4/:3# +.'4.-=9>?0#@-A6# B/..0&120.--C#29-4/:3#D78E# +'"# *""# !"#$%&'(%)#*)+.-.<#.%. 22 September 2010 *""%# *""$# *""!# *"+"# /.#') .-%) &"# '"# ("# )"# *"# +45.%) 9&5:2) /-#($4.) 678&40) 3.

Sample Applications
• Data analysis is the inner loop at Yahoo! • Data ⇒ Information ⇒ Value • Log processing: Analytics, reporting, buzz • Search index • Content Optimization, Spam filters • Computational Advertising

BEHIND EVERY CLICK



Who Uses Hadoop?

Why Clusters?

Big Datasets (Data-Rich Computing theme proposal, J. Campbell, et al. 2007)

Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)

Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)

Motivating Examples

Yahoo! Search Assist

Search Assist
• Insight: Related concepts appear close together in text corpus • Input: Web pages • 1 Billion Pages, 10K bytes each • 10 TB of input data • Output: List(word, List(related words))

Search Assist
// Input: List(URL, Text)
foreach URL in Input :
    Words = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)
Group Pairs by word;
// Result: List (word, List(RelatedWords))
foreach word in Pairs :
    Count RelatedWords in GroupedPairs;
// Result: List (word, List(RelatedWords, count))
foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))

You Might Also Know

You Might Also Know
• Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith • if you don't know Joe Smith already • Numbers: • 300 MM users • Average connections per user is 100

You Might Also Know
// Input: List(UserName, List(Connections))
foreach u in UserList : // 300 MM
    foreach x in Connections(u) : // 100
        foreach y in Connections(x) : // 100
            if (y not in Connections(u)) :
                Count(u, y)++; // 3 Trillion Iterations
Sort (u, y) in descending order of Count(u, y).
Choose Top 3 y.
Store (u, {y0, y1, y2}) for serving.

Performance
• 101 Random accesses for each user • Assume 1 ms per random access • 100 ms per user • 300 MM users • 300 days on a single machine
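Written out, the back-of-the-envelope arithmetic behind that last bullet (the slide rounds the result down to roughly 300 days):

300 MM users x 100 ms/user = 3 x 10^7 seconds ≈ 347 days of serial random access on one machine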

MapReduce Paradigm

Map & Reduce
• Primitives in Lisp (& other functional languages), 1970s • Google Paper, 2004 • http://labs.google.com/papers/mapreduce.html

Map
Output_List = Map (Input_List)
Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

Reduce
Output_Element = Reduce (Input_List)
Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385

Parallelism
• Map is inherently parallel • Each list element processed independently • Reduce is inherently sequential • Unless processing multiple lists • Grouping to produce multiple lists

Search Assist Map
// Input: http://hadoop.apache.org
Pairs = Tokenize_And_Pair ( Text ( Input ) )
Output = {
    (apache, hadoop) (hadoop, mapreduce) (hadoop, streaming)
    (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline)
    (hadoop, java) (DFS, namenode) (datanode, block) (replication, default) ...
}

Search Assist Reduce
// Input: GroupedList (word, GroupedList(words))
CountedPairs = CountOccurrences (word, RelatedWords)
Output = {
    (hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ...
}

Issues with Large Data
• Map Parallelism: Splitting input data • Shipping input data • Reduce Parallelism: • Grouping related data • Dealing with failures • Load imbalance


Apache Hadoop
• January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable Version: 0.20 S (Strong Authentication) • Latest Version: 0.21 • Major contributors: Yahoo!, Facebook, Powerset, Cloudera

Apache Hadoop
• Reliable, Performant Distributed file system • MapReduce Programming framework • Related Projects: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...


Problem: Bandwidth to Data
• Scan 100TB Datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data

• Need visibility into data placement

Problem: Scaling Reliably
• Failure is not an option, it's a rule! • 1000 nodes, 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM), MTBF < 1 day • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently

Hadoop Goals
• Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive

Hadoop MapReduce

Think MR
• Record = (Key, Value) • Key: Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output
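In Hadoop's Java API the slide's "Comparable, Serializable" requirements correspond to the WritableComparable and Writable interfaces. The fragment below is a minimal illustration, assuming the built-in Text and IntWritable types; the sample values are arbitrary.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class RecordTypes {
    public static void main(String[] args) {
        // Keys must be comparable so the framework can sort them during the shuffle.
        WritableComparable<?> key = new Text("hadoop");
        // Values only need to be serializable (Writable).
        Writable value = new IntWritable(1);
        System.out.println(key + "\t" + value);
    }
}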

Seems Familiar ?
$ cat /var/log/auth.log* | \
    grep "session opened" | cut -d' ' -f10 | \
    sort | \
    uniq -c > \
    ~/userlist

Map
• Input: (Key1, Value1) • Output: List(Key2, Value2) • Projections, Filtering, Transformation

Shuffle
• Input: List(Key2, Value2) • Output: Sort(Partition(List(Key2, List(Value2)))) • Provided by Hadoop

Reduce
• Input: List(Key2, List(Value2)) • Output: List(Key3, Value3) • Aggregation

Example: Unigrams
• Input: Huge text corpus • Wikipedia Articles (40GB uncompressed) • Output: List of words sorted in descending order of frequency

Unigrams
$ cat ~/wikipedia.txt | \
    sed -e 's/ /\n/g' | grep . | \
    sort | \
    uniq -c > \
    ~/frequencies.txt

$ cat ~/frequencies.txt | \
    # cat | \
    sort -n -k1,1 -r | \
    # cat > \
    ~/unigrams.txt

MR for Unigrams
mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)
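A minimal Java sketch of the same two functions, using Hadoop's org.apache.hadoop.mapreduce API; the class names and whitespace tokenization are illustrative, not taken from the original deck.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UnigramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {   // emit (word, 1) per token
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

class UnigramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {                      // sum = sum + value
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));              // emit (word, sum)
    }
}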

frequency) 45 Wednesday.MR for Unigrams mapper (word. word) reducer (frequency. 22 September 2010 . frequency): emit (frequency. words): for each word in words: emit (word.

Hadoop Streaming
• Hadoop is written in Java • Java MapReduce code is “native” • What about Non-Java Programmers ? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text Input and Output

Hadoop Streaming
• Thin Java wrapper for Map & Reduce Tasks • Forks actual Mapper & Reducer • IPC via stdin, stdout, stderr • Key.toString() \t Value.toString() \n • Slower than Java programs • Allows for quick prototyping / debugging

Hadoop Streaming
$ bin/hadoop jar hadoop-streaming.jar \
    -input in-files -output out-dir \
    -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2 "\t" $1}'


Running Hadoop Jobs

Running a Job
[milindb@gateway ~]$ hadoop jar \
    $HADOOP_HOME/hadoop-examples.jar wordcount \
    /data/newsarchive/20080923 /tmp/newsout
input.FileInputFormat: Total input paths to process : 4
mapred.JobClient: Running job: job_200904270516_5709
mapred.JobClient: map 0% reduce 0%
mapred.JobClient: map 3% reduce 0%
mapred.JobClient: map 7% reduce 0%
...
mapred.JobClient: map 100% reduce 21%
mapred.JobClient: map 100% reduce 31%
mapred.JobClient: map 100% reduce 33%
mapred.JobClient: map 100% reduce 66%
mapred.JobClient: map 100% reduce 100%
mapred.JobClient: Job complete: job_200904270516_5709

Running a Job
mapred.JobClient: Counters: 18
mapred.JobClient:   Job Counters
mapred.JobClient:     Launched reduce tasks=1
mapred.JobClient:     Rack-local map tasks=10
mapred.JobClient:     Launched map tasks=25
mapred.JobClient:     Data-local map tasks=1
mapred.JobClient:   FileSystemCounters
mapred.JobClient:     FILE_BYTES_READ=491145085
mapred.JobClient:     HDFS_BYTES_READ=3068106537
mapred.JobClient:     FILE_BYTES_WRITTEN=724733409
mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307

Running a Job
mapred.JobClient:   Map-Reduce Framework
mapred.JobClient:     Combine output records=73828180
mapred.JobClient:     Map input records=36079096
mapred.JobClient:     Reduce shuffle bytes=233587524
mapred.JobClient:     Spilled Records=78177976
mapred.JobClient:     Map output bytes=4278663275
mapred.JobClient:     Combine input records=371084796
mapred.JobClient:     Map output records=313041519
mapred.JobClient:     Reduce input records=15784903


JobTracker WebUI

JobTracker Status

Jobs Status

Job Details

Job Counters

Job Progress

All Tasks

Task Details

Task Counters

Task Logs

Hadoop Distributed File System

HDFS
• Data is organized into files and directories • Files are divided into uniform sized blocks (default 64MB) and distributed across cluster nodes • HDFS exposes block placement so that computation can be migrated to data

HDFS
• Blocks are replicated (default 3) to handle hardware failure • Replication for performance and fault tolerance (Rack-Aware placement) • HDFS keeps checksums of data for corruption detection and recovery

HDFS
• Master-Worker Architecture • Single NameNode • Many (Thousands) DataNodes

HDFS Master (NameNode)
• Manages filesystem namespace • File metadata (i.e. "inode") • Mapping inode to list of blocks + locations • Authorization & Authentication • Checkpoint & journal namespace changes

Namenode
• Mapping of datanode to list of blocks • Monitor datanode health • Replicate missing blocks • Keeps ALL namespace in memory • 60M objects (File/Block) in 16GB

Datanodes
• Handle block storage on multiple volumes & block integrity • Clients access the blocks directly from datanodes • Periodically send heartbeats and block reports to Namenode • Blocks are stored as underlying OS's files

HDFS Architecture

Replication
• A file's replication factor can be changed dynamically (default 3) • Block placement is rack aware • Block under-replication & over-replication is detected by Namenode • Balancer application rebalances blocks to balance datanode utilization
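A small sketch of the first bullet from the Java side, using FileSystem.setReplication; the file path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Lower the replication factor of an existing file from the default 3 to 2.
        fs.setReplication(new Path("/data/newsarchive/part-00000"), (short) 2);
    }
}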

Accessing HDFS
hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]
    [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>] [-dus <path>]
    [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>] [-rmr <src>]
    [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
    [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>]
    [-getmerge <src> <localdst> [addnl]] [-cat <src>]
    [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]
    [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
    [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]
    [-tail [-f] <path>] [-text <path>]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-chgrp [-R] GROUP PATH...]
    [-count[-q] <path>]
    [-help [cmd]]

HDFS Java API
// Get default file system instance
fs = FileSystem.get(new Configuration());
// Or get file system instance from URI
fs = FileSystem.get(URI.create(uri), new Configuration());
// Create, open, list, ...
OutputStream out = fs.create(path, ...);
InputStream in = fs.open(path, ...);
boolean isDone = fs.delete(path, recursive);
FileStatus[] fstat = fs.listStatus(path);

libHDFS
#include "hdfs.h"

hdfsFS fs = hdfsConnectNewInstance("default", 0);
hdfsFile writeFile = hdfsOpenFile(fs, "/tmp/test.txt", O_WRONLY|O_CREAT, 0, 0, 0);
tSize num_written = hdfsWrite(fs, writeFile, (void*)buffer, sizeof(buffer));
hdfsCloseFile(fs, writeFile);
hdfsFile readFile = hdfsOpenFile(fs, "/tmp/test.txt", O_RDONLY, 0, 0, 0);
tSize num_read = hdfsRead(fs, readFile, (void*)buffer, sizeof(buffer));
hdfsCloseFile(fs, readFile);
hdfsDisconnect(fs);

Hadoop Scalability

Scalability of Parallel Programs
• If one node processes k MB/s, then N nodes should process (k*N) MB/s • If some fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes • Linear Scalability

Latency
Goal: Minimize program execution time

Throughput
Goal: Maximize data processed per unit time

Amdahl's Law
• If computation C is split into N different parts, C1...CN • If partial computation Ci can be sped up by a factor of Si • Then overall computation speedup is limited by ∑Ci / ∑(Ci/Si)

Amdahl's Law
• Suppose Job has 5 phases: P0 is 10 seconds, P1, P2, P3 are 200 seconds each, and P4 is 10 seconds • Sequential runtime = 620 seconds • P1, P2, P3 parallelized on 100 machines with speedup of 80 (each executes in 2.5 seconds) • After parallelization, runtime = 27.5 seconds • Effective Speedup: 22.5
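The arithmetic behind those figures:

Parallel runtime = 10 + 200/80 + 200/80 + 200/80 + 10 = 10 + 2.5 + 2.5 + 2.5 + 10 = 27.5 seconds
Effective speedup = 620 / 27.5 ≈ 22.5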

Efficiency
• Suppose speedup on N processors is S, then Efficiency = S/N • In previous example, Efficiency = 0.225 • Goal: Maximize efficiency, while meeting required SLA

Amdahl's Law

Map Reduce Data Flow

MapReduce Dataflow

MapReduce

Job Submission

Initialization

Scheduling

Execution

Map Task

Reduce Task

Hadoop Performance Tuning

Example
• "Bob" wants to count records in event logs (several hundred GB) • Used Identity Mapper (/bin/cat) & Single counting reducer (/bin/wc -l) • What is he doing wrong? • This happened, really!

MapReduce Performance
• Minimize Overheads • Task Scheduling & Startup • Network Connection Setup • Reduce intermediate data size • Map Outputs + Reduce Inputs • Opportunity to load balance

Number of Maps
• Number of Input Splits, computed by InputFormat • Default FileInputFormat: • Number of HDFS Blocks • Good locality • Does not cross file boundary

Changing Number of Maps
• mapred.map.tasks • mapred.min.split.size • Concatenate small files into big files • Change dfs.block.size • Implement InputFormat • CombineFileInputFormat
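A minimal sketch of those knobs using the 0.20-era property names from the slide; the values are illustrative, and the map-count setting is only a hint that the InputFormat may override.

import org.apache.hadoop.conf.Configuration;

public class MapCountTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapred.map.tasks", 100);                        // hint only
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);   // 256 MB minimum split
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);          // larger blocks => fewer maps
        System.out.println("min split = " + conf.get("mapred.min.split.size"));
    }
}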

Number of Reduces
• mapred.reduce.tasks • Shuffle Overheads (M*R) • 1-2 GB of data per reduce task • Consider time needed for re-execution • Memory consumed per group/bag
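The per-job equivalent in the Java API is a sketch like the following; the reducer count is illustrative and would normally be sized so each reducer sees roughly 1-2 GB.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "unigrams");
        job.setNumReduceTasks(20);   // 0 would make this a map-only job
    }
}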

Shuffle
• Often the most expensive component • M * R Transfers over the network • Sort map outputs (intermediate data) • Merge reduce inputs

Typical Hadoop Network Architecture
[Diagram: racks of 40 nodes with 40 x 1 Gbps links each, 8 x 1 Gbps uplinks per rack]

Improving Shuffle
• Avoid shuffling/sorting if possible • Minimize redundant transfers • Compress intermediate data

Avoid Shuffle
• Set mapred.reduce.tasks to zero • Known as map-only computations • Filters, Projections, Transformations • Number of output files = number of input splits = number of input blocks • May overwhelm HDFS

Minimize Redundant Transfers
• Combiners • Intermediate data compression

Combiners
• When Maps produce many repeated keys • Combiner: Local aggregation after Map & before Reduce • Side-effect free • Same interface as Reducers, and often the same class
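A sketch of the "often the same class" case, reusing the reducer from the earlier unigram sketch as the combiner; the job wiring and class names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class UnigramJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "unigrams");
        job.setMapperClass(UnigramMapper.class);
        job.setCombinerClass(UnigramReducer.class);  // local aggregation after the map
        job.setReducerClass(UnigramReducer.class);   // final aggregation after the shuffle
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}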

Compression
• Often yields huge performance gains • Set mapred.output.compress=true to compress job output • Set mapred.compress.map.output=true to compress map outputs • Codecs: Java zlib (default), LZO, bzip2, native gzip
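A sketch of those two switches with the property names from the slide, plus an explicit codec choice for map output (GzipCodec here; LZO would be configured the same way if installed separately).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class CompressionSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.output.compress", true);      // compress final job output
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate map output
        conf.setClass("mapred.map.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);
    }
}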

Load Imbalance
• Inherent in application • Imbalance in input splits • Imbalance in computations • Imbalance in partitions • Heterogeneous hardware • Degradation over time

Speculative Execution
• Runs multiple instances of slow tasks • Instance that finishes first, succeeds • mapred.map.speculative.execution=true • mapred.reduce.speculative.execution=true • Can dramatically bring in long tails on jobs

Best Practices - 1
• Use higher-level languages (e.g. Pig) • Coalesce small files into larger ones & use bigger HDFS block size • Tune buffer sizes for tasks to avoid spill to disk, consume one slot per task • Use combiner if local aggregation reduces size

Best Practices - 2
• Use compression everywhere • CPU-efficient for intermediate data • Disk-efficient for output data • Use Distributed Cache to distribute small side-files (less than DFS block size) • Minimize number of NameNode/JobTracker RPCs from tasks
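A sketch of the Distributed Cache bullet; the HDFS path is illustrative, and org.apache.hadoop.filecache.DistributedCache is the API of that era.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class SideFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The side-file must already be in HDFS; each task gets a local copy.
        DistributedCache.addCacheFile(new URI("/user/milindb/stopwords.txt"), conf);
    }
}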

Apache Pig

What is Pig?
• System for processing large semi-structured data sets using Hadoop MapReduce platform • Pig Latin: High-level procedural language • Pig Engine: Parser, Optimizer and distributed query execution

22 September 2010 .Pig vs SQL • • • • • Pig is procedural (How) Nested relational data model Schema is optional Scan-centric analytic workloads Limited query optimization • • • • • SQL is declarative Flat relational data model Schema is required OLTP + OLAP workloads Significant opportunity for query optimization Wednesday.

Pig vs Hadoop
• Increase programmer productivity • Decrease duplication of effort • Insulates against Hadoop complexity • Version Upgrades • JobConf configuration tuning • Job Chains

Example
• Input: User profiles, Page visits • Find the top 5 most visited pages by users aged 18-25

In Native Hadoop

In Pig
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Natural Fit

Comparison

Flexibility & Control
• Easy to plug-in user code • Metadata is not mandatory • Pig does not impose a data model on you • Fine grained control • Complex data types

Pig Data Types
• Tuple: Ordered set of fields • Field can be simple or complex type • Nested relational model • Bag: Collection of tuples • Can contain duplicates • Map: Set of (key, value) pairs

22 September 2010 .7182818 • chararray : UTF-8 String • bytearray : blob Wednesday.Simple data types • int : 42 • long : 42L • float : 3.1415f • double : 2.

Expressions
A = LOAD 'data.txt' AS (f1:int, f2:{t:(n1:int, n2:int)}, f3: map[]);
A = { (1, {(2, 3), (4, 6)}, ['yahoo'#'mail']) }
-- A.f1 or A.$0
-- A.f2 or A.$1
-- A.f3 or A.$2

Counting Word Frequencies
• Input: Large text document • Process: • Load the file • For each line, generate word tokens • Group by word • Count words in each group

Load
myinput = load '/user/milindb/text.txt' USING TextLoader() as (myword:chararray);

(program program)
(pig pig)
(program pig)
(hadoop pig)
(latin latin)
(pig latin)

22 September 2010 . (program) (program) (pig) (pig) (program) (pig) (hadoop) (pig) (latin) (latin) (pig) (latin) 124 Wednesday.Tokenize words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*)).

Group
grouped = GROUP words BY $0;

(pig, {(pig), (pig), (pig), (pig), (pig)})
(latin, {(latin), (latin), (latin)})
(hadoop, {(hadoop)})
(program, {(program), (program), (program)})

Count
counts = FOREACH grouped GENERATE group, COUNT(words);

(pig, 5L)
(latin, 3L)
(hadoop, 1L)
(program, 3L)

Store
store counts into '/user/milindb/output' using PigStorage();

pig     5
latin   3
hadoop  1
program 3

Example: Log Processing
-- use a custom loader
Logs = load 'apachelogfile' using CommonLogLoader() as (addr, logname, user, time, method, uri, p, bytes);
-- apply your own function
Cleaned = foreach Logs generate addr, canonicalize(url) as url;
Grouped = group Cleaned by url;
-- run the result through a binary
Analyzed = stream Grouped through 'urlanalyzer.py';
store Analyzed into 'analyzedurls';

Schema on the fly
-- declare your types
Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double);
Good = filter Grades by age > 18 and gpa > 3.0;
-- ordering will be by type
Sorted = order Good by gpa;
store Sorted into 'smartgrownups';

userid).value in turn. COUNT(DistinctUsers). DisinctCount = foreach Grouped { Userid = Logs. 130 Wednesday.userid. 22 September 2010 . DistinctUsers = distinct Userid.Code inside {} will be applied to each -. generate group. -.Nested Data Logs = load ‘weblogs’ as (url. } store DistinctCount into ‘distinctcount’. Grouped = group Logs by url.

Pig Architecture

Pig Frontend

Logical Plan
• Directed Acyclic Graph • Logical Operator as Node • Data flow as edges • Logical Operators • One per Pig statement • Type checking with Schema

22 September 2010 .Pig Statements Load Read data from the file system Write data to the file system Write data to stdout Store Dump Wednesday.

.through Wednesday.Generate Apply expression to each record and generate one or more records Apply predicate to each record and remove records where false Stream records through user-provided binary Filter Stream.. 22 September 2010 .Pig Statements Foreach.

Pig Statements
Group/CoGroup   Collect records with the same key from one or more inputs
Join            Join two or more inputs based on a key
Order...by      Sort records based on a key

Physical Plan
• Pig supports two back-ends • Local • Hadoop MapReduce • 1:1 correspondence with most logical operators • Except Distinct, Group, Cogroup, Join etc

MapReduce Plan
• Detect Map-Reduce boundaries • Group, Cogroup, Order, Distinct • Coalesce operators into Map and Reduce stages • Job.jar is created and submitted to Hadoop JobControl

Lazy Execution
• Nothing really executes until you request output • Store, Dump, Explain, Describe, Illustrate • Advantages • In-memory pipelining • Filter re-ordering across multiple commands

cross. order Wednesday. join. 22 September 2010 . 1 reducer • PARALLEL keyword • group. distinct. cogroup.Parallelism • Split-wise parallelism on Map-side operators • By default.

43) (rachel hernandez.Running Pig $ pig grunt > A = load ‘students’ as (name. 22 September 2010 . (jessica thompson. 2. 40. age. A: (name. gpa ) 141 Wednesday. 73. 23. grunt > store B into ‘good_students’.63) (victor zipper. 3.5’.60) grunt > describe A. age. 1. grunt > B = filter A by gpa > ‘3. grunt > dump A. gpa).

Running Pig
• Batch mode • $ pig myscript.pig • Local mode • $ pig -x local • Java mode (embed pig statements in java) • Keep pig.jar in the class path

Pig for SQL Programmers

SQL to Pig
SQL:  ...FROM MyTable...
Pig:  A = LOAD 'MyTable' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);

SQL:  SELECT col1 + col2, col3 ...
Pig:  B = FOREACH A GENERATE col1 + col2, col3;

SQL:  ...WHERE col2 > 2
Pig:  C = FILTER B by col2 > 2;

SQL to Pig
SQL:  SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2
Pig:  D = GROUP A BY (col1, col2);
      E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);

SQL:  ...HAVING sum(col3) > 5
Pig:  F = FILTER E BY $2 > 5;

SQL:  ...ORDER BY col1
Pig:  G = ORDER F BY $0;

SQL to Pig
SQL:  SELECT DISTINCT col1 from X
Pig:  I = FOREACH A GENERATE col1;
      J = DISTINCT I;

SQL:  SELECT col1, count(DISTINCT col2) FROM X GROUP BY col1
Pig:  K = GROUP A BY col1;
      L = FOREACH K {
              M = DISTINCT A.col2;
              GENERATE FLATTEN(group), count(M);
          }

SQL to Pig
SQL:  SELECT A.col1, B.col3 FROM A JOIN B USING (col1)
Pig:  N = JOIN A by col1 INNER, B by col1 INNER;
      O = FOREACH N GENERATE A.col1, B.col3;
      -- Or
      N = COGROUP A by col1 INNER, B by col1 INNER;
      O = FOREACH N GENERATE flatten(A), flatten(B);
      P = FOREACH O GENERATE A.col1, B.col3;

22 September 2010 .Questions ? Wednesday.
