Apache Pig

Pittsburgh Hadoop User Group
11/3/2009

Dmitriy Ryaboy Ashutosh Chauhan

In This Talk
• What is Pig and why it’s needed
– This part is going to be brief. – It’s a Hadoop User Group, after all.

• Examples
– As well as some extra motivation

• Advanced Features • Improvements currently in development • Interesting research problems
– Want to get involved?

What is Pig
or, “duality of Pig”

Pig compiles data analysis tasks into Map-Reduce jobs and runs them on Hadoop.

Pig Latin is a language for expressing data transformation flows.

Pig can be made to understand other languages, too. There is an SQL prototype. Just a question of compiling a language into internal operator tree.
See: Smith, Agent. The Matrix Trilogy, Warner Bros.

Who Uses Pig
• Yahoo
– “40% of all Hadoop jobs are run with Pig”

• Twitter
– “Some Java-based MapReduce, some Hadoop Streaming” – “Most analysis, and most interesting analysis, done in Pig”

• LinkedIn, AOL, CoolIris, Ning, eBuddy …

Why use Pig
• Express data transformation tasks in a few lines. • “Reach out and touch a Petabyte.”

Image credit: Michelangelo, with apologies.

Pig Latin Example

Why not plain Map/Reduce?

Brevity.

Corresponding Pig Script

Why Not SQL
• Pig Latin allows expressing transformations as a sequence of steps. SQL describes desired outcome. • Writing down a sequence of steps is intuitive to developers.
– Allows expressing more complex data flows – But still no loops or conditionals.

• Support for nested structures (arrays, maps) • Much easier to work with groups
– Let me show you…

Top 5 scores for each player
• Data: < playerId, score, date > • For each player
• Best 5 scores • Date each of these scores was achieved

• Common task; this particular one lifted from http://stackoverflow.com/questions/1467898 • Classically painful in SQL

SQL

• • • •

Self-join Data explosion for same values of tblAbc.aa Readability? I like SQL. But this is far from straightforward.

SQL

We can fix these problems, by going procedural. There goes the brevity and declarativity.

Pig Latin

Diving Deeper

http://www.flickr.com/photos/lollaping/2573664394/

User-Defined Functions
• Java Interfaces
– Reading / Writing Data
• Allows reading from DBs, HBase, custom file formats • Significant API overhaul underway.

– Transformation / Evaluation of Tuples – Group Aggregation

• See Piggybank for examples
http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/

• Support for other languages in the works
https://issues.apache.org/jira/browse/PIG-928

Streaming

Multiquery Optimization
• IO Cost of scanning the data dominates most jobs • Share data scans • This example requires separate pipelines for state and demographics • Or does it?
load users filter out bots

group by state

group by demographic

apply UDFs

apply UDFs

store into ‘bystate’

store into ‘bydemo’

Slide credit: Alan Gates et al, VLDB 2009

Multiquery Optimization
• Tag each record with the pipeline it belongs to • Regular Hadoop grouping on [pipeline #, key] • Multiplexer in reduce stage sends records to appropriate pipeline • Use a single script to compute many things from shared sources.
map filter split local rearrange local rearrange

reduce package foreach multiplex package foreach

Slide credit: Alan Gates et al, VLDB 2009

Hash Join
• Default algorithm • Mapper emits ([key, relid], tuple) • Reducer sees all tuples from each relation with the same key, performs join
– easy, due to sorted order

• Optimization: entries in the last relation can be streamed through (not accumulated in memory for each key) • Can join multiple tables, supports outer joins • If one relation has many duplicate keys, put it last!

Fragment-Replicate Join
• Join large table A with 1 or more small tables (B, C) • Fragment large table across mappers • Replicate smaller tables in entirety to all mappers
A B C B C

• Supports multiple tables • Use when all small tables together can fit in memory on a single map task
3pa 2pa M M

B C

Merge Join
• • • • 2 inputs, both sorted on join key Build sparse index into B Partition A across mappers Use index to read B from corresponding block on each mapper • Bounded buffering on both sides
– least memory intensive

Index

23pa M pa M

• Significant speed gains, especially on skewed data

1pa M

Skewed Join
• 1+ keys with more tuples per key than fit in reducer memory, in both relations • Build histogram based on sample • Round-robin skewed keys into multiple reducers • Stream other table, sending each record to all reducers responsible for key

http://wiki.apache.org/pig/PigSkewedJoinSpec

In Development
• Release 0.5 : Hadoop 20 compatibility
– Or apply a patch, or get Cloudera distro – Imminent

• Release 0.6 (or later)
– – – – – – – Load/Store redesign. See PIG-966 UDFs in other languages, see PIG-923 Columnar Storage, see Zebra in /contrib Stored Schemas, see PIG-760 Metadata service (“Owl”), see PIG-823 SQL compiler, see PIG-824 Auto selection of joins when stats are known, see us

Research Problems
aka, when pigs fly

• Extending the language
– Functions – Conditional logic – Loops

• Smarter data storage
– Learn from workload
• auto-selected column families • different sort orders • overlapping projections

– Go Faster
• pushing queries to storage • lazy decompression

Research Problems
aka, when pigs fly

• Adaptive optimization
– Change plan mid-flight based on observation of current environment – Can we do this without waiting for MR stage to finish?

• Continuous Queries, Approximate Early Results
– Initial work at Berkeley: Map-Reduce Online *

• Cost-Based Optimization
– Appropriate cost model? – balance of disk IO, network traffic, task overhead, memory limitations on individual tasks… – Different from, but similar to, distributed DBs

* http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Further Reading
• Gates et al, Building a High-Level Dataflow System on top of MapReduce: The Pig Experience” VLDB 2009 http://infolab.stanford.edu/~olston/publications/vldb09.pdf • Olston et al, “Generating example data for dataflow programs” SIGMOD 2009 (best paper)
http://infolab.stanford.edu/~olston/publications/sigmod09.pdf

• Olston et al, “Pig Latin: A not-so-foreign language for data processing” SIGMOD 2008 http://infolab.stanford.edu/~olston/publications/sigmod08.pdf • Olston et al, “Automatic optimization of parallel dataflow programs” USENIX 2008 http://infolab.stanford.edu/~olston/publications/usenix08.pdf • More: http://wiki.apache.org/pig/PigTalksPapers

Performance
average over 12 benchmark queries
Pig speed as factor of Hadoop time (average runtime)
7.6

2.5 1.8

1.6

1.5

1.4

1.22

1

alpha

11/21/2008

1/20/2009

2/23/2009

5/28/2009

7/28/2009

8/27/2009

10/18/2009

Performance
weighted average over 12 benchmark queries
Pig speed as factor of Hadoop time (weighted average)
11.20

3.26 2.20

1.97

1.83

1.68

1.53 1.04

alpha

11/21/2008

1/20/2009

2/23/2009

5/28/2009

7/28/2009

8/27/2009

10/18/2009

Queries?
dvryaboy@cmu.edu achauha1@cs.cmu.edu http://hadoop.apache.org/pig pig-user@apache.org @squarecog