• Embed Doc
  • Readcast
  • Collections
  • CommentGo Back
Download
 
Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston
Yahoo! Research
Benjamin Reed
Yahoo! Research
Utkarsh Srivastava
Yahoo! Research
Ravi Kumar
§
Yahoo! Research
Andrew Tomkins
Yahoo! Research
ABSTRACT
There is a growing need for ad-hoc analysis of extremelylarge data sets, especially at internet companies where inno-vation critically depends on being able to analyze terabytesof data collected every day. Parallel database products, e.g.,Teradata, offer a solution, but are usually prohibitively ex-pensive at this scale. Besides, many of the people who ana-lyze this data are entrenched procedural programmers, whofind the declarative, SQL style to be unnatural. The successof the more procedural
map-reduce
programming model, andits associated scalable implementations on commodity hard-ware, is evidence of the above. However, the map-reduceparadigm is too low-level and rigid, and leads to a great dealof custom user code that is hard to maintain, and reuse.We describe a new language called
Pig Latin 
that we havedesigned to fit in a sweet spot between the declarative styleof SQL, and the low-level, procedural style of map-reduce.The accompanying system, Pig, is fully implemented, andcompiles Pig Latin into physical plans that are executedover
Hadoop
, an open-source, map-reduce implementation.We give a few examples of how engineers at Yahoo! are usingPig to dramatically reduce the time required for the develop-ment and execution of their data analysis tasks, compared tousing Hadoop directly. We also report on a novel debuggingenvironment that comes integrated with Pig, that can leadto even higher productivity gains. Pig is an open-source,Apache-incubator project, and available for general use.
Categories and Subject Descriptors:
H.2.3 Database Management: Languages
General Terms:
Languages.
olston@yahoo-inc.com
breed@yahoo-inc.com
utkarsh@yahoo-inc.com
§
ravikuma@yahoo-inc.com
atomkins@yahoo-inc.com
Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.
SIGMOD’08,
June 9–12, 2008, Vancouver, BC, Canada.Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.
1. INTRODUCTION
At a growing number of organizations, innovation revolvesaround the collection and analysis of enormous data setssuch as web crawls, search logs, and click streams. Inter-net companies such as Amazon, Google, Microsoft, and Ya-hoo! are prime examples. Analysis of this data constitutesthe innermost loop of the product improvement cycle. Forexample, the engineers who develop search engine rankingalgorithms spend much of their time analyzing search logslooking for exploitable trends.The sheer size of these data sets dictates that it be storedand processed on highly parallel systems, such as shared-nothing clusters. Parallel database products, e.g., Teradata,Oracle RAC, Netezza, offer a solution by providing a simpleSQL query interface and hiding the complexity of the phys-ical cluster. These products however, can be prohibitivelyexpensive at web scale. Besides, they wrench programmersaway from their preferred method of analyzing data, namelywriting imperative scripts or code, toward writing declara-tive queries in SQL, which they often find unnatural, andoverly restrictive.As evidence of the above, programmers have been flock-ing to the more procedural
map-reduce
[4] programmingmodel. A map-reduce program essentially performs a group-by-aggregation in parallel over a cluster of machines. Theprogrammer provides a map function that dictates how thegrouping is performed, and a reduce function that performsthe aggregation. What is appealing to programmers aboutthis model is that there are only two high-level declarativeprimitives (map and reduce) to enable parallel processing,but the rest of the code, i.e., the map and reduce functions,can be written in any programming language of choice, andwithout worrying about parallelism.Unfortunately, the map-reduce model has its own set of limitations. Its one-input, two-stage data flow is extremelyrigid. To perform tasks having a different data flow, e.g., joins or
n
stages, inelegant workarounds have to be devised.Also, custom code has to be written for even the most com-mon operations, e.g., projection and filtering. These factorslead to code that is difficult to reuse and maintain, and inwhich the semantics of the analysis task are obscured. More-over, the opaque nature of the map and reduce functionsimpedes the ability of the system to perform optimizations.We have developed a new language called
Pig Latin 
thatcombines the best of both worlds: high-level declarativequerying in the spirit of SQL, and low-level, procedural pro-gramming `a la map-reduce.
 
Example
1.
Suppose we have a table
urls: (url,category, pagerank)
. The following is a simple SQL query that finds, for each sufficiently large category, the averagepagerank of high-pagerank urls in that category.
SELECT category, AVG(pagerank)FROM urls WHERE pagerank > 0.2GROUP BY category HAVING COUNT(*) >
10
6
An equivalent Pig Latin program is the following. (Pig Latin is described in detail in Section 3; a detailed under-standing of the language is not required to follow this exam-ple.)
good_urls = FILTER urls BY pagerank > 0.2;groups = GROUP good_urls BY category;big_groups = FILTER groups BY COUNT(good_urls)>
10
6
;
output = FOREACH big_groups GENERATEcategory, AVG(good_urls.pagerank);
As evident from the above example, a Pig Latin programis a sequence of steps, much like in a programming language,each of which carries out a single data transformation. Thischaracteristic is immediately appealing to many program-mers. At the same time, the transformations carried out ineach step are fairly high-level, e.g., filtering, grouping, andaggregation, much like in SQL. The use of such high-levelprimitives renders low-level manipulations (as required inmap-reduce) unnecessary.In effect, writing a Pig Latin program is similar to specify-ing a query execution plan (i.e., a dataflow graph), therebymaking it easier for programmers to understand and controlhow their data processing task is executed. To experiencedsystem programmers, this method is much more appealingthan encoding their task as an SQL query, and then coerc-ing the system to choose the desired plan through optimizerhints. (Automatic query optimization has its limits, espe-cially with uncataloged data, prevalent user-defined func-tions, and parallel execution, which are all features of ourtarget environment; see Section 2.) Even with a Pig Latinoptimizer in the future, users will still be able to request con-formation to the execution plan implied by their program.Pig Latin has several other unconventional features thatare important for our setting of casual ad-hoc data analy-sis by programmers. These features include support for aflexible, fully nested data model, extensive support for user-defined functions, and the ability to operate over plain inputfiles without any schema information. Pig Latin also comeswith a novel debugging environment that is especially use-ful when dealing with enormous data sets. We elaborate onthese features in Section 2.Pig Latin is fully implemented by our system,
Pig 
, and isbeing used by programmers at Yahoo! for data analysis. PigLatin programs are currently compiled into (ensembles of)map-reduce jobs that are executed using
Hadoop
, an open-source, scalable implementation of map-reduce. Alternativebackends can also be plugged in. Pig is an open-sourceproject in the Apache incubator
1
.The rest of the paper is organized as follows. In the nextsection, we describe the various features of Pig, and theunderlying motivation for each. In Section 3, we dive intothe Pig Latin data model and language. In Section 4, we
1
http://incubator.apache.org/pig
describe the current implementation of Pig. We describeour novel debugging environment in Section 5, and outlinea few real usage scenarios of Pig in Section 6. Finally, wediscuss related work in Section 7, and conclude.
2. FEATURES AND MOTIVATION
The overarching design goal of Pig is to be appealing toexperienced programmers for performing ad-hoc analysis of extremely large data sets. Consequently, Pig Latin has anumber of features that might seem surprising when viewedfrom a traditional database and SQL perspective. In thissection, we describe the features of Pig, and the rationalebehind them.
2.1 Dataflow Language
As seen in Example 1, in Pig Latin, a user specifies a se-quence of steps where each step specifies only a
single
, high-level data transformation. This is stylistically different fromthe SQL approach where the user specifies a set of declara-tive constraints that collectively define the result. While theSQL approach is good for non-programmers and/or smalldata sets, experienced programmers who must manipulatelarge data sets often prefer the Pig Latin approach. As oneof our users says,“I much prefer writing in Pig [Latin] versus SQL.The step-by-step method of creating a programin Pig [Latin] is much cleaner and simpler to usethan the single block method of SQL. It is eas-ier to keep track of what your variables are, andwhere you are in the process of analyzing yourdata.” – Jasmine Novak, Engineer, Yahoo!Note that although Pig Latin programs supply an explicitsequence of operations, it is not necessary that the oper-ations be executed in that order. The use of high-level,relational-algebra-style primitives, e.g.,
GROUP, FILTER
, al-lows traditional database optimizations to be carried out, incases where the system and/or user have enough confidencethat query optimization can succeed.For example, suppose one is interested in the set of urls of pages that are classified as spam, but have a high pagerankscore. In Pig Latin, one can write:
spam_urls = FILTER urls BY isSpam(url);culprit_urls = FILTER spam_urls BY pagerank > 0.8;
where
culprit_urls
contains the final set of urls that theuser is interested in.The above Pig Latin fragment suggests that we first findthe spam urls through the function
isSpam
, and then filterthem by
pagerank
. However, this might not be the mostefficient method. In particular,
isSpam
might be an expen-sive user-defined function that analyzes the url’s content forspaminess. Then, it will be much more efficient to filter theurls by pagerank first, and invoke
isSpam
only on the pagesthat have high pagerank.With Pig Latin, this optimization opportunity is availableto the system. On the other hand, if these filters were buriedwithin an opaque map or reduce function, such reorderingand optimization would effectively be impossible.
2.2 Quick Start and Interoperability
Pig is designed to support ad-hoc data analysis. If a userhas a data file obtained, say, from a dump of the search
 
engine logs, she can run Pig Latin queries over it directly.She need only provide a function that gives Pig the ability toparse the content of the file into tuples. There is no need togo through a time-consuming data import process prior torunning queries, as in conventional database managementsystems. Similarly, the output of a Pig program can beformatted in the manner of the user’s choosing, accordingto a user-provided function that converts tuples into a bytesequence. Hence it is easy to use the output of a Pig analysissession in a subsequent application, e.g., a visualization orspreadsheet application such as Excel.It is important to keep in mind that Pig is but one of manyapplications in the rich “data ecosystem” of a company likeYahoo! By operating over data residing in external files, andnot taking over control over the data, Pig readily interoper-ates with other applications in the ecosystem.The reasons that conventional database systems do re-quire importing data into system-managed tables are three-fold: (1) to enable transactional consistency guarantees, (2)to enable efficient point lookups (via physical tuple identi-fiers), and (3) to curate the data on behalf of the user, andrecord the schema so that other users can make sense of thedata. Pig only supports read-only data analysis workloads,and those workloads tend to be scan-centric, so transactionalconsistency and index-based lookups are not required. Also,in our environment users often analyze a temporary data setfor a day or two, and then discard it, so data curating andschema management can be overkill.In Pig, stored schemas are strictly optional. Users maysupply schema information on the fly, or perhaps not at all.Thus, in Example 1, if the user knows the the third field of the file that stores the
urls
table is
pagerank
but does notwant to provide the schema, the first line of the Pig Latinprogram can be written as:
good_urls = FILTER urls BY $2 > 0.2;
where $2 uses positional notation to refer to the third field.
2.3 Nested Data Model
Programmers often think in terms of nested data struc-tures. For example, to capture information about the posi-tional occurrences of terms in a collection of documents, aprogrammer would not think twice about creating a struc-ture of the form
Map<documentId, Set<positions>>
foreach
term
.Databases, on the other hand, allow only flat tables, i.e.,only atomic fields as columns, unless one is willing to violatethe First Normal Form (1NF) [7]. To capture the same in-formation about terms above, while conforming to 1NF, onewould need to
normalize
the data by creating two tables:
term_info: (termId, termString, ...)position_info: (termId, documentId, position)
The same positional occurence information can then bereconstructed by joining these two tables on
termId
andgrouping on
termId, documentId
.Pig Latin has a flexible, fully nested data model (describedin Section 3.1), and allows complex, non-atomic data typessuch as set, map, and tuple to occur as fields of a table.There are several reasons why a nested model is more ap-propriate for our setting than 1NF:
A nested data model is closer to how programmers think,and consequently much more natural to them than nor-malization.
Data is often stored on disk in an inherently nested fash-ion. For example, a web crawler might output for eachurl, the set of outlinks from that url. Since Pig oper-ates directly on files (Section 2.2), separating the dataout into normalized form, and later recombining through joins can be prohibitively expensive for web-scale data.
A nested data model also allows us to fulfill our goal of having an algebraic language (Section 2.1), where eachstep carries out only a single data transformation. Forexample, each tuple output by our
GROUP
primitive hasone non-atomic field: a nested set of tuples from theinput that belong to that group. The
GROUP
construct isexplained in detail in Section 3.5.
A nested data model allows programmers to easily writea rich set of user-defined functions, as shown in the nextsection.
2.4 UDFs as First-Class Citizens
A significant part of the analysis of search logs, crawl data,click streams, etc., is custom processing. For example, a usermay be interested in performing natural language stemmingof a search term, or figuring out whether a particular webpage is spam, and countless other tasks.To accommodate specialized data processing tasks, PigLatin has extensive support for
user-defined functions
(UDFs). Essentially all aspects of processing in Pig Latin in-cluding grouping, filtering, joining, and per-tuple processingcan be customized through the use of UDFs.The input and output of UDFs in Pig Latin follow ourflexible, fully nested data model. Consequently, a UDF tobe used in Pig Latin can take non-atomic parameters asinput, and also output non-atomic values. This flexibility isoften very useful as shown by the following example.
Example
2.
Continuing with the setting of Example 1,suppose we want to find for each category, the top 10 urlsaccording to pagerank. In Pig Latin, one can simply write:
groups = GROUP urls BY category;output = FOREACH groups GENERATEcategory, top10(urls);
where top10() is a UDF that accepts a set of urls (for each group at a time), and outputs a set containing the top 10 urls by pagerank for that group.
2
Note that our final output in this case contains non-atomic fields: there is a tuple for each category, and one of the fields of the tuple is the set of the top 10 urls in that category.
Due to our flexible data model, the return type of a UDFdoes not restrict the context in which it can be used. PigLatin has only one type of UDF that can be used in all theconstructs such as filtering, grouping, and per-tuple process-ing. This is in contrast to SQL, where only scalar functionsmay be used in the
SELECT
clause, set-valued functions canonly appear in the
FROM
clause, and aggregation functionscan only be applied in conjunction with a
GROUP BY
or a
PARTITION BY
.Currently, Pig UDFs are written in Java. We are buildingsupport for interfacing with UDFs written in arbitrary lan-
2
In practice, a user would probably write a more genericfunction than
top10()
: one that takes
k
as a parameter tofind the top
k
tuples, and also the field according to whichthe top
k
must be found (
pagerank
in this example).
of 00

Leave a Comment

You must be to leave a comment.
Submit
Characters: ...
You must be to leave a comment.
Submit
Characters: ...