
**Data-Intensive Text Processing with MapReduce**

Jimmy Lin and Chris Dyer

University of Maryland, College Park

Manuscript prepared April 11, 2010

This is the pre-production manuscript of a book in the Morgan & Claypool Synthesis

Lectures on Human Language Technologies. Anticipated publication date is mid-2010.


Contents

Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

1   Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Computing in the Clouds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Big Ideas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3 Why Is This Diﬀerent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.4 What This Book Is Not . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2   MapReduce Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1 Functional Programming Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2 Mappers and Reducers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3 The Execution Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4 Partitioners and Combiners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.5 The Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.6 Hadoop Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3   MapReduce Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.1 Local Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.1.1 Combiners and In-Mapper Combining 41

3.1.2 Algorithmic Correctness with Local Aggregation 46

3.2 Pairs and Stripes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.3 Computing Relative Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.4 Secondary Sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.5 Relational Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3.5.1 Reduce-Side Join 64

3.5.2 Map-Side Join 66

3.5.3 Memory-Backed Join 67


3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4   Inverted Indexing for Text Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.1 Web Crawling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 Inverted Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.3 Inverted Indexing: Baseline Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.4 Inverted Indexing: Revised Implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.5 Index Compression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.5.1 Byte-Aligned and Word-Aligned Codes 80

4.5.2 Bit-Aligned Codes 82

4.5.3 Postings Compression 84

4.6 What About Retrieval? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.7 Summary and Additional Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5   Graph Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.1 Graph Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.2 Parallel Breadth-First Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.3 PageRank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.4 Issues with Graph Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.5 Summary and Additional Readings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6   EM Algorithms for Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

6.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.1.1 Maximum Likelihood Estimation 115

6.1.2 A Latent Variable Marble Game 117

6.1.3 MLE with Latent Variables 118

6.1.4 Expectation Maximization 119

6.1.5 An EM Example 120

6.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.2.1 Three Questions for Hidden Markov Models 123

6.2.2 The Forward Algorithm 125

6.2.3 The Viterbi Algorithm 126


6.2.4 Parameter Estimation for HMMs 129

6.2.5 Forward-Backward Training: Summary 133

6.3 EM in MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.3.1 HMM Training in MapReduce 135

6.4 Case Study: Word Alignment for Statistical Machine Translation . . . . . 138

6.4.1 Statistical Phrase-Based Translation 139

6.4.2 Brief Digression: Language Modeling with MapReduce 142

6.4.3 Word Alignment 143

6.4.4 Experiments 144

6.5 EM-Like Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.5.1 Gradient-Based Optimization and Log-Linear Models 147

6.6 Summary and Additional Readings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

7   Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

7.1 Limitations of MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

7.2 Alternative Computing Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

7.3 MapReduce and Beyond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

C H A P T E R 1

Introduction

MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, an Apache project whose development was led by Yahoo. Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with

MapReduce. Given this focus, it makes sense to start with the most basic question:

Why? There are many answers to this question, but we focus on two. First, “big data”

is a fact of the world, and therefore an issue that real-world systems must grapple with.

Second, across a wide range of text processing applications, more data translates into

more eﬀective algorithms, and thus it makes sense to take advantage of the plentiful

amounts of data that surround us.

Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: "web-scale" processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies but to countless startups and niche players as well. Just think: how many companies do you know that start their pitch with "we're going to harvest information on the web and. . . "?

Another strong area of growth is the analysis of user behavior data. Any operator

of a moderately successful website can record user activity and in a matter of weeks (or

sooner) be drowning in a torrent of log data. In fact, logging user behavior generates

so much data that many organizations simply can’t cope with the volume, and either

turn the functionality oﬀ or throw away data after some time. This represents lost

opportunities, as there is a broadly-held belief that great value lies in insights derived

from mining such data. Knowing what users look at, what they click on, how much

time they spend on a web page, etc. leads to better business decisions and competitive

advantages. Broadly, this is known as business intelligence, which encompasses a wide

range of technologies including data warehousing, data mining, and analytics.


How much data are we talking about? A few examples: Google grew from processing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day with MapReduce in 2008 [46]. In April 2009, a blog post^1 was written about eBay's two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed^2 similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store. More distressing, increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s to a couple of terabytes today (several orders of magnitude). On the other hand, latency and bandwidth have improved relatively little: in the case of latency, perhaps a 2× improvement during the last quarter century, and in the case of bandwidth, perhaps 50×. Given the tendency for individuals and organizations to continuously fill up whatever capacity is available, large-data problems are growing increasingly severe.

Moving beyond the commercial sphere, the importance of data management has been widely recognized across scientific disciplines, where petabyte-scale datasets are also becoming increasingly common [21]. For example:

• The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [20]. Today, the Large Hadron Collider (LHC) near Geneva is the world's largest particle accelerator, designed to probe the mysteries of the universe, including the fundamental nature of matter, by recreating conditions shortly following the Big Bang. When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.^3

• Astronomers have long recognized the importance of a “digital observatory” that

would support the data needs of researchers across the globe—the Sloan Digital

Sky Survey [145] is perhaps the most well known of these projects. Looking into

the future, the Large Synoptic Survey Telescope (LSST) is a wide-ﬁeld instrument

that is capable of observing the entire sky every few days. When the telescope

comes online around 2015 in Chile, its 3.2 gigapixel primary camera will produce

approximately half a petabyte of archive images every month [19].

• The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored, organized, and delivered to scientists for further study. Given the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact of this technology is nothing less than transformative [103]. The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Scientists are predicting that, in the not-so-distant future, sequencing an individual's genome will be no more complex than getting a blood test today—ushering in a new era of personalized medicine, where interventions can be specifically targeted for an individual.

^1 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
^2 http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
^3 http://public.web.cern.ch/public/en/LHC/Computing-en.html

Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate, explore, and mine massive datasets [72]—this has been hailed as the emerging "fourth paradigm" of science [73] (complementing theory, experiments, and simulations). In other areas of academia, particularly computer science, systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as "toy systems" with limited utility. Large data is a fact of today's world and data-intensive processing is fast becoming a necessity, not merely a luxury or curiosity.

Although large data comes in a variety of forms, this book is primarily concerned with processing large amounts of text, but touches on other types of data as well (e.g., relational and graph data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing (NLP) and information retrieval (IR). Recent work in these fields is dominated by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application. There are three components to this approach: data, representations of the data, and some method for capturing regularities in the data. Data are called corpora (singular, corpus) by NLP researchers and collections by those from the IR community. Aspects of the representations of the data are called features, which may be "superficial" and easy to extract, such as the words and sequences of words themselves, or "deep" and more difficult to extract, such as the grammatical relationship between words. Finally, algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application. One common application, classification, is to sort text into categories. Examples include: Is this email spam or not spam? Is this word part of an address or a location? The first task is easy to understand, while the second task is an instance of what NLP researchers call named-entity detection [138], which is useful for local search and pinpointing locations on maps. Another common application is to rank texts according to some criteria—search is a good example, which involves ranking documents by relevance to the user's query. Another example is to automatically situate texts along a scale of "happiness", a task known as sentiment analysis or opinion mining [118], which has been applied to everything from understanding political discourse in the blogosphere to predicting the movement of stock prices.

There is a growing body of evidence, at least in text processing, that of the three

components discussed above (data, features, algorithms), data probably matters the

most. Superﬁcial word-level features coupled with simple models in most cases trump

sophisticated models over deeper features and less data. But why can’t we have our cake

and eat it too? Why not both sophisticated models and deep features applied to lots of

data? Because inference over sophisticated models and extraction of deep features are

often computationally intensive, they don’t scale well.

Consider a simple task such as determining the correct usage of easily confusable

words such as “than” and “then” in English. One can view this as a supervised machine

learning problem: we can train a classiﬁer to disambiguate between the options, and

then apply the classiﬁer to new instances of the problem (say, as part of a grammar

checker). Training data is fairly easy to come by—we can just gather a large corpus of

texts and assume that most writers make correct choices (the training data may be noisy,

since people make mistakes, but no matter). In 2001, Banko and Brill [14] published

what has become a classic paper in natural language processing exploring the eﬀects

of training data size on classiﬁcation accuracy, using this task as the speciﬁc example.

They explored several classiﬁcation algorithms (the exact ones aren’t important, as we

shall see), and not surprisingly, found that more data led to better accuracy. Across

many diﬀerent algorithms, the increase in accuracy was approximately linear in the

log of the size of the training data. Furthermore, with increasing amounts of training

data, the accuracy of diﬀerent algorithms converged, such that pronounced diﬀerences

in eﬀectiveness observed on smaller datasets basically disappeared at scale. This led to

a somewhat controversial conclusion (at least at the time): machine learning algorithms

really don’t matter, all that matters is the amount of data you have. This resulted in

an even more controversial recommendation, delivered somewhat tongue-in-cheek: we

should just give up working on algorithms and simply spend our time gathering data

(while waiting for computers to become faster so we can process the data).
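To make the setup concrete, here is a minimal sketch of such a confusable-word classifier: an illustrative naive Bayes over neighboring-word features, not Banko and Brill's actual algorithms (which, as their results suggest, matter less than the amount of training text).

```python
import math
from collections import defaultdict

def extract_examples(text):
    """Treat each occurrence of 'than'/'then' in a (presumed mostly
    correct) corpus as a labeled example; features are the words
    immediately to the left and right."""
    tokens = text.lower().split()
    examples = []
    for i, tok in enumerate(tokens):
        if tok in ("than", "then"):
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            examples.append(((left, right), tok))
    return examples

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""
    def __init__(self):
        self.label_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, features):
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best
```

Training on a larger corpus simply means feeding `extract_examples` more text; the point of the Banko and Brill result is that this, rather than the choice of classifier, is what moves accuracy.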

As another example, consider the problem of answering short, fact-based questions such as "Who shot Abraham Lincoln?" Instead of returning a list of documents that the user would then have to sort through, a question answering (QA) system would directly return the answer: John Wilkes Booth. This problem gained interest in the late 1990s, when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such questions based on pattern matching [27, 53, 92]. Suppose you wanted the answer to the above question. As it turns out, you can simply search for the phrase "shot Abraham Lincoln" on the web and look for what appears to its left. Or better yet, look through multiple instances of this phrase and tally up the words that appear to the left. This simple strategy works surprisingly well, and has become known as the redundancy-based approach to question answering. It capitalizes on the insight that in a very large text collection (i.e., the web), answers to commonly-asked questions will be stated in obvious ways, such that pattern-matching techniques suffice to extract answers accurately.
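The tallying step can be sketched in a few lines; the snippets below are invented stand-ins for web search results.

```python
from collections import Counter

def redundancy_answer(snippets, pattern):
    """Redundancy-based QA: across many snippets, tally the word
    immediately to the left of the answer pattern and return the
    most frequent one."""
    votes = Counter()
    pat = pattern.split()
    n = len(pat)
    for snippet in snippets:
        tokens = snippet.split()
        for i in range(1, len(tokens) - n + 1):
            if tokens[i:i + n] == pat:
                votes[tokens[i - 1]] += 1  # word to the left of the match
    return votes.most_common(1)[0][0] if votes else None
```

A real system would also need to stitch multi-word answers such as "John Wilkes Booth" back together, for example by extending the tally to longer left contexts.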

Yet another example concerns smoothing in web-scale language models [25]. A language model is a probability distribution that characterizes the likelihood of observing a particular sequence of words, estimated from a large corpus of texts. They are useful in a variety of applications, such as speech recognition (to determine what the speaker is more likely to have said) and machine translation (to determine which of possible translations is the most fluent, as we will discuss in Section 6.4). Since there are infinitely many possible strings, and probabilities must be assigned to all of them, language modeling is a more challenging task than simply keeping track of which strings were seen how many times: some number of likely strings will never be encountered, even with lots and lots of training data! Most modern language models make the Markov assumption: in an n-gram language model, the conditional probability of a word depends only on the n−1 previous words. Thus, by the chain rule, the probability of a sequence of words can be decomposed into the product of n-gram probabilities. Nevertheless, an enormous number of parameters must still be estimated from a training corpus: potentially V^n parameters, where V is the number of words in the vocabulary. Even if we treat every word on the web as the training corpus from which to estimate the n-gram probabilities, most n-grams—in any language, even English—will never have been seen.
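The Markov assumption and chain-rule decomposition can be made concrete for the bigram (n = 2) case. Note how the maximum-likelihood estimate assigns probability zero to any sequence containing an unseen bigram; this is exactly the sparseness problem that smoothing addresses. A minimal illustrative sketch:

```python
from collections import defaultdict

def train_bigram_mle(tokens):
    """Maximum-likelihood bigram estimates:
    P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    bigram, context = defaultdict(int), defaultdict(int)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram[(prev, cur)] += 1
        context[prev] += 1
    def p(prev, cur):
        return bigram[(prev, cur)] / context[prev] if context[prev] else 0.0
    return p

def sequence_prob(p, tokens):
    """Chain rule under the Markov assumption: multiply the
    conditional probabilities of each word given its predecessor."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p(prev, cur)
    return prob
```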

To cope with this sparseness, researchers have developed a number of smoothing techniques [35, 102, 79], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. Smoothing approaches vary in effectiveness, in terms of both intrinsic and application-specific metrics. In 2007, Brants et al. [25] described language models trained on up to two trillion words.^4 Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique the authors affectionately referred to as "stupid backoff".^5 Not surprisingly, stupid backoff didn't work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data.
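The stupid backoff recurrence itself is simple enough to sketch. The toy in-memory version below assumes precomputed n-gram counts and the discount factor α = 0.4 reported by Brants et al.; their contribution was making this work in a distributed setting at trillion-word scale, which this sketch does not attempt.

```python
def ngram_counts(tokens, max_n=2):
    """Count all n-grams up to max_n, keyed by tuple."""
    counts = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts

def stupid_backoff(counts, total, ngram, alpha=0.4):
    """Score S(w | context): the relative frequency if the n-gram was
    seen, otherwise back off to a shorter context, discounted by alpha.
    Scores are not normalized, hence not true probabilities."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return alpha * stupid_backoff(counts, total, ngram[1:], alpha)
```

Unlike Kneser-Ney smoothing, there is no discount bookkeeping and no normalization pass, which is precisely what made the scheme easy to scale.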

^4 As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill's paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.
^5 As in, so stupid it couldn't possibly work.

Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [65].^6 Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic particles, human use of language is not constrained by succinct, universal "laws of grammar". There are of course rules that govern the formation of words and sentences—for example, that verbs appear before objects in English, and that subjects and verbs must agree in number in many languages—but real-world language is affected by a multitude of other factors as well: people invent new words and phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about a fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing.^7 The world, he would say, is the best description of itself. In the same way, the more observations we gather about language use, the more accurate a description we have of language itself. This, in turn, translates into more effective algorithms and systems.

So, in summary, why large data? In some ways, the first answer is similar to the reason people climb mountains: because they're there. But the second answer is even more compelling. Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems. Now that we've addressed the why, let's tackle the how. Let's start with the obvious observation: data-intensive processing is beyond the capability of any individual machine and requires clusters—which means that large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce does, and the rest of this book is about the how.

1.1 COMPUTING IN THE CLOUDS

For better or for worse, it is often difficult to untangle MapReduce and large-data processing from the broader discourse on cloud computing. True, there is substantial promise in this new paradigm of computing, but unwarranted hype by the media and popular sources threatens its credibility in the long run. In some ways, cloud computing is simply brilliant marketing. Before clouds, there were grids,^8 and before grids, there were vector supercomputers, each having claimed to be the best thing since sliced bread.

^6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena, which is the exact opposite of the data-driven approach.
^7 On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).

So what exactly is cloud computing? This is one of those questions where ten

experts will give eleven diﬀerent answers; in fact, countless papers have been written

simply to attempt to deﬁne the term (e.g., [9, 31, 149], just to name a few examples).

Here we oﬀer up our own thoughts and attempt to explain how cloud computing relates

to MapReduce and data-intensive processing.

At the most superficial level, everything that used to be called web applications has been rebranded to become "cloud applications", which includes what we have previously called "Web 2.0" sites. In fact, anything running inside a browser that gathers and stores user-generated content now qualifies as an example of cloud computing. This includes social-networking services such as Facebook, video-sharing sites such as YouTube, web-based email services such as Gmail, and applications such as Google Docs. In this context, the cloud simply refers to the servers that power these sites, and user data is said to reside "in the cloud". The accumulation of vast quantities of user data creates large-data problems, many of which are suitable for MapReduce. To give two concrete examples: a social-networking site analyzes connections in the enormous globe-spanning graph of friendships to recommend new connections. An online email service analyzes messages and user behavior to optimize ad selection and placement. These are all large-data problems that have been tackled with MapReduce.^9

Another important facet of cloud computing is what's more precisely known as utility computing [129, 31]. As the name implies, the idea behind utility computing is to treat computing resources as a metered service, like electricity or natural gas. The idea harkens back to the days of time-sharing machines, and in truth isn't very different from this antiquated form of computing. Under this model, a "cloud user" can dynamically provision any amount of computing resources from a "cloud provider" on demand and only pay for what is consumed. In practical terms, the user is paying for access to virtual machine instances that run a standard operating system such as Linux. Virtualization technology (e.g., [15]) is used by the cloud provider to allocate available physical resources and enforce isolation between multiple users that may be sharing the same hardware. Once one or more virtual machine instances have been provisioned, the user has full control over the resources and can use them for arbitrary computation. Virtual machines that are no longer needed are destroyed, thereby freeing up physical resources that can be redirected to other users. Resource consumption is measured in some equivalent of machine-hours and users are charged in increments thereof.

^8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.
^9 The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is, of course, Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.

Both users and providers benefit in the utility computing model. Users are freed from the upfront capital investments necessary to build datacenters and the substantial recurring costs of maintaining them. They also gain the important property of elasticity—as demand for computing resources grows, for example, from an unpredicted spike in customers, more resources can be seamlessly allocated from the cloud without an interruption in service. As demand falls, provisioned resources can be released. Prior to the advent of utility computing, coping with unexpected spikes in demand was fraught with challenges: under-provision and run the risk of service interruptions, or over-provision and tie up precious capital in idle machines that are depreciating.

From the utility provider point of view, this business also makes sense because large datacenters benefit from economies of scale and can be run more efficiently than smaller ones. In the same way that insurance works by aggregating risk and redistributing it, utility providers aggregate the computing demands of a large number of users. Although demand may fluctuate significantly for each user, overall trends in aggregate demand should be smooth and predictable, which allows the cloud provider to adjust capacity over time with less risk of either offering too much (resulting in inefficient use of capital) or too little (resulting in unsatisfied customers). In the world of utility computing, Amazon Web Services currently leads the way and remains the dominant player, but a number of other cloud providers populate a market that is becoming increasingly crowded. Most systems are based on proprietary infrastructure, but there is at least one, Eucalyptus [111], that is available as open source. Increased competition will benefit cloud users, but what direct relevance does this have for MapReduce? The connection is quite simple: processing large amounts of data with MapReduce requires access to clusters with sufficient capacity. However, not everyone with large-data problems can afford to purchase and maintain clusters. This is where utility computing comes in: clusters of sufficient size can be provisioned only when the need arises, and users pay only as much as is required to solve their problems. This lowers the barrier to entry for data-intensive processing and makes MapReduce much more accessible.

A generalization of the utility computing concept is “everything as a service”,

which is itself a new take on the age-old idea of outsourcing. A cloud provider oﬀering

customers access to virtual machine instances is said to be oﬀering infrastructure as a

service, or IaaS for short. However, this may be too low level for many users. Enter plat-

form as a service (PaaS), which is a rebranding of what used to be called hosted services

in the “pre-cloud” era. Platform is used generically to refer to any set of well-deﬁned


services on top of which users can build applications, deploy content, etc. This class of

services is best exempliﬁed by Google App Engine, which provides the backend data-

store and API for anyone to build highly-scalable web applications. Google maintains

the infrastructure, freeing the user from having to back up, upgrade, patch, or otherwise

maintain basic services such as the storage layer or the programming environment. At

an even higher level, cloud providers can oﬀer software as a service (SaaS), as exem-

pliﬁed by Salesforce, a leader in customer relationship management (CRM) software.

Other examples include outsourcing an entire organization’s email to a third party,

which is commonplace today.

What does this proliferation of services have to do with MapReduce? No doubt

that “everything as a service” is driven by desires for greater business eﬃciencies, but

scale and elasticity play important roles as well. The cloud allows seamless expansion of

operations without the need for careful planning and supports scales that may otherwise

be diﬃcult or cost-prohibitive for an organization to achieve. Cloud services, just like

MapReduce, represent the search for an appropriate level of abstraction and beneficial

divisions of labor. IaaS is an abstraction over raw physical hardware—an organization

might lack the capital, expertise, or interest in running datacenters, and therefore pays

a cloud provider to do so on its behalf. The argument applies similarly to PaaS and

SaaS. In the same vein, the MapReduce programming model is a powerful abstraction

that separates the what from the how of data-intensive processing.

1.2 BIG IDEAS

Tackling large-data problems requires a distinct approach that sometimes runs counter

to traditional models of computing. In this section, we discuss a number of “big ideas”

behind MapReduce. To be fair, all of these ideas have been discussed in the computer

science literature for some time (some for decades), and MapReduce is certainly not

the ﬁrst to adopt these ideas. Nevertheless, the engineers at Google deserve tremendous

credit for pulling these various threads together and demonstrating the power of these

ideas on a scale previously unheard of.

Scale “out”, not “up”. For data-intensive workloads, a large number of commodity

low-end servers (i.e., the scaling “out” approach) is preferred over a small number of

high-end servers (i.e., the scaling “up” approach). The latter approach of purchasing

symmetric multi-processing (SMP) machines with a large number of processor sockets

(dozens, even hundreds) and a large amount of shared memory (hundreds or even thou-

sands of gigabytes) is not cost eﬀective, since the costs of such machines do not scale

linearly (i.e., a machine with twice as many processors is often signiﬁcantly more than

twice as expensive). On the other hand, the low-end server market overlaps with the


high-volume desktop computing market, which has the eﬀect of keeping prices low due

to competition, interchangeable components, and economies of scale.

Barroso and Hölzle’s recent treatise on what they dubbed “warehouse-scale computers”
[18] contains a thoughtful analysis of the two approaches. The Transaction Processing
Performance Council (TPC) is a neutral, non-profit organization whose mission is to

establish objective database benchmarks. Benchmark data submitted to that organiza-

tion are probably the closest one can get to a fair “apples-to-apples” comparison of cost

and performance for speciﬁc, well-deﬁned relational processing applications. Based on

TPC-C benchmark results from late 2007, a low-end server platform is about four times

more cost eﬃcient than a high-end shared memory platform from the same vendor. Ex-

cluding storage costs, the price/performance advantage of the low-end server increases

to about a factor of twelve.
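The arithmetic behind such a comparison is straightforward. The sketch below shows how a price/performance ratio of this kind is computed; the dollar and throughput figures are invented for illustration, not actual TPC-C submissions.

```python
# Hypothetical price/performance comparison in the spirit of TPC-C.
# All cost and throughput (tpmC) figures below are invented.

def price_performance(total_cost_usd, tpmc):
    """Cost per unit of throughput (dollars per transaction-per-minute)."""
    return total_cost_usd / tpmc

# Invented numbers: a low-end server platform vs. a high-end SMP platform.
low_end = price_performance(total_cost_usd=70_000, tpmc=100_000)        # $0.70/tpmC
high_end = price_performance(total_cost_usd=2_800_000, tpmc=1_000_000)  # $2.80/tpmC

# With these numbers the low-end platform is four times more cost efficient.
advantage = high_end / low_end
```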

What if we take into account the fact that communication between nodes in

a high-end SMP machine is orders of magnitude faster than communication between

nodes in a commodity network-based cluster? Since workloads today are beyond the

capability of any single machine (no matter how powerful), the comparison is more ac-

curately between a smaller cluster of high-end machines and a larger cluster of low-end

machines (network communication is unavoidable in both cases). Barroso and Hölzle

model these two approaches under workloads that demand more or less communication,

and conclude that a cluster of low-end servers approaches the performance of the equiv-

alent cluster of high-end servers—the small performance gap is insuﬃcient to justify the

price premium of the high-end servers. For data-intensive applications, the conclusion

appears to be clear: scaling “out” is superior to scaling “up”, and therefore most existing

implementations of the MapReduce programming model are designed around clusters

of low-end commodity servers.

The capital costs of acquiring servers are, of course, only one component of the total

cost of delivering computing capacity. Operational costs are dominated by the cost of

electricity to power the servers as well as other aspects of datacenter operations that

are functionally related to power: power distribution, cooling, etc. [67, 18]. As a result,

energy eﬃciency has become a key issue in building warehouse-scale computers for

large-data processing. Therefore, it is important to factor in operational costs when

deploying a scale-out solution based on large numbers of commodity servers.

Datacenter eﬃciency is typically factored into three separate components that

can be independently measured and optimized [18]. The ﬁrst component measures how

much of a building’s incoming power is actually delivered to computing equipment, and

correspondingly, how much is lost to the building’s mechanical systems (e.g., cooling,

air handling) and electrical infrastructure (e.g., power distribution ineﬃciencies). The

second component measures how much of a server’s incoming power is lost to the power

supply, cooling fans, etc. The third component captures how much of the power delivered


to computing components (processor, RAM, disk, etc.) is actually used to perform useful

computations.
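Since the three components are independent fractions, overall efficiency is their product. The sketch below makes this concrete; the factor values are illustrative assumptions, not measurements from any datacenter.

```python
# Three-factor datacenter efficiency decomposition described above.
# The example values are assumptions chosen only for illustration.

def useful_power_fraction(facility_eff, server_eff, compute_eff):
    """Fraction of the building's incoming power doing useful computation.

    facility_eff: share of building power reaching computing equipment
                  (the rest is lost to cooling, air handling, distribution)
    server_eff:   share of a server's incoming power reaching its components
                  (the rest is lost to the power supply, fans, etc.)
    compute_eff:  share of component power spent on useful computation
    """
    return facility_eff * server_eff * compute_eff

# Illustrative values: 60% of building power reaches the machines,
# 80% of that survives the power supply, and 50% does useful work.
fraction = useful_power_fraction(0.60, 0.80, 0.50)  # 0.24 overall
```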

Of the three components of datacenter eﬃciency, the ﬁrst two are relatively

straightforward to objectively quantify. Adoption of industry best-practices can help

datacenter operators achieve state-of-the-art eﬃciency. The third component, however,

is much more diﬃcult to measure. One important issue that has been identiﬁed is the

non-linearity between load and power draw. That is, a server at 10% utilization may

draw slightly more than half as much power as a server at 100% utilization (which

means that a lightly-loaded server is much less eﬃcient than a heavily-loaded server).

A survey of ﬁve thousand Google servers over a six-month period shows that servers

operate most of the time at between 10% and 50% utilization [17], which is an energy-

inefficient operating region. As a result, Barroso and Hölzle have advocated for research

and development in energy-proportional machines, where energy consumption would

be proportional to load, such that an idle processor would (ideally) consume no power,

yet retain the ability to power up (nearly) instantaneously in response to demand.
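A simple linear model captures the load/power non-linearity just described: power draw is an idle baseline plus a load-proportional term. The 50% idle fraction and 500 W peak below are assumptions chosen to match the behavior described in the text (a server at 10% load drawing just over half its peak power).

```python
# Linear model of server power draw: idle baseline + load-proportional term.
# The idle fraction and peak wattage are illustrative assumptions.

def power_draw(utilization, idle_fraction=0.50, peak_watts=500.0):
    """Watts drawn at a given utilization (0.0 to 1.0)."""
    return peak_watts * (idle_fraction + (1.0 - idle_fraction) * utilization)

def energy_efficiency(utilization):
    """Useful work per watt, normalized so that full load has efficiency 1.0."""
    return (utilization * power_draw(1.0)) / power_draw(utilization)

lightly_loaded = power_draw(0.10)  # 275 W: over half of the 500 W peak
fully_loaded = power_draw(1.00)    # 500 W
```

An energy-proportional machine would correspond to an idle fraction of zero, making efficiency constant across the utilization range.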

Although we have provided a brief overview here, datacenter eﬃciency is a topic

that is beyond the scope of this book. For more details, consult Barroso and Hölzle [18]

and Hamilton [67], who provide detailed cost models for typical modern datacenters.

However, even factoring in operational costs, evidence suggests that scaling out remains

more attractive than scaling up.

Assume failures are common. At warehouse scale, failures are not only inevitable,

but commonplace. A simple calculation suﬃces to demonstrate: let us suppose that a

cluster is built from reliable machines with a mean time between failures (MTBF) of

1000 days (about three years). Even with these reliable servers, a 10,000-server cluster

would still experience roughly 10 failures a day. For the sake of argument, let us suppose

that a MTBF of 10,000 days (about thirty years) were achievable at realistic costs (which

is unlikely). Even then, a 10,000-server cluster would still experience one failure daily.

This means that any large-scale service that is distributed across a large cluster (either

a user-facing application or a computing platform like MapReduce) must cope with

hardware failures as an intrinsic aspect of its operation [66]. That is, a server may fail at

any time, without notice. For example, in large clusters disk failures are common [123]

and RAM experiences more errors than one might expect [135]. Datacenters suﬀer

from both planned outages (e.g., system maintenance and hardware upgrades) and

unexpected outages (e.g., power failure, connectivity loss, etc.).
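The failure arithmetic above is simple enough to check directly: treating failures as independent, a cluster of n servers with a per-machine MTBF of m days sees roughly n/m failures per day.

```python
# The book's back-of-the-envelope failure arithmetic: with n independent
# servers each having a mean time between failures of mtbf_days, the
# cluster experiences roughly n / mtbf_days failures per day.

def expected_failures_per_day(num_servers, mtbf_days):
    return num_servers / mtbf_days

reliable = expected_failures_per_day(10_000, 1_000)        # ~10 failures/day
very_reliable = expected_failures_per_day(10_000, 10_000)  # ~1 failure/day
```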

A well-designed, fault-tolerant service must cope with failures up to a point with-

out impacting the quality of service—failures should not result in inconsistencies or in-

determinism from the user perspective. As servers go down, other cluster nodes should

seamlessly step in to handle the load, and overall performance should gracefully degrade

as server failures pile up. Just as important, a broken server that has been repaired


should be able to seamlessly rejoin the service without manual reconﬁguration by the

administrator. Mature implementations of the MapReduce programming model are able

to robustly cope with failures through a number of mechanisms such as automatic task

restarts on diﬀerent cluster nodes.

Move processing to the data. In traditional high-performance computing (HPC)

applications (e.g., for climate or nuclear simulations), it is commonplace for a supercom-

puter to have “processing nodes” and “storage nodes” linked together by a high-capacity

interconnect. Many data-intensive workloads are not very processor-demanding, which

means that the separation of compute and storage creates a bottleneck in the network.

As an alternative to moving data around, it is more eﬃcient to move the process-

ing around. That is, MapReduce assumes an architecture where processors and storage

(disk) are co-located. In such a setup, we can take advantage of data locality by running

code on the processor directly attached to the block of data we need. The distributed

ﬁle system is responsible for managing the data over which MapReduce operates.

Process data sequentially and avoid random access. Data-intensive processing

by deﬁnition means that the relevant datasets are too large to ﬁt in memory and must

be held on disk. Seek times for random disk access are fundamentally limited by the

mechanical nature of the devices: read heads can only move so fast and platters can only

spin so rapidly. As a result, it is desirable to avoid random data access, and instead orga-

nize computations so that data is processed sequentially. A simple scenario [10] poignantly
illustrates the large performance gap between sequential operations and random seeks:
assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable as-
sumptions about disk latency and throughput, a back-of-the-envelope calculation will
show that updating 1% of the records (by accessing and then mutating each record)
will take about a month on a single machine. On the other hand, if one simply reads the
entire database and rewrites all the records (mutating those that need updating), the
process would finish in under a work day on a single machine. Sequential data access
is, literally, orders of magnitude faster than random data access. [11]
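The calculation can be reproduced under stated assumptions: roughly 30 ms per random record update (seek, read, seek back, write) and roughly 100 MB/s of sustained sequential throughput. Both device figures are assumptions plausible for disks of the era, not measurements.

```python
# Back-of-the-envelope comparison of random vs. sequential disk access.
# The per-update latency and sequential throughput are assumptions.

TOTAL_RECORDS = 10**10
RECORD_BYTES = 100
UPDATE_FRACTION = 0.01
RANDOM_UPDATE_SECONDS = 0.030        # per-record seek/read/seek/write
SEQUENTIAL_BYTES_PER_SECOND = 100e6  # ~100 MB/s sustained

# Random access: touch 1% of the records individually.
random_seconds = TOTAL_RECORDS * UPDATE_FRACTION * RANDOM_UPDATE_SECONDS
random_days = random_seconds / 86_400          # roughly 35 days: about a month

# Sequential rewrite: read the whole database and write it back.
total_bytes = TOTAL_RECORDS * RECORD_BYTES     # 1 terabyte
sequential_seconds = 2 * total_bytes / SEQUENTIAL_BYTES_PER_SECOND
sequential_hours = sequential_seconds / 3_600  # under six hours: a work day
```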

The development of solid-state drives is unlikely to change this balance for at

least two reasons. First, the cost diﬀerential between traditional magnetic disks and

solid-state disks remains substantial: large datasets will for the most part remain on
mechanical drives, at least in the near future. Second, although solid-state disks have

substantially faster seek times, order-of-magnitude diﬀerences in performance between

sequential and random access still remain.

MapReduce is primarily designed for batch processing over large datasets. To the

extent possible, all computations are organized into long streaming operations that

[10] Adapted from a post by Ted Dunning on the Hadoop mailing list.
[11] For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.

take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of

MapReduce’s design explicitly trade latency for throughput.

Hide system-level details from the application developer. According to many

guides on the practice of software engineering written by experienced industry profes-

sionals, one of the key reasons why writing code is diﬃcult is because the programmer

must simultaneously keep track of many details in short-term memory—ranging from

the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algo-

rithm that requires special treatment). This imposes a high cognitive load and requires

intense concentration, which leads to a number of recommendations about a program-

mer’s environment (e.g., quiet oﬃce, comfortable furniture, large monitors, etc.). The

challenges in writing distributed software are greatly compounded—the programmer

must manage details across several threads, processes, or machines. Of course, the

biggest headache in distributed programming is that code runs concurrently in un-

predictable orders, accessing data in unpredictable patterns. This gives rise to race

conditions, deadlocks, and other well-known problems. Programmers are taught to use

low-level devices such as mutexes and to apply high-level “design patterns” such as

producer–consumer queues to tackle these challenges, but the truth remains: concur-

rent programs are notoriously diﬃcult to reason about and even harder to debug.

MapReduce addresses the challenges of distributed programming by providing an

abstraction that isolates the developer from system-level details (e.g., locking of data

structures, data starvation issues in the processing pipeline, etc.). The programming

model speciﬁes simple and well-deﬁned interfaces between a small number of compo-

nents, and therefore is easy for the programmer to reason about. MapReduce maintains

a separation of what computations are to be performed and how those computations are

actually carried out on a cluster of machines. The ﬁrst is under the control of the pro-

grammer, while the second is exclusively the responsibility of the execution framework

or “runtime”. The advantage is that the execution framework only needs to be de-

signed once and veriﬁed for correctness—thereafter, as long as the developer expresses

computations in the programming model, code is guaranteed to behave as expected.

The upshot is that the developer is freed from having to worry about system-level de-

tails (e.g., no more debugging race conditions and addressing lock contention) and can

instead focus on algorithm or application design.
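The division of labor can be made concrete with the canonical word-count example, treated fully in Chapter 2. In the sketch below, the map and reduce functions are the “what” supplied by the programmer, while a toy in-memory driver stands in for the “how” owned by the execution framework. The driver is our simplification for illustration, not part of any real MapReduce runtime.

```python
from collections import defaultdict

# Word count: the programmer supplies only map_fn and reduce_fn (the
# "what"); run_job is a toy stand-in for the execution framework's "how".

def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_job(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():       # map phase
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    result = {}
    for key in groups:                           # reduce phase
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result

counts = run_job({"d1": "a rose is a rose", "d2": "a daisy"})
# counts == {"a": 3, "rose": 2, "is": 1, "daisy": 1}
```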

Seamless scalability. For data-intensive processing, it goes without saying that scal-

able algorithms are highly desirable. As an aspiration, let us sketch the behavior of an

ideal algorithm. We can define scalability along at least two dimensions. [12] First, in terms

of data: given twice the amount of data, the same algorithm should take at most twice

as long to run, all else being equal. Second, in terms of resources: given a cluster twice

[12] See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.


the size, the same algorithm should take no more than half as long to run. Furthermore,

an ideal algorithm would maintain these desirable scaling characteristics across a wide

range of settings: on data ranging from gigabytes to petabytes, on clusters consisting

of a few to a few thousand machines. Finally, the ideal algorithm would exhibit these

desired behaviors without requiring any modiﬁcations whatsoever, not even tuning of

parameters.

Other than for embarrassingly parallel problems, algorithms with the character-

istics sketched above are, of course, unobtainable. One of the fundamental assertions

in Fred Brooks’s classic The Mythical Man-Month [28] is that adding programmers to a

project behind schedule will only make it fall further behind. This is because complex

tasks cannot be chopped into smaller pieces and allocated in a linear fashion, a point
often illustrated with a cute quote: “nine women cannot have a baby in one month”.

Although Brooks’s observations are primarily about software engineers and the software
development process, the same is also true of algorithms: increasing the degree

of parallelization also increases communication costs. The algorithm designer is faced

with diminishing returns, and beyond a certain point, greater eﬃciencies gained by

parallelization are entirely oﬀset by increased communication requirements.
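A toy model exhibits these diminishing returns: total running time is compute time (which parallelizes) plus communication overhead (which grows with the number of workers). The constants are arbitrary assumptions chosen only to show the shape of the tradeoff.

```python
# Toy model of parallelization with communication overhead: compute time
# shrinks as 1/workers while coordination cost grows linearly with them.
# The work and per-worker communication constants are invented.

def running_time(workers, work=1000.0, comm_per_worker=1.0):
    return work / workers + comm_per_worker * workers

def speedup(workers):
    return running_time(1) / running_time(workers)

# Speedup improves, peaks, then degrades as communication dominates.
few = speedup(10)    # ~9x
sweet = speedup(32)  # ~16x, near the optimum of sqrt(work/comm) workers
many = speedup(500)  # ~2x: parallelism gains offset by communication
```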

Nevertheless, these fundamental limitations shouldn’t prevent us from at least

striving for the unobtainable. The truth is that most current algorithms are far from

the ideal. In the domain of text processing, for example, most algorithms today assume

that data ﬁts in memory on a single machine. For the most part, this is a fair assumption.

But what happens when the amount of data doubles in the near future, and then doubles

again shortly thereafter? Simply buying more memory is not a viable solution, as the

amount of data is growing faster than the price of memory is falling. Furthermore, the

price of a machine does not scale linearly with the amount of available memory beyond

a certain point (once again, the scaling “up” vs. scaling “out” argument). Quite simply,

algorithms that require holding intermediate data in memory on a single machine will

simply break on suﬃciently-large datasets—moving from a single machine to a cluster

architecture requires fundamentally diﬀerent algorithms (and reimplementations).

Perhaps the most exciting aspect of MapReduce is that it represents a small step

toward algorithms that behave in the ideal manner discussed above. Recall that the

programming model maintains a clear separation between what computations need to

occur and how those computations are actually orchestrated on a cluster. As a result,

a MapReduce algorithm remains ﬁxed, and it is the responsibility of the execution

framework to execute the algorithm. Amazingly, the MapReduce programming model

is simple enough that it is actually possible, in many circumstances, to approach the

ideal scaling characteristics discussed above. We introduce the idea of the “tradeable

machine hour”, as a play on Brooks’s classic title. If running an algorithm on a particular

dataset takes 100 machine hours, then we should be able to ﬁnish in an hour on a cluster


of 100 machines, or use a cluster of 10 machines to complete the same task in ten hours. [13]
With MapReduce, this isn’t so far from the truth, at least for some applications.
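The tradeable machine hour is simple arithmetic: for an ideally scalable job, wall-clock time is total machine-hours divided by cluster size, while cost under utility pricing depends only on machine-hours consumed. The hourly rate below is an invented figure for illustration.

```python
# The "tradeable machine hour" in arithmetic form. The per-hour price
# is an invented figure; only the proportionality matters.

def wall_clock_hours(total_machine_hours, machines):
    return total_machine_hours / machines

def cost_usd(total_machine_hours, usd_per_machine_hour=0.10):
    return total_machine_hours * usd_per_machine_hour

fast = wall_clock_hours(100, machines=100)  # one hour on 100 machines
slow = wall_clock_hours(100, machines=10)   # ten hours on 10 machines
price = cost_usd(100)                       # identical cost either way
```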

1.3 WHY IS THIS DIFFERENT?

“Due to the rapidly decreasing cost of processing, memory, and communica-

tion, it has appeared inevitable for at least two decades that parallel machines

will eventually displace sequential ones in computationally intensive domains.

This, however, has not happened.” — Leslie Valiant [148] [14]

For several decades, computer scientists have predicted that the dawn of the age of

parallel computing was “right around the corner” and that sequential processing would

soon fade into obsolescence (consider, for example, the above quote). Yet, until very re-

cently, they have been wrong. The relentless progress of Moore’s Law for several decades

has ensured that most of the world’s problems could be solved by single-processor ma-

chines, save the needs of a few (scientists simulating molecular interactions or nuclear

reactions, for example). Couple that with the inherent challenges of concurrency, and

the result has been that parallel processing and distributed systems have largely been

conﬁned to a small segment of the market and esoteric upper-level electives in the

computer science curriculum.

However, all of that changed around the middle of the ﬁrst decade of this cen-

tury. The manner in which the semiconductor industry had been exploiting Moore’s

Law simply ran out of opportunities for improvement: faster clocks, deeper pipelines,

superscalar architectures, and other tricks of the trade reached a point of diminish-

ing returns that did not justify continued investment. This marked the beginning of

an entirely new strategy and the dawn of the multi-core era [115]. Unfortunately, this

radical shift in hardware architecture was not matched at that time by corresponding

advances in how software could be easily designed for these new processors (but not for

lack of trying [104]). Nevertheless, parallel processing became an important issue at the

forefront of everyone’s mind—it represented the only way forward.

At around the same time, we witnessed the growth of large-data problems. In the

late 1990s and even during the beginning of the ﬁrst decade of this century, relatively

few organizations had data-intensive processing needs that required large clusters: a

handful of internet companies and perhaps a few dozen large corporations. But then,

everything changed. Through a combination of many diﬀerent factors (falling prices of

disks, rise of user-generated web content, etc.), large-data problems began popping up

everywhere. Data-intensive processing needs became widespread, which drove innova-

tions in distributed computing such as MapReduce—ﬁrst by Google, and then by Yahoo

[13] Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.
[14] Guess when this was written? You may be surprised.


and the open source community. This in turn created more demand: when organiza-

tions learned about the availability of eﬀective data analysis tools for large datasets,

they began instrumenting various business processes to gather even more data—driven

by the belief that more data leads to deeper insights and greater competitive advantages.

Today, not only are large-data problems ubiquitous, but technological solutions for ad-

dressing them are widely accessible. Anyone can download the open source Hadoop

implementation of MapReduce, pay a modest fee to rent a cluster from a utility cloud

provider, and be happily processing terabytes upon terabytes of data within the week.

Finally, the computer scientists are right—the age of parallel computing has begun,

both in terms of multiple cores in a chip and multiple machines in a cluster (each of

which often has multiple cores).

Why is MapReduce important? In practical terms, it provides a very eﬀective tool

for tackling large-data problems. But beyond that, MapReduce is important in how it

has changed the way we organize computations at a massive scale. MapReduce repre-

sents the ﬁrst widely-adopted step away from the von Neumann model that has served

as the foundation of computer science for over half a century. Valiant called this

a bridging model [148], a conceptual bridge between the physical implementation of a

machine and the software that is to be executed on that machine. Until recently, the

von Neumann model has served us well: Hardware designers focused on eﬃcient imple-

mentations of the von Neumann model and didn’t have to think much about the actual

software that would run on the machines. Similarly, the software industry developed

software targeted at the model without worrying about the hardware details. The result

was extraordinary growth: chip designers churned out successive generations of increas-

ingly powerful processors, and software engineers were able to develop applications in

high-level languages that exploited those processors.

Today, however, the von Neumann model isn’t suﬃcient anymore: we can’t treat

a multi-core processor or a large cluster as an agglomeration of many von Neumann

machine instances communicating over some interconnect. Such a view places too much

burden on the software developer to eﬀectively take advantage of available computa-

tional resources—it simply is the wrong level of abstraction. MapReduce can be viewed

as the ﬁrst breakthrough in the quest for new abstractions that allow us to organize

computations, not over individual machines, but over entire clusters. As Barroso puts

it, the datacenter is the computer [18, 119].

To be fair, MapReduce is certainly not the ﬁrst model of parallel computation

that has been proposed. The most prevalent model in theoretical computer science,

which dates back several decades, is the PRAM [77, 60]. [15] In this model, an arbitrary

number of processors, sharing an unboundedly large memory, operate synchronously on

a shared input to produce some output. Other models include LogP [43] and BSP [148].

[15] More than a theoretical model, the PRAM has recently been prototyped in hardware [153].


For reasons that are beyond the scope of this book, none of these previous models have

enjoyed the success that MapReduce has in terms of adoption and in terms of impact

on the daily lives of millions of users. [16]

MapReduce is the most successful abstraction over large-scale computational re-

sources we have seen to date. However, as anyone who has taken an introductory

computer science course knows, abstractions manage complexity by hiding details and

presenting well-deﬁned behaviors to users of those abstractions. They, inevitably, are

imperfect—making certain tasks easier but others more diﬃcult, and sometimes, im-

possible (in the case where the detail suppressed by the abstraction is exactly what

the user cares about). This critique applies to MapReduce: it makes certain large-data

problems easier, but suﬀers from limitations as well. This means that MapReduce is

not the ﬁnal word, but rather the ﬁrst in a new class of programming models that will

allow us to more eﬀectively organize computations at a massive scale.

So if MapReduce is only the beginning, what’s next beyond MapReduce? We’re

getting ahead of ourselves, as we can’t meaningfully answer this question before thor-

oughly understanding what MapReduce can and cannot do well. This is exactly the

purpose of this book: let us now begin our exploration.

1.4 WHAT THIS BOOK IS NOT

Actually, not quite yet... A final word before we get started. This book is about
MapReduce algorithm design, particularly for text processing (and related) applications.

Although our presentation most closely follows the Hadoop open-source implementation
of MapReduce, this book is explicitly not about Hadoop programming. We don’t,
for example, discuss APIs, command-line invocations for running jobs, etc. For those

aspects, we refer the reader to Tom White’s excellent book, “Hadoop: The Deﬁnitive

Guide”, published by O’Reilly [154].

[16] Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM algorithms can be efficiently simulated via MapReduce.

CHAPTER 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and con-

quer, a fundamental concept in computer science that is introduced very early in typical

undergraduate curricula. The basic idea is to partition a large problem into smaller sub-

problems. To the extent that the sub-problems are independent [5], they can be tackled

in parallel by diﬀerent workers—threads in a processor core, cores in a multi-core pro-

cessor, multiple processors in a machine, or many machines in a cluster. Intermediate

results from each individual worker are then combined to yield the final output. [1]
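The pattern just described can be sketched in a few lines: partition a large input into independent sub-problems, hand them to parallel workers, and combine the intermediate results. Threads stand in here for the workers of the text; the same shape applies to cores, processes, or machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Minimal divide-and-conquer sketch: partition, solve in parallel, combine.
# Summation is an arbitrary choice of sub-problem for illustration.

def partition(data, num_chunks):
    """Split data into num_chunks contiguous, independent pieces."""
    chunk = (len(data) + num_chunks - 1) // num_chunks
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def solve(chunk):
    """Worker: tackle one independent sub-problem."""
    return sum(chunk)

def divide_and_conquer(data, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve, partition(data, workers)))
    return sum(partials)  # combine intermediate results into final output

total = divide_and_conquer(list(range(1, 101)))  # 5050
```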

The general principles behind divide-and-conquer algorithms are broadly applica-

ble to a wide range of problems in many diﬀerent application domains. However, the

details of their implementations are varied and complex. For example, the following are

just some of the issues that need to be addressed:

• How do we break up a large problem into smaller tasks? More speciﬁcally, how do

we decompose the problem so that the smaller tasks can be executed in parallel?

• How do we assign tasks to workers distributed across a potentially large number

of machines (while keeping in mind that some workers are better suited to running

some tasks than others, e.g., due to available resources, locality constraints, etc.)?

• How do we ensure that the workers get the data they need?

• How do we coordinate synchronization among the diﬀerent workers?

• How do we share partial results from one worker that are needed by another?

• How do we accomplish all of the above in the face of software errors and hardware

faults?

In traditional parallel or distributed programming environments, the developer

needs to explicitly address many (and sometimes, all) of the above issues. In shared

memory programming, the developer needs to explicitly coordinate access to shared

data structures through synchronization primitives such as mutexes, to explicitly han-

dle process synchronization through devices such as barriers, and to remain ever vigilant

for common problems such as deadlocks and race conditions. Language extensions, like

[1] We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real-world problems.


OpenMP for shared memory parallelism, [2] or libraries implementing the Message
Passing Interface (MPI) for cluster-level parallelism, [3] provide logical abstractions that hide

details of operating system synchronization and communications primitives. However,

even with these extensions, developers are still burdened to keep track of how resources

are made available to workers. Additionally, these frameworks are mostly designed to

tackle processor-intensive problems and have only rudimentary support for dealing with

very large amounts of input data. When using existing parallel computing approaches

for large-data computation, the programmer must devote a signiﬁcant amount of at-

tention to low-level system details, which detracts from higher-level problem solving.

One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the programmer. Therefore, a developer can focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity). However, organizing and coordinating large amounts of computation is only part of the challenge. Large-data processing by definition requires bringing data and code together for computation to occur—no small feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data. This is operationally realized by spreading data across the local disks of nodes in a cluster and running processes on nodes that hold the data. The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce.

This chapter introduces the MapReduce programming model and the underlying distributed file system. We start in Section 2.1 with an overview of functional programming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow. MapReduce would not be practical without a tightly-integrated distributed file system that manages the data being processed; Section 2.5 covers this in detail. Tying everything together, a complete cluster architecture is described in Section 2.6 before the chapter ends with a summary.


Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional programming: map takes a function f and applies it to every element in a list, while fold iteratively applies a function g to aggregate results.

2.1 FUNCTIONAL PROGRAMMING ROOTS

MapReduce has its roots in functional programming, which is exemplified in languages such as Lisp and ML. (There are, however, important characteristics of MapReduce that make it non-functional in nature; this will become apparent later.) A key feature of functional languages is the concept of higher-order functions, or functions that can accept other functions as arguments. Two common built-in higher-order functions are map and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in a list (the top part of the diagram). Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied to the initial value and the first item in the list, the result of which is stored in an intermediate variable. This intermediate variable and the next item in the list serve as the arguments to a second application of g, the results of which are stored in the intermediate variable. This process repeats until all items in the list have been consumed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. For example, to compute the sum of squares of a list of integers, one could map a function that squares its argument (i.e., λx.x^2) over the input list, and then fold the resulting list with the addition function (more precisely, λxλy.x + y) using an initial value of zero.
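The sum-of-squares example can be sketched in a few lines of Python, which here stands in for Lisp or ML: the built-in `map` and `functools.reduce` play the roles of map and fold, respectively.

```python
from functools import reduce

xs = [1, 2, 3, 4, 5]

# map: apply f = λx.x^2 to every element of the list
squares = list(map(lambda x: x * x, xs))

# fold: aggregate with g = λxλy.x + y, using an initial value of zero
sum_of_squares = reduce(lambda x, y: x + y, squares, 0)

print(sum_of_squares)  # 55
```

Note that each application of f above is independent of the others, while the fold threads an intermediate value through the whole list.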

We can view map as a concise way to represent the transformation of a dataset (as defined by the function f). In the same vein, we can view fold as an aggregation operation, as defined by the function g. One immediate observation is that the application of f to each item in a list (or, more generally, to elements in a large dataset) can be parallelized in a straightforward manner, since each functional application happens in isolation. In a cluster, these operations can be distributed across many different machines. The fold operation, on the other hand, has more restrictions on data locality—elements in the list must be “brought together” before the function g can be applied. However, many real-world applications do not require g to be applied to all elements of the list. To the extent that elements in the list can be divided into groups, the fold aggregations can also proceed in parallel. Furthermore, for operations that are commutative and associative, significant efficiencies can be gained in the fold operation through local aggregation and appropriate reordering.

In a nutshell, we have described MapReduce. The map phase in MapReduce

roughly corresponds to the map operation in functional programming, whereas the

reduce phase in MapReduce roughly corresponds to the fold operation in functional

programming. As we will discuss in detail shortly, the MapReduce execution framework

coordinates the map and reduce phases of processing over large amounts of data on

large clusters of commodity machines.

Viewed from a slightly different angle, MapReduce codifies a generic “recipe” for processing large datasets that consists of two stages. In the first stage, a user-specified computation is applied over all input records in a dataset. These operations occur in parallel and yield intermediate output that is then aggregated by another user-specified computation. The programmer defines these two types of computations, and the execution framework coordinates the actual processing (very loosely, MapReduce provides a functional abstraction). Although such a two-stage processing structure may appear to be very restrictive, many interesting algorithms can be expressed quite concisely—especially if one decomposes complex algorithms into a sequence of MapReduce jobs. Subsequent chapters in this book focus on how a number of algorithms can be implemented in MapReduce.

To be precise, MapReduce can refer to three distinct but related concepts. First, MapReduce is a programming model, which is the sense discussed above. Second, MapReduce can refer to the execution framework (i.e., the “runtime”) that coordinates the execution of programs written in this particular style. Finally, MapReduce can refer to the software implementation of the programming model and the execution framework: for example, Google’s proprietary implementation vs. the open-source Hadoop implementation in Java. And in fact, there are many implementations of MapReduce, e.g., targeted specifically for multi-core processors [127], for GPGPUs [71], for the CELL architecture [126], etc. There are some differences between the MapReduce programming model implemented in Hadoop and Google’s proprietary implementation, which we will explicitly discuss throughout the book. However, we take a rather Hadoop-centric view of MapReduce, since Hadoop remains the most mature and accessible implementation to date, and therefore the one most developers are likely to use.


2.2 MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. Keys and values may be primitives such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers typically need to define their own custom data types, although a number of libraries such as Protocol Buffers (http://code.google.com/p/protobuf/), Thrift (http://incubator.apache.org/thrift/), and Avro (http://hadoop.apache.org/avro/) simplify the task.

Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content. For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly meaningful and are simply ignored during processing, while in other cases input keys are used to uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex keys and values in the design of various algorithms.

In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

The convention [...] is used throughout this book to denote a list. The input to a MapReduce job starts as data stored on the underlying distributed file system (see Section 2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs. (This characterization, while conceptually accurate, is a slight simplification; see Section 2.6 for more details.) Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers. For the most part, there is no need to consolidate reducer output, since the r files often serve as input to yet another MapReduce job. Figure 2.2 illustrates this two-stage processing structure.

A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., a probability distribution over words in a collection).

Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)

1: class Reducer
2:   method Reduce(term t, counts [c1, c2, . . .])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       sum ← sum + c
6:     Emit(term t, count sum)

Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.

Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document, and the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we’ve seen the word once). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key, and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted alphabetically, and each file will contain roughly the same number of words. The partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The output can be examined by the programmer or used as input to another MapReduce program.
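The entire flow just described, mapping over (docid, doc) pairs, grouping intermediate pairs by key, then reducing, can be simulated in a few lines of Python. This is a sketch of the programming model, not of Hadoop itself; the sort-then-group step stands in for the distributed “shuffle and sort”.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(docid, doc):
    # Emit an intermediate (word, 1) pair for every token in the document.
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):
    # Sum all partial counts associated with one word.
    return (word, sum(counts))

def run_job(documents):
    # Map phase: apply the mapper to every input key-value pair.
    intermediate = []
    for docid, doc in documents:
        intermediate.extend(map_fn(docid, doc))
    # "Shuffle and sort": bring together all values with the same key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per intermediate key.
    return [reduce_fn(word, [c for _, c in group])
            for word, group in groupby(intermediate, key=itemgetter(0))]

counts = run_job([("d1", "a rose is a rose"), ("d2", "is a rose")])
print(counts)  # [('a', 3), ('is', 2), ('rose', 3)]
```

In a real cluster, of course, the map calls run on many machines and the sort is distributed, but the input/output contract is the same.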

There are some differences between the Hadoop implementation of MapReduce and Google’s implementation (personal communication, Jeff Dean). In Hadoop, the reducer is presented with a key and an iterator over all values associated with that key. The values are arbitrarily ordered. Google’s implementation allows the programmer to specify a secondary sort key for ordering the values (if desired)—in which case values associated with each key would be presented to the developer’s reduce code in sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary sorting. Another difference: in Google’s implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys).

To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split) and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework (see next section) makes the final determination based on the physical layout of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the number of reduce tasks. We will return to the details of Hadoop job execution in Section 2.6, which depend on an understanding of the distributed file system (covered in Section 2.5). To reiterate: although the presentation of algorithms in this book closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm design and conceptual understanding—not actual Hadoop programming. For that, we would recommend Tom White’s book [154].

What are the restrictions on mappers and reducers? Mappers and reducers can express arbitrary computations over their inputs. However, one must generally be careful about the use of external resources, since multiple mappers or reducers may be contending for those resources. For example, it may be unwise for a mapper to query an external SQL database, since that would introduce a scalability bottleneck on the number of map tasks that could be run in parallel, as they might all be simultaneously querying the database (unless, of course, the database itself is highly scalable). In general, mappers can emit an arbitrary number of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs. Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type from the intermediate key-value pairs. Although not permitted in functional programming, mappers and reducers can have side effects. This is a powerful and useful feature: for example, preserving state across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such algorithms can be understood as having side effects that only change state that is internal to the mapper or reducer. While the correctness of such algorithms may be more difficult to guarantee (since the function’s behavior depends not only on the current input but on previous inputs), most potential synchronization problems are avoided since internal state is private to individual mappers and reducers. In other cases (see Section 4.4 and Section 6.5), it may be useful for mappers or reducers to have external side effects, such as writing files to the distributed file system. Since many mappers and reducers are run in parallel, and the distributed file system is a shared global resource, special care must be taken to ensure that such operations avoid synchronization conflicts. One strategy is to write a temporary file that is renamed upon successful completion of the mapper or reducer [45].
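The temporary-file strategy can be sketched as follows; plain Python and a local directory stand in for the distributed file system, and the function name is invented for illustration.

```python
import os
import tempfile

def write_output_atomically(final_path, lines):
    # Write to a temporary file first, so that a failed or killed task
    # never leaves a partial file visible under the final name.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines))
    # Rename only on successful completion; on POSIX file systems the
    # rename is atomic, so readers see either no file or the complete file.
    os.replace(tmp_path, final_path)

write_output_atomically("part-00000", ["a\t3", "is\t2", "rose\t3"])
```

A distributed file system that supports atomic rename can apply the same trick at cluster scale, which is exactly the strategy cited above [45].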

In addition to the “canonical” MapReduce processing flow, other variations are also possible. MapReduce programs can contain no reducers, in which case mapper output is directly written to disk (one file per mapper). For embarrassingly parallel problems, e.g., parsing a large text collection or independently analyzing a large number of images, this is a common pattern. The converse—a MapReduce program with no mappers—is not possible, although in some cases it is useful for the mapper to implement the identity function and simply pass input key-value pairs to the reducers. This has the effect of sorting and regrouping the input for reduce-side processing. Similarly, in some cases it is useful for the reducer to implement the identity function, in which case the program simply sorts and groups mapper output. Finally, running identity mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful).

Although in the most common case input to a MapReduce job comes from data stored on the distributed file system and output is written back to the distributed file system, any other system that satisfies the proper abstractions can serve as a data source or sink. With Google’s MapReduce implementation, BigTable [34], a sparse, distributed, persistent multidimensional sorted map, is frequently used as a source of input and as a store of MapReduce output. HBase is an open-source BigTable clone and has similar capabilities. Also, Hadoop has been integrated with existing MPP (massively parallel processing) relational databases, which allows a programmer to write MapReduce jobs over database rows and dump output into a new database table. Finally, in some cases MapReduce jobs may not consume any input at all (e.g., computing π) or may only consume a small amount of data (e.g., input parameters to many instances of processor-intensive simulations running in parallel).

2.3 THE EXECUTION FRAMEWORK

One of the most important ideas behind MapReduce is separating the what of distributed processing from the how. A MapReduce program, referred to as a job, consists of code for mappers and reducers (as well as combiners and partitioners, to be discussed in the next section) packaged together with configuration parameters (such as where the input lies and where the output should be stored). The developer submits the job to the submission node of a cluster (in Hadoop, this is called the jobtracker) and the execution framework (sometimes called the “runtime”) takes care of everything else: it transparently handles all other aspects of distributed code execution, on clusters ranging from a single node to a few thousand nodes. Specific responsibilities include:

Scheduling. Each MapReduce job is divided into smaller units called tasks (see Section 2.6 for more details). For example, a map task may be responsible for processing a certain block of input key-value pairs (called an input split in Hadoop); similarly, a reduce task may handle a portion of the intermediate key space. It is not uncommon for MapReduce jobs to have thousands of individual tasks that need to be assigned to nodes in the cluster. In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently, making it necessary for the scheduler to maintain some sort of a task queue and to track the progress of running tasks so that waiting tasks can be assigned to nodes as they become available. Another aspect of scheduling involves coordination among tasks belonging to different jobs (e.g., from different users). How can a large, shared resource support several users simultaneously in a predictable, transparent, policy-driven fashion? There has been some recent work along these lines in the context of Hadoop [131, 160].

Speculative execution is an optimization implemented by both Hadoop and Google’s MapReduce implementation (called “backup tasks” [45]). Due to the barrier between the map and reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion time of a job is bounded by the running time of the slowest reduce task. As a result, the speed of a MapReduce job is sensitive to what are known as stragglers, or tasks that take an unusually long time to complete. One cause of stragglers is flaky hardware: for example, a machine that is suffering from recoverable errors may become significantly slower. With speculative execution, an identical copy of the same task is executed on a different machine, and the framework simply uses the result of the first task attempt to finish. Zaharia et al. [161] presented different execution strategies in a recent paper, and Google has reported that speculative execution can improve job running times by 44% [45]. Although in Hadoop both map and reduce tasks can be speculatively executed, the common wisdom is that the technique is more helpful for map tasks than reduce tasks, since each copy of the reduce task needs to pull data over the network. Note, however, that speculative execution cannot adequately address another common cause of stragglers: skew in the distribution of values associated with intermediate keys (leading to reduce stragglers). In text processing we often observe Zipfian distributions, which means that the task or tasks responsible for processing the most frequent few elements will run much longer than the typical task. Better local aggregation, discussed in the next chapter, is one possible solution to this problem.
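The effect of a straggler, and of a speculative backup copy, on phase completion time can be illustrated with a toy calculation (the task durations here are invented):

```python
# Per-task running times in seconds; t4 is a straggler on flaky hardware.
task_times = {"t1": 20, "t2": 22, "t3": 21, "t4": 95, "t5": 23}

# Without speculation, the phase is only as fast as its slowest task.
baseline = max(task_times.values())  # 95 seconds

# With speculation, a backup copy of the straggler runs on another machine;
# the framework uses whichever attempt finishes first.
backup = {"t4": 24}
effective = {t: min(d, backup.get(t, d)) for t, d in task_times.items()}
speculative = max(effective.values())  # 24 seconds
```

Note that if the long running time were caused by key skew rather than a slow machine, the backup copy would process the same data and take just as long, which is why speculation does not help with skew.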

Data/code co-location. The phrase data distribution is misleading, since one of the key ideas behind MapReduce is to move the code, not the data. However, the more general point remains—in order for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed file system (in the canonical case, that is; recall that MapReduce may receive its input from other sources). To achieve data locality, the scheduler starts tasks on the node that holds a particular block of data (i.e., on its local drive) needed by the task. This has the effect of moving code to the data. If this is not possible (e.g., a node is already running too many tasks), new tasks will be started elsewhere, and the necessary data will be streamed over the network. An important optimization here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.
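The placement preference just described (node-local, then rack-local, then anywhere) can be sketched as a toy function; the data structures and names here are invented for illustration and bear no relation to any real scheduler.

```python
def place_task(block, nodes, replicas, busy):
    # nodes: node -> rack, e.g. {"n1": "r1"}; busy: set of overloaded nodes
    # replicas: block -> list of nodes holding a copy, e.g. {"b1": ["n1"]}
    holders = [n for n in replicas[block] if n not in busy]
    if holders:
        return holders[0]            # node-local: move the code to the data
    racks = {nodes[n] for n in replicas[block]}
    same_rack = [n for n in nodes if nodes[n] in racks and n not in busy]
    if same_rack:
        return same_rack[0]          # rack-local: avoid scarce inter-rack bandwidth
    return next(n for n in nodes if n not in busy)  # anywhere: stream over network
```

For example, with one replica of a block on a busy node, the sketch falls back to another node on the same rack before resorting to a remote one.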

Synchronization. In general, synchronization refers to the mechanisms by which multiple concurrently running processes “join up”, for example, to share intermediate results or otherwise exchange state information. In MapReduce, synchronization is accomplished by a barrier between the map and reduce phases of processing. Intermediate key-value pairs must be grouped by key, which is accomplished by a large distributed sort involving all the nodes that executed map tasks and all the nodes that will execute reduce tasks. This necessarily involves copying intermediate data over the network, and therefore the process is commonly known as “shuffle and sort”. A MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations, since each mapper may have intermediate output going to every reducer.

Note that the reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the execution framework cannot otherwise guarantee that all values associated with the same key have been gathered. This is an important departure from functional programming: in a fold operation, the aggregation function g is a function of the intermediate value and the next item in the list—which means that values can be lazily generated and aggregation can begin as soon as values are available. In contrast, the reducer in MapReduce receives all values associated with the same key at once. However, it is possible to start copying intermediate key-value pairs over the network to the nodes running the reducers as soon as each mapper finishes—this is a common optimization and is implemented in Hadoop.
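The contrast can be made concrete in Python: a fold happily consumes a lazy stream one value at a time, whereas a reducer is handed the complete, already-gathered list of values for a key.

```python
from functools import reduce

def stream():
    # Values produced lazily; fold aggregates each one as it becomes available.
    for v in [1, 2, 3, 4]:
        yield v

folded = reduce(lambda acc, v: acc + v, stream(), 0)  # incremental aggregation

# A reducer, by contrast, sees all values for a key only after the
# shuffle-and-sort barrier has brought them together.
def reducer(key, values):
    return (key, sum(values))

reduced = reducer("k", [1, 2, 3, 4])
```

Both compute the same aggregate; the difference is purely in when aggregation is allowed to begin.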

Error and fault handling. The MapReduce execution framework must accomplish all the tasks above in an environment where errors and faults are the norm, not the exception. Since MapReduce was explicitly designed around low-end commodity servers, the runtime must be especially resilient. In large clusters, disk failures are common [123] and RAM experiences more errors than one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).

And that’s just hardware. No software is bug free—exceptions must be appropriately trapped, logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will contain corrupted data or records that are mangled beyond a programmer’s imagination—resulting in errors that one would never think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.

2.4 PARTITIONERS AND COMBINERS

We have thus far presented a simplified view of MapReduce. There are two additional elements that complete the programming model: partitioners and combiners.

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the task to which an intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order (which is how the “group by” is implemented). The simplest partitioner involves computing the hash value of the key and then taking the mod of that value with the number of reducers. This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function). Note, however, that the partitioner only considers the key and ignores the value—therefore, a roughly even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer (since different keys may have different numbers of associated values). This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.
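The simplest hash-mod partitioner can be written directly. The sketch below uses CRC32 rather than Python’s built-in `hash()`, which is randomized per process for strings and would make the assignment non-deterministic across runs:

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and take the result mod the number of reducers.
    # Only the key participates; the value plays no role in the assignment.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of a given key lands on the same reducer, which is what
# makes the distributed "group by" possible.
assignments = {w: partition(w, 4) for w in ["a", "rose", "is"]}
```

Even with a perfectly uniform hash, a key like “the” may carry millions of values while most keys carry a handful, which is the skew discussed above.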

Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. We can motivate the need for combiners by considering the word count algorithm in Figure 2.3, which emits a key-value pair for each word in the collection. All of these key-value pairs need to be copied across the network, so the amount of intermediate data will be larger than the input collection itself. This is clearly inefficient. One solution is to perform local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the documents processed by the mapper. With this modification (assuming the maximum amount of local aggregation possible), the number of intermediate key-value pairs will be at most the number of unique words in the collection times the number of mappers (and typically far smaller, because each mapper may not encounter every word).

The combiner in MapReduce supports such an optimization. One can think of combiners as “mini-reducers” that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners. In general, however, reducers and combiners are not interchangeable.

(A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at its discretion. In practice, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user’s reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments. Section 3.1.2 discusses this in more detail.)

Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers. Combiners can be viewed as “mini-reducers” in the map phase. Partitioners determine which reducer is responsible for a particular key.

In many cases, proper use of combiners can spell the difference between an impractical algorithm and an efficient one. This topic will be discussed in Section 3.1, which focuses on various techniques for local aggregation. It suffices to say for now that a combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms.
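The effect of a combiner on word count can be sketched by aggregating one mapper’s output locally before anything crosses the (simulated) network:

```python
from collections import Counter

def map_fn(docid, doc):
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # A "mini-reducer" over a single mapper's output: sum the partial counts
    # so that at most one pair per distinct word leaves this mapper.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

raw = map_fn("d1", "a rose is a rose")   # 5 intermediate pairs
combined = combine(raw)                   # only 3 pairs cross the network
```

Because addition is associative and commutative, the downstream reducer computes the same totals from the combined pairs as it would from the raw ones, just with less data shuffled.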

The complete MapReduce model is shown in Figure 2.4. Output of the mappers is processed by the combiners, which perform local aggregation to cut down on the number of intermediate key-value pairs. The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase. [13] Therefore, a complete MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.

[13] In Hadoop, partitioners are actually executed before combiners, so while Figure 2.4 is conceptually accurate, it doesn’t precisely describe the Hadoop implementation.
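The dataflow just described can be sketched as a toy, single-process simulation in Python (the function names and the two-reducer setup are our own illustration, not part of any Hadoop API):

```python
from collections import defaultdict

def mapper(document):
    # Emit (term, 1) for every token, as in the basic word count example.
    return [(term, 1) for term in document.split()]

def combiner(pairs):
    # "Mini-reducer": locally aggregate the output of a single mapper.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def partitioner(key, num_reducers):
    # Determine which reducer is responsible for a particular key.
    return hash(key) % num_reducers

def reducer(key, values):
    return (key, sum(values))

def run_job(documents, num_reducers=2):
    # Map phase: run each mapper, combine its output, partition the result.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for doc in documents:
        for key, value in combiner(mapper(doc)):
            partitions[partitioner(key, num_reducers)][key].append(value)
    # "Shuffle and sort": within each partition, values are grouped by key
    # and keys are processed in sorted order by the reducer.
    results = []
    for partition in partitions:
        for key in sorted(partition):
            results.append(reducer(key, partition[key]))
    return dict(results)
```

Running `run_job(["a b a", "b c"])` returns `{'a': 2, 'b': 2, 'c': 1}`; the combiner collapses the two `('a', 1)` pairs from the first document before anything crosses the (simulated) network.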


2.5 THE DISTRIBUTED FILE SYSTEM

So far, we have mostly focused on the processing aspect of data-intensive processing,

but it is important to recognize that without data, there is nothing to compute on. In

high-performance computing (HPC) and many traditional cluster architectures, stor-

age is viewed as a distinct and separate component from computation. Implementations

vary widely, but network-attached storage (NAS) and storage area networks (SAN) are

common; supercomputers often have dedicated subsystems for handling storage (sepa-

rate nodes, and often even separate networks). Regardless of the details, the processing

cycle remains the same at a high level: the compute nodes fetch input from storage, load

the data into memory, process the data, and then write back the results (with perhaps

intermediate checkpointing for long-running processes).

As dataset sizes increase, more compute capacity is required for processing. But as

compute capacity grows, the link between the compute nodes and the storage becomes

a bottleneck. At that point, one could invest in higher performance but more expensive

networks (e.g., 10 gigabit Ethernet) or special-purpose interconnects such as InﬁniBand

(even more expensive). In most cases, this is not a cost-eﬀective solution, as the price

of networking equipment increases non-linearly with performance (e.g., a switch with

ten times the capacity is usually more than ten times more expensive). Alternatively,

one could abandon the separation of computation and storage as distinct components

in a cluster. The distributed ﬁle system (DFS) that underlies MapReduce adopts ex-

actly this approach. The Google File System (GFS) [57] supports Google’s proprietary

implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed

File System) is an open-source implementation of GFS that supports Hadoop. Although

MapReduce doesn’t necessarily require the distributed ﬁle system, it is diﬃcult to re-

alize many of the advantages of the programming model without a storage substrate

that behaves much like the DFS. [14]

Of course, distributed ﬁle systems are not new [74, 32, 7, 147, 133]. The Map-

Reduce distributed ﬁle system builds on previous work but is speciﬁcally adapted to

large-data processing workloads, and therefore departs from previous architectures in

certain respects (see the discussion by Ghemawat et al. [57] in the original GFS paper).

The main idea is to divide user data into blocks and replicate those blocks across the

local disks of nodes in the cluster. Blocking data, of course, is not a new idea, but DFS

blocks are signiﬁcantly larger than block sizes in typical single-machine ﬁle systems (64

MB by default). The distributed ﬁle system adopts a master–slave architecture in which

the master maintains the ﬁle namespace (metadata, directory structure, ﬁle to block

mapping, location of blocks, and access permissions) and the slaves manage the actual

[14] However, there is evidence that existing POSIX-based distributed cluster file systems (e.g., GPFS or PVFS) can serve as a replacement for HDFS, when properly tuned or modified for MapReduce workloads [146, 6]. This, however, remains an experimental use case.

Figure 2.5: The architecture of HDFS. The namenode (master) is responsible for maintaining the file namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data.

data blocks. In GFS, the master is called the GFS master, and the slaves are called

GFS chunkservers. In Hadoop, the same roles are filled by the namenode and datanodes, respectively. [15] This book adopts the Hadoop terminology, although for most basic

ﬁle operations GFS and HDFS work much the same way. The architecture of HDFS is

shown in Figure 2.5, redrawn from a similar diagram describing GFS [57].

In HDFS, an application client wishing to read a ﬁle (or a portion thereof) must

ﬁrst contact the namenode to determine where the actual data is stored. In response

to the client request, the namenode returns the relevant block id and the location

where the block is held (i.e., which datanode). The client then contacts the datanode to

retrieve the data. Blocks are themselves stored on standard single-machine ﬁle systems,

so HDFS lies on top of the standard OS stack (e.g., Linux). An important feature of

the design is that data is never moved through the namenode. Instead, all data transfer

occurs directly between clients and datanodes; communications with the namenode only

involves transfer of metadata.
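As a rough illustration, this read path can be modeled with toy in-memory classes (the class and method names are ours; the real HDFS client/namenode protocol is an RPC interface with many more concerns, such as byte ranges and replica selection):

```python
class Namenode:
    """Holds only metadata: file names map to block ids, block ids to locations."""
    def __init__(self):
        self.file_to_blocks = {}   # file name -> list of block ids
        self.block_locations = {}  # block id -> list of datanode ids

    def lookup(self, filename):
        # Returns (block id, locations) pairs; no user data flows through here.
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[filename]]

class Datanode:
    """Stores actual block data (on a real datanode, in the local file system)."""
    def __init__(self):
        self.blocks = {}  # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, filename):
    # 1) Ask the namenode where the blocks live;
    # 2) fetch each block directly from a datanode holding a replica.
    data = b""
    for block_id, locations in namenode.lookup(filename):
        data += datanodes[locations[0]].read(block_id)
    return data
```

Note that `read_file` never pulls data through the `Namenode` object, mirroring the design property described above.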

By default, HDFS stores three separate copies of each data block to ensure reliability, availability, and performance. In large clusters, the three replicas are spread

across diﬀerent physical racks, so HDFS is resilient towards two common failure sce-

narios: individual datanode crashes and failures in networking equipment that bring

an entire rack offline. Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality. The namenode is in periodic communication with the datanodes to ensure proper replication of all the blocks: if there aren’t enough replicas (e.g., due to disk or machine failures or to connectivity losses due to networking equipment failures), the namenode directs the creation of additional copies; [16] if there are too many replicas (e.g., a repaired node rejoins the cluster), extra copies are discarded.

[15] To be precise, namenode and datanode may refer to physical machines in a cluster, or they may refer to daemons running on those machines providing the relevant services.

To create a new ﬁle and write data to HDFS, the application client ﬁrst contacts

the namenode, which updates the ﬁle namespace after checking permissions and making

sure the ﬁle doesn’t already exist. The namenode allocates a new block on a suitable

datanode, and the application is directed to stream data directly to it. From the initial

datanode, data is further propagated to additional replicas. In the most recent release of

Hadoop as of this writing (release 0.20.2), ﬁles are immutable—they cannot be modiﬁed

after creation. There are current plans to oﬃcially support ﬁle appends in the near

future, which is a feature already present in GFS.

In summary, the HDFS namenode has the following responsibilities:

• Namespace management. The namenode is responsible for maintaining the ﬁle

namespace, which includes metadata, directory structure, ﬁle to block mapping,

location of blocks, and access permissions. These data are held in memory for fast

access and all mutations are persistently logged.

• Coordinating ﬁle operations. The namenode directs application clients to datan-

odes for read operations, and allocates blocks on suitable datanodes for write

operations. All data transfers occur directly between clients and datanodes. When

a ﬁle is deleted, HDFS does not immediately reclaim the available physical storage;

rather, blocks are lazily garbage collected.

• Maintaining overall health of the ﬁle system. The namenode is in periodic contact

with the datanodes via heartbeat messages to ensure the integrity of the system.

If the namenode observes that a data block is under-replicated (fewer copies are

stored on datanodes than the desired replication factor), it will direct the creation

of new replicas. Finally, the namenode is also responsible for rebalancing the file system. [17] During the course of normal operations, certain datanodes may end up

holding more blocks than others; rebalancing involves moving blocks from datan-

odes with more blocks to datanodes with fewer blocks. This leads to better load

balancing and more even disk utilization.
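The under- and over-replication bookkeeping described above might be sketched as follows (a simplification; the real namenode tracks this incrementally from heartbeats and block reports rather than scanning all blocks):

```python
def replication_actions(block_replicas, replication_factor=3):
    """Given a mapping of block id -> list of live datanodes holding a
    replica, decide which blocks need new copies and which have extras.
    This is illustrative bookkeeping, not actual HDFS code."""
    create, discard = {}, {}
    for block, nodes in block_replicas.items():
        if len(nodes) < replication_factor:
            # Under-replicated: direct creation of additional copies.
            create[block] = replication_factor - len(nodes)
        elif len(nodes) > replication_factor:
            # Over-replicated (e.g., a repaired node rejoined): discard extras.
            discard[block] = len(nodes) - replication_factor
    return create, discard
```

For example, a block with one surviving replica yields two creation requests, while a block with four replicas yields one discard.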

[16] Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.

[17] In Hadoop, this is a manually-invoked process.


Since GFS and HDFS were specifically designed to support Google’s proprietary and the open-source implementation of MapReduce, respectively, they were built with

a number of assumptions about the operational environment, which in turn inﬂuenced

the design of the systems. Understanding these choices is critical to designing eﬀective

MapReduce algorithms:

• The ﬁle system stores a relatively modest number of large ﬁles. The deﬁnition of

“modest” varies by the size of the deployment, but in HDFS multi-gigabyte ﬁles

are common (and even encouraged). There are several reasons why lots of small

ﬁles are to be avoided. Since the namenode must hold all ﬁle metadata in memory,

this presents an upper bound on both the number of files and blocks that can be supported. [18] Large multi-block files represent a more efficient use of namenode

memory than many single-block ﬁles (each of which consumes less space than a

single block size). In addition, mappers in a MapReduce job use individual ﬁles as

a basic unit for splitting input data. At present, there is no default mechanism in

Hadoop that allows a mapper to process multiple ﬁles. As a result, mapping over

many small ﬁles will yield as many map tasks as there are ﬁles. This results in

two potential problems: ﬁrst, the startup costs of mappers may become signiﬁcant

compared to the time spent actually processing input key-value pairs; second, this

may result in an excessive amount of across-the-network copy operations during

the “shuﬄe and sort” phase (recall that a MapReduce job with m mappers and r

reducers involves up to mr distinct copy operations).

• Workloads are batch oriented, dominated by long streaming reads and large se-

quential writes. As a result, high sustained bandwidth is more important than low

latency. This exactly describes the nature of MapReduce jobs, which are batch

operations on large amounts of data. Due to the common-case workload, both

HDFS and GFS do not implement any form of data caching. [19]

• Applications are aware of the characteristics of the distributed ﬁle system. Neither

HDFS nor GFS present a general POSIX-compliant API, but rather support only

a subset of possible ﬁle operations. This simpliﬁes the design of the distributed

ﬁle system, and in essence pushes part of the data management onto the end

application. One rationale for this decision is that each application knows best

how to handle data speciﬁc to that application, for example, in terms of resolving

inconsistent states and optimizing the layout of data structures.

[18] According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes of memory on the namenode.

[19] However, since the distributed file system is built on top of a standard operating system such as Linux, there is still OS-level caching.


• The ﬁle system is deployed in an environment of cooperative users. There is no

discussion of security in the original GFS paper, but HDFS explicitly assumes a

datacenter environment where only authorized users have access. File permissions

in HDFS are only meant to prevent unintended operations and can be easily

circumvented. [20]

• The system is built from unreliable but inexpensive commodity components. As a

result, failures are the norm rather than the exception. HDFS is designed around

a number of self-monitoring and self-healing mechanisms to robustly cope with

common failure modes.
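To see concretely why many small files strain the namenode, one can estimate its metadata footprint using the ~150 bytes per block figure cited in footnote 18 (a back-of-the-envelope sketch; real per-file and per-directory overhead is ignored, so this is a lower bound):

```python
def namenode_memory_bytes(num_files, file_size_mb, block_size_mb=64,
                          bytes_per_block=150):
    """Rough lower bound on namenode memory consumed by block metadata.
    Assumes every file is the same size; bytes_per_block is the ~150-byte
    figure quoted on the Hadoop mailing list."""
    blocks_per_file = max(1, -(-file_size_mb // block_size_mb))  # ceiling division
    return num_files * blocks_per_file * bytes_per_block

# One terabyte stored as a million 1 MB files vs. a thousand 1 GB files:
small_files = namenode_memory_bytes(num_files=1_000_000, file_size_mb=1)
large_files = namenode_memory_bytes(num_files=1_000, file_size_mb=1024)
```

Under these assumptions, the small-file layout consumes roughly 150 MB of namenode memory versus about 2.4 MB for the large-file layout, a difference of over 60 times for the same terabyte of data.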

Finally, some discussion is necessary to understand the single-master design of HDFS

and GFS. It has been demonstrated that in large-scale distributed systems, simultane-

ously providing consistency, availability, and partition tolerance is impossible—this is

Brewer’s so-called CAP Theorem [58]. Since partitioning is unavoidable in large-data

systems, the real tradeoﬀ is between consistency and availability. A single-master de-

sign trades availability for consistency and signiﬁcantly simpliﬁes implementation. If the

master (HDFS namenode or GFS master) goes down, the entire ﬁle system becomes

unavailable, which trivially guarantees that the ﬁle system will never be in an incon-

sistent state. An alternative design might involve multiple masters that jointly manage

the ﬁle namespace—such an architecture would increase availability (if one goes down,

another can step in) at the cost of consistency, not to mention requiring a more complex

implementation (cf. [4, 105]).

The single-master design of GFS and HDFS is a well-known weakness, since if

the master goes oﬄine, the entire ﬁle system and all MapReduce jobs running on top

of it will grind to a halt. This weakness is mitigated in part by the lightweight nature

of ﬁle system operations. Recall that no data is ever moved through the namenode and

that all communication between clients and the namenode involves only metadata. Because

of this, the namenode rarely is the bottleneck, and for the most part avoids load-

induced crashes. In practice, this single point of failure is not as severe a limitation as

it may appear—with diligent monitoring of the namenode, mean times between failure measured in months are not uncommon for production deployments. Furthermore, the

Hadoop community is well-aware of this problem and has developed several reasonable

workarounds—for example, a warm standby namenode that can be quickly switched

over when the primary namenode fails. The open source environment and the fact

that many organizations already depend on Hadoop for production systems virtually

guarantees that more eﬀective solutions will be developed over time.

[20] However, there are existing plans to integrate Kerberos into Hadoop/HDFS.

Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.

2.6 HADOOP CLUSTER ARCHITECTURE

Putting everything together, the architecture of a complete Hadoop cluster is shown in

Figure 2.6. The HDFS namenode runs the namenode daemon. The job submission node

runs the jobtracker, which is the single point of contact for a client wishing to execute a

MapReduce job. The jobtracker monitors the progress of running MapReduce jobs and

is responsible for coordinating the execution of the mappers and reducers. Typically,

these services run on two separate machines, although in smaller clusters they are often

co-located. The bulk of a Hadoop cluster consists of slave nodes (only three of which

are shown in the ﬁgure) that run both a tasktracker, which is responsible for actually

running user code, and a datanode daemon, for serving HDFS data.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce

tasks. Tasktrackers periodically send heartbeat messages to the jobtracker; these heartbeats also double as a vehicle for task allocation. If a tasktracker is available to run tasks (in

Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker

heartbeat contains task allocation information. The number of reduce tasks is equal

to the number of reducers speciﬁed by the programmer. The number of map tasks,

on the other hand, depends on many factors: the number of mappers speciﬁed by

the programmer serves as a hint to the execution framework, but the actual number

of tasks depends on both the number of input ﬁles and the number of HDFS data

blocks occupied by those ﬁles. Each map task is assigned a sequence of input key-value


pairs, called an input split in Hadoop. Input splits are computed automatically and the

execution framework strives to align them to HDFS block boundaries so that each map

task is associated with a single data block. In scheduling map tasks, the jobtracker tries

to take advantage of data locality—if possible, map tasks are scheduled on the slave

node that holds the input split, so that the mapper will be processing local data. The

alignment of input splits with HDFS block boundaries simpliﬁes task scheduling. If it

is not possible to run a map task on local data, it becomes necessary to stream input

key-value pairs across the network. Since large clusters are organized into racks, with

far greater intra-rack bandwidth than inter-rack bandwidth, the execution framework

strives to at least place map tasks on a rack which has a copy of the data block.
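A simplified version of split computation might look like this (Hadoop’s actual InputFormat logic additionally handles records that straddle block boundaries; this sketch just carves each file at block boundaries, one split per block):

```python
def compute_splits(file_lengths, block_size=64 * 1024 * 1024):
    """Return (filename, offset, length) splits for a set of input files.
    Each file starts its own splits, since a map task does not span files,
    and split boundaries are aligned to the (default 64 MB) block size."""
    splits = []
    for name, length in file_lengths.items():
        offset = 0
        while offset < length:
            splits.append((name, offset, min(block_size, length - offset)))
            offset += block_size
    return splits
```

With a toy 64-byte "block size", a 100-byte file yields two splits: one full block and one 36-byte remainder. This also makes the small-files problem visible: every file contributes at least one split, hence at least one map task.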

Although conceptually in MapReduce one can think of the mapper being applied

to all input key-value pairs and the reducer being applied to all values associated with

the same key, actual job execution is a bit more complex. In Hadoop, mappers are Java

objects with a Map method (among others). A mapper object is instantiated for every

map task by the tasktracker. The life-cycle of this object begins with instantiation,

where a hook is provided in the API to run programmer-speciﬁed code. This means

that mappers can read in “side data”, providing an opportunity to load state, static

data sources, dictionaries, etc. After initialization, the Map method is called (by the

execution framework) on all key-value pairs in the input split. Since these method

calls occur in the context of the same Java object, it is possible to preserve state across

multiple input key-value pairs within the same map task—this is an important property

to exploit in the design of MapReduce algorithms, as we will see in the next chapter.

After all key-value pairs in the input split have been processed, the mapper object

provides an opportunity to run programmer-speciﬁed termination code. This, too, will

be important in the design of MapReduce algorithms.

The actual execution of reducers is similar to that of the mappers. Each re-

ducer object is instantiated for every reduce task. The Hadoop API provides hooks for

programmer-speciﬁed initialization and termination code. After initialization, for each

intermediate key in the partition (deﬁned by the partitioner), the execution framework

repeatedly calls the Reduce method with an intermediate key and an iterator over

all values associated with that key. The programming model also guarantees that in-

termediate keys will be presented to the Reduce method in sorted order. Since this

occurs in the context of a single object, it is possible to preserve state across multiple

intermediate keys (and associated values) within a single reduce task. Once again, this

property is critical in the design of MapReduce algorithms and will be discussed in the

next chapter.
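This reduce-side behavior can be mimicked in a few lines (a sketch only; the real framework merges pre-sorted map outputs rather than sorting in memory, and the reducer object names here are our own):

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(intermediate_pairs, reduce_fn):
    """Simulate the reduce side: sort the partition's pairs by key, then
    call reduce_fn once per key with an iterator over its values, so keys
    are presented in sorted order."""
    output = []
    pairs = sorted(intermediate_pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.append(reduce_fn(key, (value for _, value in group)))
    return output

class SumReducer:
    """Because all Reduce calls for one task share one object, state can be
    preserved across intermediate keys, e.g. a running grand total."""
    def __init__(self):
        self.grand_total = 0  # state preserved across intermediate keys

    def __call__(self, key, values):
        total = sum(values)
        self.grand_total += total
        return (key, total)
```

After reducing `[("b", 1), ("a", 2), ("b", 3)]`, the output is `[("a", 2), ("b", 4)]` and the reducer’s `grand_total` is 6, illustrating both the sorted-key guarantee and cross-key state.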


2.7 SUMMARY

This chapter provides a basic overview of the MapReduce programming model, starting

with its roots in functional programming and continuing with a description of mappers,

reducers, partitioners, and combiners. Signiﬁcant attention is also given to the underly-

ing distributed ﬁle system, which is a tightly-integrated component of the MapReduce

environment. Given this basic understanding, we now turn our attention to the design

of MapReduce algorithms.


CHAPTER 3

MapReduce Algorithm Design

A large part of the power of MapReduce comes from its simplicity: in addition to

preparing the input data, the programmer needs only to implement the mapper, the

reducer, and optionally, the combiner and the partitioner. All other aspects of execution

are handled transparently by the execution framework—on clusters ranging from a

single node to a few thousand nodes, over datasets ranging from gigabytes to petabytes.

However, this also means that any conceivable algorithm that a programmer wishes to

develop must be expressed in terms of a small number of rigidly-deﬁned components

that must ﬁt together in very speciﬁc ways. It may not appear obvious how a multitude

of algorithms can be recast into this programming model. The purpose of this chapter is

to provide, primarily through examples, a guide to MapReduce algorithm design. These

examples illustrate what can be thought of as “design patterns” for MapReduce, which

instantiate arrangements of components and speciﬁc techniques designed to handle

frequently-encountered situations across a variety of problem domains. Two of these

design patterns are used in the scalable inverted indexing algorithm we’ll present later

in Chapter 4; concepts presented here will show up again in Chapter 5 (graph processing)

and Chapter 6 (expectation-maximization algorithms).

Synchronization is perhaps the most tricky aspect of designing MapReduce algo-

rithms (or for that matter, parallel and distributed algorithms in general). Other than

embarrassingly-parallel problems, processes running on separate nodes in a cluster must,

at some point in time, come together—for example, to distribute partial results from

nodes that produced them to the nodes that will consume them. Within a single Map-

Reduce job, there is only one opportunity for cluster-wide synchronization—during the

shuﬄe and sort stage where intermediate key-value pairs are copied from the mappers

to the reducers and grouped by key. Beyond that, mappers and reducers run in isolation

without any mechanisms for direct communication. Furthermore, the programmer has

little control over many aspects of execution, for example:

• Where a mapper or reducer runs (i.e., on which node in the cluster).

• When a mapper or reducer begins or ﬁnishes.

• Which input key-value pairs are processed by a speciﬁc mapper.

• Which intermediate key-value pairs are processed by a speciﬁc reducer.

Nevertheless, the programmer does have a number of techniques for controlling execu-

tion and managing the ﬂow of data in MapReduce. In summary, they are:


1. The ability to construct complex data structures as keys and values to store and

communicate partial results.

2. The ability to execute user-speciﬁed initialization code at the beginning of a map

or reduce task, and the ability to execute user-speciﬁed termination code at the

end of a map or reduce task.

3. The ability to preserve state in both mappers and reducers across multiple input

or intermediate keys.

4. The ability to control the sort order of intermediate keys, and therefore the order

in which a reducer will encounter particular keys.

5. The ability to control the partitioning of the key space, and therefore the set of

keys that will be encountered by a particular reducer.
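As a toy illustration of techniques 4 and 5 above, the following partitions composite (left, right) keys on their left element only and then sorts each partition, so a reducer sees all composite keys sharing a left element contiguously and in sorted order (the hashing scheme is our own, not Hadoop’s default partitioner):

```python
def pair_partitioner(key, num_reducers):
    """Partition composite (left, right) keys on the left element only,
    so every pair sharing a left element reaches the same reducer."""
    left, _ = key
    return hash(left) % num_reducers

def assign(pairs, num_reducers=4):
    """Route (key, value) pairs to reducer partitions, then sort each
    partition to mimic the sorted order guaranteed to the reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[pair_partitioner(key, num_reducers)].append((key, value))
    for partition in partitions:
        partition.sort()  # technique 4: control the key order seen by the reducer
    return partitions
```

All `("dog", *)` keys land in a single partition regardless of their right elements, which is the essence of the partitioning control used by the “pairs” pattern in Section 3.2.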

It is important to realize that many algorithms cannot be easily expressed as a single

MapReduce job. One must often decompose complex algorithms into a sequence of jobs,

which requires orchestrating data so that the output of one job becomes the input to the

next. Many algorithms are iterative in nature, requiring repeated execution until some convergence criterion is met—graph algorithms in Chapter 5 and expectation-maximization algorithms in Chapter 6 behave in exactly this way. Often, the convergence check itself

cannot be easily expressed in MapReduce. The standard solution is an external (non-

MapReduce) program that serves as a “driver” to coordinate MapReduce iterations.
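A minimal driver skeleton, with a plain Python callable standing in for job submission, might look like this (entirely illustrative; a real driver would submit jobs to a cluster and read convergence statistics from counters or output files):

```python
def run_iterative_job(state, run_mapreduce_job, converged, max_iterations=50):
    """Driver skeleton: repeatedly launch a "job" (here just a callable)
    until a convergence test on successive states succeeds, or a
    maximum iteration count is reached."""
    for iteration in range(max_iterations):
        new_state = run_mapreduce_job(state)
        if converged(state, new_state):
            return new_state, iteration + 1
        state = new_state
    return state, max_iterations

# Toy example: repeatedly halve a value until successive values differ
# by less than 0.1 (standing in for, say, PageRank mass convergence).
result, iters = run_iterative_job(
    16.0,
    run_mapreduce_job=lambda x: x / 2,
    converged=lambda old, new: abs(old - new) < 0.1)
```

The driver, not the jobs themselves, owns the loop and the convergence check, exactly as described above.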

This chapter explains how various techniques to control code execution and

data ﬂow can be applied to design algorithms in MapReduce. The focus is both on

scalability—ensuring that there are no inherent bottlenecks as algorithms are applied

to increasingly larger datasets—and eﬃciency—ensuring that algorithms do not need-

lessly consume resources and thereby reducing the cost of parallelization. The gold

standard, of course, is linear scalability: an algorithm running on twice the amount

of data should take only twice as long. Similarly, an algorithm running on twice the

number of nodes should only take half as long.

The chapter is organized as follows:

• Section 3.1 introduces the important concept of local aggregation in MapReduce

and strategies for designing eﬃcient algorithms that minimize the amount of par-

tial results that need to be copied across the network. The proper use of combiners

is discussed in detail, as well as the “in-mapper combining” design pattern.

• Section 3.2 uses the example of building word co-occurrence matrices on large

text corpora to illustrate two common design patterns, which we dub “pairs” and

“stripes”. These two approaches are useful in a large class of problems that require

keeping track of joint events across a large number of observations.


• Section 3.3 shows how co-occurrence counts can be converted into relative frequen-

cies using a pattern known as “order inversion”. The sequencing of computations

in the reducer can be recast as a sorting problem, where pieces of intermediate

data are sorted into exactly the order that is required to carry out a series of

computations. Often, a reducer needs to compute an aggregate statistic on a set

of elements before individual elements can be processed. Normally, this would re-

quire two passes over the data, but with the “order inversion” design pattern, the

aggregate statistic can be computed in the reducer before the individual elements

are encountered. This may seem counter-intuitive: how can we compute an aggre-

gate statistic on a set of elements before encountering elements of that set? As it

turns out, clever sorting of special key-value pairs enables exactly this.

• Section 3.4 provides a general solution to secondary sorting, which is the problem

of sorting values associated with a key in the reduce phase. We call this technique

“value-to-key conversion”.

• Section 3.5 covers the topic of performing joins on relational datasets and presents

three diﬀerent approaches: reduce-side, map-side, and memory-backed joins.

3.1 LOCAL AGGREGATION

In the context of data-intensive distributed processing, the single most important as-

pect of synchronization is the exchange of intermediate results, from the processes that

produced them to the processes that will ultimately consume them. In a cluster environ-

ment, with the exception of embarrassingly-parallel problems, this necessarily involves

transferring data over the network. Furthermore, in Hadoop, intermediate results are

written to local disk before being sent over the network. Since network and disk laten-

cies are relatively expensive compared to other operations, reductions in the amount of

intermediate data translate into increases in algorithmic eﬃciency. In MapReduce, local

aggregation of intermediate results is one of the keys to eﬃcient algorithms. Through

use of the combiner and by taking advantage of the ability to preserve state across

multiple inputs, it is often possible to substantially reduce both the number and size of

key-value pairs that need to be shuﬄed from the mappers to the reducers.

3.1.1 COMBINERS AND IN-MAPPER COMBINING

We illustrate various techniques for local aggregation using the simple word count ex-

ample presented in Section 2.2. For convenience, Figure 3.1 repeats the pseudo-code of

the basic algorithm, which is quite simple: the mapper emits an intermediate key-value

pair for each term observed, with the term itself as the key and a value of one; reducers

sum up the partial counts to arrive at the ﬁnal count.


1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)

1: class Reducer
2:   method Reduce(term t, counts [c1, c2, . . .])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       sum ← sum + c
6:     Emit(term t, count sum)

Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from Figure 2.3).

The ﬁrst technique for local aggregation is the combiner, already discussed in

Section 2.4. Combiners provide a general mechanism within the MapReduce framework

to reduce the amount of intermediate data generated by the mappers—recall that they

can be understood as “mini-reducers” that process the output of mappers. In this

example, the combiners aggregate term counts across the documents processed by each

map task. This results in a reduction in the number of intermediate key-value pairs that

need to be shuffled across the network—from the order of the total number of terms in the collection to the order of the number of unique terms in the collection. [1]

An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is

modiﬁed but the reducer remains the same as in Figure 3.1 and therefore is not re-

peated). An associative array (i.e., Map in Java) is introduced inside the mapper to

tally up term counts within a single document: instead of emitting a key-value pair for

each term in the document, this version emits a key-value pair for each unique term in

the document. Given that some words appear frequently within a document (for exam-

ple, a document about dogs is likely to have many occurrences of the word “dog”), this

can yield substantial savings in the number of intermediate key-value pairs emitted,

especially for long documents.

[1] More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most m × V intermediate key-value pairs, where m is the number of mappers and V is the vocabulary size (number of unique terms in the collection), since every term could have been observed in every mapper. However, there are two additional factors to consider. Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers (for example, terms that occur only once will by definition only be observed by one mapper). On the other hand, combiners in Hadoop are treated as optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial aggregation.


1: class Mapper
2:   method Map(docid a, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1                         ▷ Tally counts for entire document
6:     for all term t ∈ H do
7:       Emit(term t, count H{t})

Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an associative array to aggregate term counts on a per-document basis. The reducer is the same as in Figure 3.1.
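In a concrete language, the per-document tallying of Figure 3.2 might look like this (an illustrative Python rendering, not Hadoop code; the function name is our own):

```python
from collections import defaultdict

def map_with_document_aggregation(docid, doc):
    """Emit one (term, count) pair per *unique* term in the document,
    rather than one pair per token (cf. Figure 3.2). Output is sorted
    only to make the result deterministic for inspection."""
    histogram = defaultdict(int)
    for term in doc.split():
        histogram[term] += 1  # tally counts for the entire document
    return sorted(histogram.items())
```

For the document "dog cat dog dog", this emits two pairs instead of four: `[("cat", 1), ("dog", 3)]`.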

This basic idea can be taken one step further, as illustrated in the variant of the

word count algorithm in Figure 3.3 (once again, only the mapper is modiﬁed). The

workings of this algorithm critically depend on the details of how map and reduce tasks in Hadoop are executed, as discussed in Section 2.6. Recall that a (Java) mapper object

is created for each map task, which is responsible for processing a block of input key-

value pairs. Prior to processing any input key-value pairs, the mapper’s Initialize

method is called, which is an API hook for user-speciﬁed code. In this case, we initialize

an associative array for holding term counts. Since it is possible to preserve state across

multiple calls of the Map method (for each input key-value pair), we can continue

to accumulate partial term counts in the associative array across multiple documents,

and emit key-value pairs only when the mapper has processed all documents. That is,

emission of intermediate data is deferred until the Close method in the pseudo-code.

Recall that this API hook provides an opportunity to execute user-speciﬁed code after

the Map method has been applied to all input key-value pairs of the input data split

to which the map task was assigned.

With this technique, we are in essence incorporating combiner functionality directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation are already exploited.[2] This is a sufficiently common design pattern in MapReduce that it’s worth giving it a name, “in-mapper combining”, so that we can refer to the pattern more conveniently throughout the book. We’ll see later on how this pattern can be applied to a variety of problems. There are two main advantages to using this design pattern:

First, it provides control over when local aggregation occurs and exactly how it takes place. In contrast, the semantics of the combiner is underspecified in MapReduce.

[2] Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also (when merging intermediate key-value pairs from different map tasks). However, in practice it makes almost no difference either way.

44 CHAPTER 3. MAPREDUCE ALGORITHM DESIGN

1: class Mapper
2:   method Initialize
3:     H ← new AssociativeArray
4:   method Map(docid a, doc d)
5:     for all term t ∈ doc d do
6:       H{t} ← H{t} + 1                          ▷ Tally counts across documents
7:   method Close
8:     for all term t ∈ H do
9:       Emit(term t, count H{t})

Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demonstrates the “in-mapper combining” design pattern. Reducer is the same as in Figure 3.1.
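The pattern in Figure 3.3 can be sketched outside of Hadoop in a few lines of Python. The method names mirror the pseudo-code’s API hooks (Initialize, Map, Close), but the class and driver code below are our own illustration; a real mapper would emit pairs through the framework rather than return them.

```python
from collections import defaultdict

class InMapperCombiningMapper:
    """Sketch of the in-mapper combining word count mapper (Figure 3.3).

    The associative array H persists across calls to map(); emission of
    intermediate data is deferred until close() is called, after the
    entire input split has been processed."""

    def initialize(self):
        self.H = defaultdict(int)  # term -> partial count for this map task

    def map(self, docid, doc):
        for term in doc.split():
            self.H[term] += 1      # tally counts across documents

    def close(self):
        # one key-value pair per distinct term observed by this mapper
        return list(self.H.items())

# Simulate one map task processing a two-document input split.
m = InMapperCombiningMapper()
m.initialize()
m.map("d1", "the dog saw the cat")
m.map("d2", "the cat sat")
emitted = dict(m.close())
```

Here the mapper emits five pairs for the whole split rather than eight (one per token), which is exactly the savings a separate combiner would otherwise have to recover.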

For example, Hadoop makes no guarantees on how many times the combiner is applied,

or that it is even applied at all. The combiner is provided as a semantics-preserving

optimization to the execution framework, which has the option of using it, perhaps

multiple times, or not at all (or even in the reduce phase). In some cases (although not

in this particular example), such indeterminism is unacceptable, which is exactly why

programmers often choose to perform their own local aggregation in the mappers.

Second, in-mapper combining will typically be more eﬃcient than using actual

combiners. One reason for this is the additional overhead associated with actually ma-

terializing the key-value pairs. Combiners reduce the amount of intermediate data that

is shuﬄed across the network, but don’t actually reduce the number of key-value pairs

that are emitted by the mappers in the ﬁrst place. With the algorithm in Figure 3.2,

intermediate key-value pairs are still generated on a per-document basis, only to be

“compacted” by the combiners. This process involves unnecessary object creation and

destruction (garbage collection takes time), and furthermore, object serialization and

deserialization (when intermediate key-value pairs ﬁll the in-memory buﬀer holding map

outputs and need to be temporarily spilled to disk). In contrast, with in-mapper com-

bining, the mappers will generate only those key-value pairs that need to be shuﬄed

across the network to the reducers.

There are, however, drawbacks to the in-mapper combining pattern. First, it

breaks the functional programming underpinnings of MapReduce, since state is be-

ing preserved across multiple input key-value pairs. Ultimately, this isn’t a big deal,

since pragmatic concerns for eﬃciency often trump theoretical “purity”, but there are

practical consequences as well. Preserving state across multiple input instances means

that algorithmic behavior may depend on the order in which input key-value pairs are

encountered. This creates the potential for ordering-dependent bugs, which are diﬃcult

to debug on large datasets in the general case (although the correctness of in-mapper


combining for word count is easy to demonstrate). Second, there is a fundamental scala-

bility bottleneck associated with the in-mapper combining pattern. It critically depends

on having suﬃcient memory to store intermediate results until the mapper has com-

pletely processed all key-value pairs in an input split. In the word count example, the

memory footprint is bound by the vocabulary size, since it is theoretically possible that

a mapper encounters every term in the collection. Heaps’ Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of the collection size—the somewhat surprising fact is that the vocabulary size never stops growing.[3] Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory.[4]

One common solution to limiting memory usage when using the in-mapper com-

bining technique is to “block” input key-value pairs and “ﬂush” in-memory data struc-

tures periodically. The idea is simple: instead of emitting intermediate data only after

every key-value pair has been processed, emit partial results after processing every n

key-value pairs. This is straightforwardly implemented with a counter variable that

keeps track of the number of input key-value pairs that have been processed. As an

alternative, the mapper could keep track of its own memory footprint and ﬂush inter-

mediate key-value pairs once memory usage has crossed a certain threshold. In both

approaches, either the block size or the memory usage threshold needs to be determined

empirically: with too large a value, the mapper may run out of memory, but with too

small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop

physical memory is split between multiple tasks that may be running on a node con-

currently; these tasks are all competing for ﬁnite resources, but since the tasks are not

aware of each other, it is diﬃcult to coordinate resource consumption eﬀectively. In

practice, however, one often encounters diminishing returns in performance gains with

increasing buﬀer sizes, such that it is not worth the eﬀort to search for an optimal buﬀer

size (personal communication, Jeﬀ Dean).
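The block-and-flush idea can be sketched as follows; block_size is a hypothetical tuning knob (determined empirically in practice, as discussed above), and the emitted list stands in for the framework’s Emit.

```python
from collections import defaultdict

class FlushingMapper:
    """In-mapper combining with periodic flushing: partial results are
    emitted after every block_size input key-value pairs, bounding the
    size of the in-memory associative array."""

    def __init__(self, block_size):
        self.block_size = block_size  # hypothetical, empirically tuned knob
        self.seen = 0                 # counter of input pairs processed
        self.H = defaultdict(int)
        self.emitted = []             # stands in for Emit()

    def flush(self):
        self.emitted.extend(self.H.items())
        self.H.clear()

    def map(self, docid, doc):
        for term in doc.split():
            self.H[term] += 1
        self.seen += 1
        if self.seen % self.block_size == 0:
            self.flush()              # emit partial results every n pairs

    def close(self):
        self.flush()                  # emit whatever remains

m = FlushingMapper(block_size=2)
m.map("d1", "a b a")
m.map("d2", "b c")                    # second input pair triggers a flush
m.map("d3", "a")
m.close()
```

The reducer is unchanged: duplicate keys across flushes (here, two partial counts for 'a') are simply aggregated downstream.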

In MapReduce algorithms, the extent to which eﬃciency can be increased through

local aggregation depends on the size of the intermediate key space, the distribution of

keys themselves, and the number of key-value pairs that are emitted by each individual

map task. Opportunities for aggregation, after all, come from having multiple values

associated with the same key (whether one uses combiners or employs the in-mapper

combining pattern). In the word count example, local aggregation is eﬀective because

[3] In more detail, Heaps’ Law relates the vocabulary size V to the collection size as follows: V = kT^b, where T is the number of tokens in the collection. Typical values of the parameters k and b are: 30 ≤ k ≤ 100 and b ∼ 0.5 ([101], p. 81).

[4] A few more details: note that what matters is that the partial term counts encountered within a particular input split fit into memory. However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map tasks (in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network).


many words are encountered multiple times within a map task. Local aggregation is also

an eﬀective technique for dealing with reduce stragglers (see Section 2.3) that result

from a highly-skewed (e.g., Zipﬁan) distribution of values associated with intermediate

keys. In our word count example, we do not ﬁlter frequently-occurring words: therefore,

without local aggregation, the reducer that’s responsible for computing the count of

‘the’ will have a lot more work to do than the typical reducer, and therefore will likely

be a straggler. With local aggregation (either combiners or in-mapper combining), we

substantially reduce the number of values associated with frequently-occurring terms,

which alleviates the reduce straggler problem.

3.1.2 ALGORITHMIC CORRECTNESS WITH LOCAL AGGREGATION

Although use of combiners can yield dramatic reductions in algorithm running time,

care must be taken in applying them. Since combiners in Hadoop are viewed as op-

tional optimizations, the correctness of the algorithm cannot depend on computations

performed by the combiner or depend on them even being run at all. In any MapReduce

program, the reducer input key-value type must match the mapper output key-value

type: this implies that the combiner input and output key-value types must match the

mapper output key-value type (which is the same as the reducer input key-value type).

In cases where the reduce computation is both commutative and associative, the re-

ducer can also be used (unmodiﬁed) as the combiner (as is the case with the word count

example). In the general case, however, combiners and reducers are not interchangeable.

Consider a simple example: we have a large dataset where input keys are strings

and input values are integers, and we wish to compute the mean of all integers associated

with the same key (rounded to the nearest integer). A real-world example might be a

large user log from a popular website, where keys represent user ids and values represent

some measure of activity such as elapsed time for a particular session—the task would

correspond to computing the mean session length on a per-user basis, which would

be useful for understanding user demographics. Figure 3.4 shows the pseudo-code of

a simple algorithm for accomplishing this task that does not involve combiners. We

use an identity mapper, which simply passes all input key-value pairs to the reducers

(appropriately grouped and sorted). The reducer keeps track of the running sum and

the number of integers encountered. This information is used to compute the mean once

all values are processed. The mean is then emitted as the output value in the reducer

(with the input string as the key).

This algorithm will indeed work, but suﬀers from the same drawbacks as the

basic word count algorithm in Figure 3.1: it requires shuﬄing all key-value pairs from

mappers to reducers across the network, which is highly ineﬃcient. Unlike in the word

count example, the reducer cannot be used as a combiner in this case. Consider what

would happen if we did: the combiner would compute the mean of an arbitrary subset


1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, integer r)

1: class Reducer
2:   method Reduce(string t, integers [r1, r2, . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all integer r ∈ integers [r1, r2, . . .] do
6:       sum ← sum + r
7:       cnt ← cnt + 1
8:     r_avg ← sum/cnt
9:     Emit(string t, integer r_avg)

Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values associated with the same key.

of values associated with the same key, and the reducer would compute the mean of those values. As a concrete example, we know that:

Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

since the left side is 3 but the right side is Mean(1.5, 4) = 2.75. In general, the mean of means of arbitrary subsets of a set of numbers is not the same as the mean of the set of numbers. Therefore, this approach would not produce the correct result.[5]
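The failure is easy to check numerically in a couple of lines (a sketch for illustration, not part of the original algorithm):

```python
def mean(values):
    return sum(values) / len(values)

# Reducer-as-combiner would first average arbitrary subsets, then the
# reducer would average those averages:
mean_of_means = mean([mean([1, 2]), mean([3, 4, 5])])  # mean of 1.5 and 4.0

assert mean([1, 2, 3, 4, 5]) == 3.0
assert mean_of_means == 2.75   # not 3.0: the combined result is wrong
```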

So how might we properly take advantage of combiners? An attempt is shown in

Figure 3.5. The mapper remains the same, but we have added a combiner that partially

aggregates results by computing the numeric components necessary to arrive at the

mean. The combiner receives each string and the associated list of integer values, from

which it computes the sum of those values and the number of integers encountered (i.e.,

the count). The sum and count are packaged into a pair, and emitted as the output

of the combiner, with the same string as the key. In the reducer, pairs of partial sums

and counts can be aggregated to arrive at the mean. Up until now, all keys and values

in our algorithms have been primitives (strings, integers, etc.). However, there are no prohibitions in MapReduce for more complex types,[6] and, in fact, this represents a key technique in MapReduce algorithm design that we introduced at the beginning of this

[5] There is, however, one special case in which using reducers as combiners would produce the correct result: if each combiner computed the mean of equal-size subsets of the values. However, since such fine-grained control over the combiners is impossible in MapReduce, such a scenario is highly unlikely.

[6] In Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro.


1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, integer r)

1: class Combiner
2:   method Combine(string t, integers [r1, r2, . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all integer r ∈ integers [r1, r2, . . .] do
6:       sum ← sum + r
7:       cnt ← cnt + 1
8:     Emit(string t, pair (sum, cnt))            ▷ Separate sum and count

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2) . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     r_avg ← sum/cnt
9:     Emit(string t, integer r_avg)

Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute the mean of values associated with each key. The mismatch between combiner input and output key-value types violates the MapReduce programming model.

chapter. We will frequently encounter complex keys and values throughout the rest of this book.

Unfortunately, this algorithm will not work. Recall that combiners must have the

same input and output key-value type, which also must be the same as the mapper

output type and the reducer input type. This is clearly not the case. To understand

why this restriction is necessary in the programming model, remember that combiners

are optimizations that cannot change the correctness of the algorithm. So let us remove

the combiner and see what happens: the output value type of the mapper is integer,

so the reducer expects to receive a list of integers as values. But the reducer actually

expects a list of pairs! The correctness of the algorithm is contingent on the combiner

running on the output of the mappers, and more speciﬁcally, that the combiner is run

exactly once. Recall from our previous discussion that Hadoop makes no guarantees on


1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, pair (r, 1))

1: class Combiner
2:   method Combine(string t, pairs [(s1, c1), (s2, c2) . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     Emit(string t, pair (sum, cnt))

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2) . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     r_avg ← sum/cnt
9:     Emit(string t, integer r_avg)

Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key. This algorithm correctly takes advantage of combiners.

how many times combiners are called; it could be zero, one, or multiple times. This

violates the MapReduce programming model.

Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm

is correct. In the mapper we emit as the value a pair consisting of the integer and

one—this corresponds to a partial count over one instance. The combiner separately

aggregates the partial sums and the partial counts (as before), and emits pairs with

updated sums and counts. The reducer is similar to the combiner, except that the

mean is computed at the end. In essence, this algorithm transforms a non-associative

operation (mean of numbers) into an associative operation (element-wise sum of a pair

of numbers, with an additional division at the very end).
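The claim can be verified mechanically. The sketch below is our own rendering of Figure 3.6 in Python (the function names are ours); it checks that the final mean is identical whether the combine step runs on local subsets or not at all.

```python
def map_fn(key, value):
    # emit a (sum, count) pair: a partial aggregate over one instance
    return (key, (value, 1))

def combine(key, pairs):
    # element-wise sum of (sum, count) pairs; associative and commutative,
    # so it is safe to apply zero, one, or many times
    return (key, (sum(s for s, c in pairs), sum(c for s, c in pairs)))

def reduce_fn(key, pairs):
    _, (s, c) = combine(key, pairs)
    return (key, s / c)  # the division happens only at the very end

values = [1, 2, 3, 4, 5]

# With combiners: two map tasks pre-aggregate their local pairs.
left = combine("alice", [map_fn("alice", v)[1] for v in values[:2]])[1]
right = combine("alice", [map_fn("alice", v)[1] for v in values[2:]])[1]
with_combiners = reduce_fn("alice", [left, right])

# Without combiners: raw (r, 1) pairs go straight to the reducer.
without_combiners = reduce_fn("alice", [map_fn("alice", v)[1] for v in values])
```

Both paths produce ('alice', 3.0), unlike the mean-of-means approach above.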

Let us verify the correctness of this algorithm by repeating the previous exercise:

What would happen if no combiners were run? With no combiners, the mappers would

send pairs (as values) directly to the reducers. There would be as many intermediate

pairs as there were input key-value pairs, and each of those would consist of an integer


1: class Mapper
2:   method Initialize
3:     S ← new AssociativeArray
4:     C ← new AssociativeArray
5:   method Map(string t, integer r)
6:     S{t} ← S{t} + r
7:     C{t} ← C{t} + 1
8:   method Close
9:     for all term t ∈ S do
10:      Emit(term t, pair (S{t}, C{t}))

Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key, illustrating the in-mapper combining design pattern. Only the mapper is shown here; the reducer is the same as in Figure 3.6.

and one. The reducer would still arrive at the correct sum and count, and hence the

mean would be correct. Now add in the combiners: the algorithm would remain correct,

no matter how many times they run, since the combiners merely aggregate partial sums

and counts to pass along to the reducers. Note that although the output key-value type

of the combiner must be the same as the input key-value type of the reducer, the reducer

can emit ﬁnal key-value pairs of a diﬀerent type.

Finally, in Figure 3.7, we present an even more eﬃcient algorithm that exploits the

in-mapper combining pattern. Inside the mapper, the partial sums and counts associated

with each string are held in memory across input key-value pairs. Intermediate key-value

pairs are emitted only after the entire input split has been processed; similar to before,

the value is a pair consisting of the sum and count. The reducer is exactly the same as

in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper

is subject to all the tradeoffs and caveats discussed earlier in this section, but in this

case the memory footprint of the data structures for holding intermediate data is likely

to be modest, making this variant algorithm an attractive option.
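As with word count, the in-mapper combining variant of Figure 3.7 can be simulated in ordinary Python (method names mirror the API hooks; the driver code is our own illustration):

```python
from collections import defaultdict

class MeanMapper:
    """Sketch of Figure 3.7: partial sums (S) and counts (C) are held in
    memory across input key-value pairs and emitted only in close()."""

    def initialize(self):
        self.S = defaultdict(int)  # key -> partial sum
        self.C = defaultdict(int)  # key -> partial count

    def map(self, t, r):
        self.S[t] += r
        self.C[t] += 1

    def close(self):
        # emit one (sum, count) pair per key seen in this input split
        return {t: (self.S[t], self.C[t]) for t in self.S}

m = MeanMapper()
m.initialize()
for t, r in [("u1", 4), ("u2", 10), ("u1", 6)]:  # e.g., per-session times
    m.map(t, r)
partials = m.close()   # the reducer of Figure 3.6 finishes the job
```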

3.2 PAIRS AND STRIPES

One common approach for synchronization in MapReduce is to construct complex keys

and values in such a way that data necessary for a computation are naturally brought

together by the execution framework. We ﬁrst touched on this technique in the previous

section, in the context of “packaging” partial sums and counts in a complex value

(i.e., pair) that is passed from mapper to combiner to reducer. Building on previously


published work [54, 94], this section introduces two common design patterns we have

dubbed “pairs” and “stripes” that exemplify this strategy.

As a running example, we focus on the problem of building word co-occurrence

matrices from large corpora, a common task in corpus linguistics and statistical natural

language processing. Formally, the co-occurrence matrix of a corpus is a square n × n matrix where n is the number of unique words in the corpus (i.e., the vocabulary size). A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context—a natural unit such as a sentence, paragraph, or a document, or a certain window of m words (where m is an application-dependent parameter). Note that the upper and lower triangles of the matrix are identical since co-occurrence is a symmetric relation, though in the general case relations between words need not be symmetric. For example, a co-occurrence matrix M where m_ij is the count of how many times word i was immediately succeeded by word j would usually not be symmetric.

This task is quite common in text processing and provides the starting point to

many other algorithms, e.g., for computing statistics such as pointwise mutual infor-

mation [38], for unsupervised sense clustering [136], and more generally, a large body

of work in lexical semantics based on distributional proﬁles of words, dating back to

Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in in-

formation retrieval (e.g., automatic thesaurus construction [137] and stemming [157]),

and other related ﬁelds such as text mining. More importantly, this problem represents

a speciﬁc instance of the task of estimating distributions of discrete joint events from a

large number of observations, a very common task in statistical natural language pro-

cessing for which there are nice MapReduce solutions. Indeed, concepts presented here

are also used in Chapter 6 when we discuss expectation-maximization algorithms.

Beyond text processing, problems in many application domains share similar char-

acteristics. For example, a large retailer might analyze point-of-sale transaction records

to identify correlated product purchases (e.g., customers who buy this tend to also buy

that), which would assist in inventory management and product placement on store

shelves. Similarly, an intelligence analyst might wish to identify associations between

re-occurring ﬁnancial transactions that are otherwise unrelated, which might provide a

clue in thwarting terrorist activity. The algorithms discussed in this section could be

adapted to tackle these related problems.

It is obvious that the space requirement for the word co-occurrence problem is

O(n²), where n is the size of the vocabulary, which for real-world English corpora can be hundreds of thousands of words, or even billions of words in web-scale collections.[7]

The computation of the word co-occurrence matrix is quite simple if the entire matrix

[7] The size of the vocabulary depends on the definition of a “word” and techniques (if any) for corpus preprocessing. One common strategy is to replace all rare words (below a certain frequency) with a “special” token such as <UNK> (which stands for “unknown”) to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both map to the same token (#.##).


ﬁts into memory—however, in the case where the matrix is too big to ﬁt in memory,

a naïve implementation on a single machine can be very slow as memory is paged to

disk. Although compression techniques can increase the size of corpora for which word

co-occurrence matrices can be constructed on a single machine, it is clear that there are

inherent scalability limitations. We describe two MapReduce algorithms for this task

that can scale to large corpora.

Pseudo-code for the ﬁrst algorithm, dubbed the “pairs” approach, is shown in

Figure 3.8. As usual, document ids and the corresponding contents make up the input

key-value pairs. The mapper processes each input document and emits intermediate

key-value pairs with each co-occurring word pair as the key and the integer one (i.e.,

the count) as the value. This is straightforwardly accomplished by two nested loops:

the outer loop iterates over all words (the left element in the pair), and the inner

loop iterates over all neighbors of the ﬁrst word (the right element in the pair). The

neighbors of a word can either be deﬁned in terms of a sliding window or some other

contextual unit such as a sentence. The MapReduce execution framework guarantees

that all values associated with the same key are brought together in the reducer. Thus,

in this case the reducer simply sums up all the values associated with the same co-

occurring word pair to arrive at the absolute count of the joint event in the corpus,

which is then emitted as the ﬁnal key-value pair. Each pair corresponds to a cell in the

word co-occurrence matrix. This algorithm illustrates the use of complex keys in order

to coordinate distributed computations.

An alternative approach, dubbed the “stripes” approach, is presented in Fig-

ure 3.9. Like the pairs approach, co-occurring word pairs are generated by two nested

loops. However, the major diﬀerence is that instead of emitting intermediate key-value

pairs for each co-occurring word pair, co-occurrence information is ﬁrst stored in an

associative array, denoted H. The mapper emits key-value pairs with words as keys

and corresponding associative arrays as values, where each associative array encodes

the co-occurrence counts of the neighbors of a particular word (i.e., its context). The

MapReduce execution framework guarantees that all associative arrays with the same

key will be brought together in the reduce phase of processing. The reducer performs an

element-wise sum of all associative arrays with the same key, accumulating counts that

correspond to the same cell in the co-occurrence matrix. The ﬁnal associative array is

emitted with the same word as the key. In contrast to the pairs approach, each ﬁnal

key-value pair encodes a row in the co-occurrence matrix.
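To make the contrast concrete, both mappers (and the stripes reducer) can be sketched in Python. The neighbor definition below is a symmetric window of two words on either side; the function names and the window parameter are our own illustration, and a real implementation would emit through the framework rather than return values.

```python
from collections import defaultdict

def neighbors(doc, i, window=2):
    # terms within `window` positions of doc[i], excluding doc[i] itself
    return doc[max(0, i - window):i] + doc[i + 1:i + 1 + window]

def pairs_mapper(doc):
    # "pairs": one ((w, u), 1) key-value pair per observed co-occurrence
    return [((w, u), 1) for i, w in enumerate(doc) for u in neighbors(doc, i)]

def stripes_mapper(doc):
    # "stripes": one (w, {u: count}) pair per term; the associative array
    # already aggregates co-occurrences within the document
    stripes = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(doc):
        for u in neighbors(doc, i):
            stripes[w][u] += 1
    return {w: dict(h) for w, h in stripes.items()}

def stripes_reducer(stripe_list):
    # element-wise sum of associative arrays sharing the same key
    total = defaultdict(int)
    for h in stripe_list:
        for u, c in h.items():
            total[u] += c
    return dict(total)

doc = ["a", "b", "a", "c"]
p = pairs_mapper(doc)     # ten key-value pairs
s = stripes_mapper(doc)   # three key-value pairs, one per distinct term
```

Even for this tiny document the pairs mapper emits more than three times as many key-value pairs as the stripes mapper, which is precisely the redundancy the combiners must later undo.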

It is immediately obvious that the pairs algorithm generates an immense number

of key-value pairs compared to the stripes approach. The stripes representation is much

more compact, since with pairs the left element is repeated for every co-occurring word

pair. The stripes approach also generates fewer and shorter intermediate keys, and

therefore the execution framework has less sorting to perform. However, values in the


1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term w ∈ doc d do
4:       for all term u ∈ Neighbors(w) do
5:         Emit(pair (w, u), count 1)             ▷ Emit count for each co-occurrence

1: class Reducer
2:   method Reduce(pair p, counts [c1, c2, . . .])
3:     s ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       s ← s + c                                ▷ Sum co-occurrence counts
6:     Emit(pair p, count s)

Figure 3.8: Pseudo-code for the “pairs” approach for computing word co-occurrence matrices from large corpora.

1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term w ∈ doc d do
4:       H ← new AssociativeArray
5:       for all term u ∈ Neighbors(w) do
6:         H{u} ← H{u} + 1                        ▷ Tally words co-occurring with w
7:       Emit(Term w, Stripe H)

1: class Reducer
2:   method Reduce(term w, stripes [H1, H2, H3, . . .])
3:     Hf ← new AssociativeArray
4:     for all stripe H ∈ stripes [H1, H2, H3, . . .] do
5:       Sum(Hf, H)                               ▷ Element-wise sum
6:     Emit(term w, stripe Hf)

Figure 3.9: Pseudo-code for the “stripes” approach for computing word co-occurrence matrices from large corpora.


stripes approach are more complex, and come with more serialization and deserialization

overhead than with the pairs approach.

Both algorithms can beneﬁt from the use of combiners, since the respective oper-

ations in their reducers (addition and element-wise sum of associative arrays) are both

commutative and associative. However, combiners with the stripes approach have more

opportunities to perform local aggregation because the key space is the vocabulary—

associative arrays can be merged whenever a word is encountered multiple times by

a mapper. In contrast, the key space in the pairs approach is the cross of the vocab-

ulary with itself, which is far larger—counts can be aggregated only when the same

co-occurring word pair is observed multiple times by an individual mapper (which is

less likely than observing multiple occurrences of a word, as in the stripes case).

For both algorithms, the in-mapper combining optimization discussed in the pre-

vious section can also be applied; the modiﬁcation is suﬃciently straightforward that

we leave the implementation as an exercise for the reader. However, the above caveats

remain: there will be far fewer opportunities for partial aggregation in the pairs ap-

proach due to the sparsity of the intermediate key space. The sparsity of the key space

also limits the eﬀectiveness of in-memory combining, since the mapper may run out of

memory to store partial counts before all documents are processed, necessitating some

mechanism to periodically emit key-value pairs (which further limits opportunities to

perform partial aggregation). Similarly, for the stripes approach, memory management

will also be more complex than in the simple word count example. For common terms,

the associative array may grow to be quite large, necessitating some mechanism to

periodically ﬂush in-memory structures.

It is important to consider potential scalability bottlenecks of either algorithm.

The stripes approach makes the assumption that, at any point in time, each associative

array is small enough to ﬁt into memory—otherwise, memory paging will signiﬁcantly

impact performance. The size of the associative array is bounded by the vocabulary size,

which is itself unbounded with respect to corpus size (recall the previous discussion of

Heaps’ Law). Therefore, as the sizes of corpora increase, this will become an increasingly

pressing issue—perhaps not for gigabyte-sized corpora, but certainly for terabyte-sized

and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach,

on the other hand, does not suﬀer from this limitation, since it does not need to hold

intermediate data in memory.

Given this discussion, which approach is faster? Here, we present previously-

published results [94] that empirically answered this question. We have implemented

both algorithms in Hadoop and applied them to a corpus of 2.27 million documents

from the Associated Press Worldstream (APW) totaling 5.7 GB.[8] Prior to working

[8] This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog number LDC2007T07).

with Hadoop, the corpus was first preprocessed as follows: All XML markup was removed, followed by tokenization and stopword removal using standard tools from the

Lucene search engine. All tokens were then replaced with unique integers for a more

eﬃcient encoding. Figure 3.10 compares the running time of the pairs and stripes ap-

proach on diﬀerent fractions of the corpus, with a co-occurrence window size of two.

These experiments were performed on a Hadoop cluster with 19 slave nodes, each with

two single-core processors and two disks.

Results demonstrate that the stripes approach is much faster than the pairs ap-

proach: 666 seconds (∼11 minutes) compared to 3758 seconds (∼62 minutes) for the

entire corpus (improvement by a factor of 5.7). The mappers in the pairs approach gen-

erated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners,

this was reduced to 1.1 billion key-value pairs, which quantiﬁes the amount of interme-

diate data transferred across the network. In the end, the reducers emitted a total of 142

million ﬁnal key-value pairs (the number of non-zero cells in the co-occurrence matrix).

On the other hand, the mappers in the stripes approach generated 653 million interme-

diate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value

pairs remained. The reducers emitted a total of 1.69 million ﬁnal key-value pairs (the

number of rows in the co-occurrence matrix). As expected, the stripes approach pro-

vided more opportunities for combiners to aggregate intermediate results, thus greatly

reducing network traﬃc in the shuﬄe and sort phase. Figure 3.10 also shows that both

algorithms exhibit highly desirable scaling characteristics—linear in the amount of in-

put data. This is conﬁrmed by a linear regression applied to the running time data,

which yields an R² value close to one.

An additional series of experiments explored the scalability of the stripes approach

along another dimension: the size of the cluster. These experiments were made possible

by Amazon’s EC2 service, which allows users to rapidly provision clusters of varying

sizes for limited durations (for more information, refer back to our discussion of utility

computing in Section 1.1). Virtualized computational units in EC2 are called instances,

and the user is charged only for the instance-hours consumed. Figure 3.11 (left) shows

the running time of the stripes algorithm (on the same corpus, with same setup as

before), on varying cluster sizes, from 20 slave “small” instances all the way up to 80

slave “small” instances (along the x-axis). Running times are shown with solid squares.

Figure 3.11 (right) recasts the same results to illustrate scaling characteristics. The

circles plot the relative size and speedup of the EC2 experiments, with respect to the

20-instance cluster. These results show highly desirable linear scaling characteristics

(i.e., doubling the cluster size makes the job twice as fast). This is conﬁrmed by a linear

regression with an R² value close to one.

Viewed abstractly, the pairs and stripes algorithms represent two diﬀerent ap-

proaches to counting co-occurring events from a large number of observations. This

[Figure 3.10 plot: running time (seconds) versus percentage of the APW corpus for the “stripes” and “pairs” approaches; the two linear fits give R² = 0.992 and R² = 0.999.]

Figure 3.10: Running time of the “pairs” and “stripes” algorithms for computing word co-occurrence matrices on different fractions of the APW corpus. These experiments were performed on a Hadoop cluster with 19 slaves, each with two single-core processors and two disks.

[Figure 3.11 plots. Left: running time (seconds) versus size of EC2 cluster (number of slave instances). Right: relative speedup versus relative size of EC2 cluster (1x to 4x), with a linear fit of R² = 0.997.]

Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing Hadoop cluster size (right).


general description captures the gist of many algorithms in ﬁelds as diverse as text

processing, data mining, and bioinformatics. For this reason, these two design patterns

are broadly useful and frequently observed in a variety of applications.

To conclude, it is worth noting that the pairs and stripes approaches represent

endpoints along a continuum of possibilities. The pairs approach individually records

each co-occurring event, while the stripes approach records all co-occurring events

with respect to a conditioning event. A middle ground might be to record a subset of the co-occurring events with respect to a conditioning event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words co-occurring with w_i would be divided into b smaller “sub-stripes”, associated with b separate keys, (w_i, 1), (w_i, 2) . . . (w_i, b). This would be a reasonable solution to the memory limitations of the stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach. In the case of b = 1, this is equivalent to the standard stripes approach.
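This bucketing scheme is easy to illustrate outside of MapReduce. The following Python sketch (our own illustration; the toy hash function and all names are hypothetical, not from the book) splits a stripe into b sub-stripes keyed by (w_i, bucket), and checks that no counts are lost and that b = 1 degenerates to the standard stripe:

```python
# Split a stripe (associative array of co-occurrence counts for one
# conditioning word) into b sub-stripes by hashing each co-occurring word
# into one of b buckets; the bucket id joins the intermediate key.

def bucket(word, b):
    # Deterministic toy hash; Hadoop would use the key's hashCode.
    return sum(ord(c) for c in word) % b

def sub_stripes(w_i, stripe, b):
    """Return {(w_i, bucket): partial stripe} for the given stripe."""
    out = {}
    for w_j, count in stripe.items():
        out.setdefault((w_i, bucket(w_j, b)), {})[w_j] = count
    return out

stripe = {"aardvark": 3, "aardwolf": 1, "zebra": 5}
parts = sub_stripes("dog", stripe, b=2)

# b = 1 recovers the standard stripes approach: one key, one full stripe.
assert sub_stripes("dog", stripe, b=1) == {("dog", 0): stripe}
# Splitting loses no counts: merging the sub-stripes restores the stripe.
merged = {w: c for part in parts.values() for w, c in part.items()}
assert merged == stripe
```

With b = |V| each sub-stripe would hold at most one word, mirroring the pairs approach, exactly as the text describes.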

3.3 COMPUTING RELATIVE FREQUENCIES

Let us build on the pairs and stripes algorithms presented in the previous section and

continue with our running example of constructing the word co-occurrence matrix M

for a large corpus. Recall that in this large square n × n matrix, where n = |V| (the vocabulary size), cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context. The drawback of absolute counts is that they do not take into account the fact that some words appear more frequently than others. Word w_i may co-occur frequently with w_j simply because one of the words is very common. A simple remedy is to convert absolute counts into relative frequencies, f(w_j | w_i). That is, what proportion of the time does w_j appear in the context of w_i? This can be computed using the following equation:

f(w_j | w_i) = N(w_i, w_j) / Σ_{w'} N(w_i, w')        (3.1)

Here, N(·, ·) indicates the number of times a particular co-occurring word pair is ob-

served in the corpus. We need the count of the joint event (word co-occurrence), divided

by what is known as the marginal (the sum of the counts of the conditioning variable

co-occurring with anything else).

Computing relative frequencies with the stripes approach is straightforward. In

the reducer, counts of all words that co-occur with the conditioning variable (w_i in the above example) are available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal (i.e., Σ_{w'} N(w_i, w')), and then divide all the joint counts by the marginal to arrive at the relative frequency for all words. This implementation requires minimal modification to the original stripes algorithm in Figure 3.9, and

illustrates the use of complex data structures to coordinate distributed computations

in MapReduce. Through appropriate structuring of keys and values, one can use the

MapReduce execution framework to bring together all the pieces of data required to

perform a computation. Note that, as before, this algorithm also assumes that

each associative array ﬁts into memory.
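As a rough sketch (our own Python rendering, not the book's pseudo-code), the modified stripes reducer sums the partial stripes element-wise, totals the resulting associative array to obtain the marginal, and divides:

```python
# Modified stripes reducer: element-wise sum of partial stripes, then
# normalize by the marginal (the sum over the whole associative array).

def reduce_stripe(w_i, partial_stripes):
    counts = {}
    for stripe in partial_stripes:
        for w_j, n in stripe.items():
            counts[w_j] = counts.get(w_j, 0) + n
    marginal = sum(counts.values())
    # Emit relative frequencies f(w_j | w_i) instead of raw counts.
    return {w_j: n / marginal for w_j, n in counts.items()}

freqs = reduce_stripe("dog", [{"aardvark": 2, "zebra": 1},
                              {"aardvark": 1, "zebra": 4}])
assert freqs == {"aardvark": 3 / 8, "zebra": 5 / 8}
```

As in the original algorithm, the whole associative array must fit in memory.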

How might one compute relative frequencies with the pairs approach? In the pairs

approach, the reducer receives (w_i, w_j) as the key and the count as the value. From this alone it is not possible to compute f(w_j | w_i) since we do not have the marginal. Fortunately, as in the mapper, the reducer can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that co-occur with w_i and their counts, in essence building the associative array in the stripes approach. To make this work, we must define the sort order of the pair so that keys are first sorted by the left word, and then by the right word. Given this ordering, we can easily detect if all pairs associated with the word we are conditioning on (w_i) have been encountered. At that

point we can go back through the in-memory buﬀer, compute the relative frequencies,

and then emit those results in the ﬁnal key-value pairs.

There is one more modiﬁcation necessary to make this algorithm work. We must

ensure that all pairs with the same left word are sent to the same reducer. This, unfor-

tunately, does not happen automatically: recall that the default partitioner is based on

the hash value of the intermediate key, modulo the number of reducers. For a complex

key, the raw byte representation is used to compute the hash value. As a result, there

is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are assigned to the

same reducer. To produce the desired behavior, we must deﬁne a custom partitioner

that only pays attention to the left word. That is, the partitioner should partition based

on the hash of the left word only.

This algorithm will indeed work, but it suﬀers from the same drawback as the

stripes approach: as the size of the corpus grows, so does the vocabulary size, and at

some point there will not be suﬃcient memory to store all co-occurring words and their

counts for the word we are conditioning on. For computing the co-occurrence matrix, the

advantage of the pairs approach is that it doesn’t suﬀer from any memory bottlenecks.

Is there a way to modify the basic pairs approach so that this advantage is retained?

As it turns out, such an algorithm is indeed possible, although it requires the co-

ordination of several mechanisms in MapReduce. The insight lies in properly sequencing

data presented to the reducer. If it were possible to somehow compute (or otherwise

obtain access to) the marginal in the reducer before processing the joint counts, the

reducer could simply divide the joint counts by the marginal to compute the relative

frequencies. The notion of “before” and “after” can be captured in the ordering of

key-value pairs, which can be explicitly controlled by the programmer. That is, the

programmer can deﬁne the sort order of keys so that data needed earlier is presented

key                values
(dog, ∗)           [6327, 8514, . . .]    compute marginal: Σ_{w'} N(dog, w') = 42908
(dog, aardvark)    [2, 1]                 f(aardvark | dog) = 3/42908
(dog, aardwolf)    [1]                    f(aardwolf | dog) = 1/42908
. . .
(dog, zebra)       [2, 1, 1, 1]           f(zebra | dog) = 5/42908
(doge, ∗)          [682, . . .]           compute marginal: Σ_{w'} N(doge, w') = 1267
. . .

Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm for computing relative frequencies. This illustrates the application of the order inversion design pattern.

to the reducer before data that is needed later. However, we still need to compute the

marginal counts. Recall that in the basic pairs algorithm, each mapper emits a key-

value pair with the co-occurring word pair as the key. To compute relative frequencies,

we modify the mapper so that it additionally emits a “special” key of the form (w_i, ∗), with a value of one, that represents the contribution of the word pair to the marginal.

Through use of combiners, these partial marginal counts will be aggregated before be-

ing sent to the reducers. Alternatively, the in-mapper combining pattern can be used

to even more eﬃciently aggregate marginal counts.

In the reducer, we must make sure that the special key-value pairs representing

the partial marginal contributions are processed before the normal key-value pairs rep-

resenting the joint counts. This is accomplished by deﬁning the sort order of the keys

so that pairs with the special symbol of the form (w_i, ∗) are ordered before any other key-value pairs where the left word is w_i. In addition, as before, we must also properly define the partitioner to pay attention to only the left word in each pair. With the

data properly sequenced, the reducer can directly compute the relative frequencies.

A concrete example is shown in Figure 3.12, which lists the sequence of key-value

pairs that a reducer might encounter. First, the reducer is presented with the special key

(dog, ∗) and a number of values, each of which represents a partial marginal contribution

from the map phase (assume here either combiners or in-mapper combining, so the

values represent partially aggregated counts). The reducer accumulates these counts to

arrive at the marginal, Σ_{w'} N(dog, w'). The reducer holds on to this value as it processes

subsequent keys. After (dog, ∗), the reducer will encounter a series of keys representing

joint counts; let’s say the ﬁrst of these is the key (dog, aardvark). Associated with this

key will be a list of values representing partial joint counts from the map phase (two

separate values in this case). Summing these counts will yield the ﬁnal joint count, i.e.,

the number of times dog and aardvark co-occur in the entire collection. At this point,


since the reducer already knows the marginal, simple arithmetic suﬃces to compute

the relative frequency. All subsequent joint counts are processed in exactly the same

manner. When the reducer encounters the next special key-value pair (doge, ∗), the

reducer resets its internal state and starts to accumulate the marginal all over again.

Observe that the memory requirement for this algorithm is minimal, since only the

marginal (an integer) needs to be stored. No buﬀering of individual co-occurring word

counts is necessary, and therefore we have eliminated the scalability bottleneck of the

previous algorithm.

This design pattern, which we call “order inversion”, occurs surprisingly often

and across applications in many domains. It is so named because through proper co-

ordination, we can access the result of a computation in the reducer (for example, an

aggregate statistic) before processing the data needed for that computation. The key

insight is to convert the sequencing of computations into a sorting problem. In most

cases, an algorithm requires data in some ﬁxed order: by controlling how keys are sorted

and how the key space is partitioned, we can present data to the reducer in the order

necessary to perform the proper computations. This greatly cuts down on the amount

of partial results that the reducer needs to hold in memory.

To summarize, the speciﬁc application of the order inversion design pattern for

computing relative frequencies requires the following:

• Emitting a special key-value pair for each co-occurring word pair in the mapper

to capture its contribution to the marginal.

• Controlling the sort order of the intermediate key so that the key-value pairs

representing the marginal contributions are processed by the reducer before any

of the pairs representing the joint word co-occurrence counts.

• Deﬁning a custom partitioner to ensure that all pairs with the same left word are

shuﬄed to the same reducer.

• Preserving state across multiple keys in the reducer to first compute the marginal based on the special key-value pairs and then divide the joint counts by the marginals to arrive at the relative frequencies.

As we will see in Chapter 4, this design pattern is also used in inverted index construc-

tion to properly set compression parameters for postings lists.
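These four requirements can be simulated compactly in Python (a sketch of the pattern, not Hadoop code: a single in-process “reducer” makes the custom partitioner implicit, and a dictionary stands in for the combiners):

```python
from collections import defaultdict

# Mapper: emit the joint pair plus a special (left, '*') pair whose counts
# feed the marginal.
def mapper(left, right):
    return [((left, right), 1), ((left, "*"), 1)]

counts = defaultdict(int)  # stands in for combining and the shuffle
for left, right in [("dog", "aardvark"), ("dog", "zebra"),
                    ("dog", "zebra"), ("doge", "cat")]:
    for key, value in mapper(left, right):
        counts[key] += value

# Custom sort order: within each left word, the special '*' key comes first,
# so the reducer sees the marginal before any joint count.
ordered = sorted(counts, key=lambda k: (k[0], k[1] != "*", k[1]))

freqs, marginal = {}, None
for left, right in ordered:
    if right == "*":
        marginal = counts[(left, right)]  # reset state for a new left word
    else:
        freqs[(left, right)] = counts[(left, right)] / marginal

assert freqs[("dog", "zebra")] == 2 / 3
assert freqs[("doge", "cat")] == 1.0
```

Only the running marginal (a single number) is held between keys, which is the memory advantage the text describes.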

3.4 SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shuﬄe and sort

phase, which is very convenient if computations inside the reducer rely on sort order

(e.g., the order inversion design pattern described in the previous section). However,


what if in addition to sorting by key, we also need to sort by value? Google’s MapReduce

implementation provides built-in functionality for (optional) secondary sorting, which

guarantees that values arrive in sorted order. Hadoop, unfortunately, does not have this

capability built in.

Consider the example of sensor data from a scientiﬁc experiment: there are m

sensors each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where r_x after each timestamp represents the actual sensor readings (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images).

(t_1, m_1, r_80521)
(t_1, m_2, r_14209)
(t_1, m_3, r_76042)
...
(t_2, m_1, r_21823)
(t_2, m_2, r_66508)
(t_2, m_3, r_98347)

Suppose we wish to reconstruct the activity at each individual sensor over time. A

MapReduce program to accomplish this might map over the raw data and emit the

sensor id as the intermediate key, with the rest of each record as the value:

m_1 → (t_1, r_80521)

This would bring all readings from the same sensor together in the reducer. However,

since MapReduce makes no guarantees about the ordering of values associated with the

same key, the sensor readings will not likely be in temporal order. The most obvious

solution is to buﬀer all the readings in memory and then sort by timestamp before

additional processing. However, it should be apparent by now that any in-memory

buﬀering of data introduces a potential scalability bottleneck. What if we are working

with a high frequency sensor or sensor readings over a long period of time? What if the

sensor readings themselves are large complex objects? This approach may not scale in

these cases—the reducer would run out of memory trying to buﬀer all values associated

with the same key.

This is a common problem, since in many applications we wish to ﬁrst group

together data one way (e.g., by sensor id), and then sort within the groupings another

way (e.g., by time). Fortunately, there is a general purpose solution, which we call the

“value-to-key conversion” design pattern. The basic idea is to move part of the value

into the intermediate key to form a composite key, and let the MapReduce execution

framework handle the sorting. In the above example, instead of emitting the sensor id

as the key, we would emit the sensor id and the timestamp as a composite key:

(m_1, t_1) → (r_80521)


The sensor reading itself now occupies the value. We must deﬁne the intermediate key

sort order to ﬁrst sort by the sensor id (the left element in the pair) and then by the

timestamp (the right element in the pair). We must also implement a custom partitioner

so that all pairs associated with the same sensor are shuﬄed to the same reducer.

Properly orchestrated, the key-value pairs will be presented to the reducer in the

correct sorted order:

(m_1, t_1) → [(r_80521)]
(m_1, t_2) → [(r_21823)]
(m_1, t_3) → [(r_146925)]
. . .

However, note that sensor readings are now split across multiple keys. The reducer will

need to preserve state and keep track of when readings associated with the current

sensor end and the next sensor begin.^9

The basic tradeoﬀ between the two approaches discussed above (buﬀer and in-

memory sort vs. value-to-key conversion) is where sorting is performed. One can explic-

itly implement secondary sorting in the reducer, which is likely to be faster but suﬀers

from a scalability bottleneck.^10

With value-to-key conversion, sorting is oﬄoaded to the

MapReduce execution framework. Note that this approach can be arbitrarily extended

to tertiary, quaternary, etc. sorting. This pattern results in many more keys for the

framework to sort, but distributed sorting is a task that the MapReduce runtime excels

at since it lies at the heart of the programming model.
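The sensor example can be sketched end to end in Python (our own illustration; sorting a list stands in for the shuffle-and-sort phase, and the partitioner is implicit because everything lands in one process):

```python
# Value-to-key conversion: the timestamp moves from the value into a
# composite key (sensor_id, timestamp); the framework's sort then delivers
# readings in temporal order within each sensor.

records = [(2, "m1", "r21823"), (1, "m2", "r14209"),
           (1, "m1", "r80521"), (2, "m2", "r66508")]  # (t, sensor, reading)

intermediate = [((sensor, t), reading) for t, sensor, reading in records]
intermediate.sort(key=lambda kv: kv[0])  # sort by sensor id, then timestamp

# Reducer side: readings for each sensor arrive already sorted by time.
per_sensor = {}
for (sensor, t), reading in intermediate:
    per_sensor.setdefault(sensor, []).append(reading)

assert per_sensor["m1"] == ["r80521", "r21823"]
assert per_sensor["m2"] == ["r14209", "r66508"]
```

No reducer-side buffering is needed beyond tracking the current sensor id, which is the point of the pattern.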

^9 Alternatively, Hadoop provides API hooks to define “groups” of intermediate keys that should be processed together in the reducer.

^10 Note that, in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer, although one would be duplicating functionality that is already present in the MapReduce execution framework. It makes more sense to take advantage of functionality that is already present with value-to-key conversion.

3.5 RELATIONAL JOINS

One popular application of Hadoop is data-warehousing. In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions to product inventories. Typically, the data is relational in nature, but increasingly data warehouses are used to store semi-structured data (e.g., query logs) as well as unstructured data. Data warehouses form a foundation for business intelligence applications designed to provide decision support. It is widely believed that insights gained by mining historical, current, and prospective data can yield competitive advantages in the marketplace.

Traditionally, data warehouses have been implemented through relational databases, particularly those optimized for a specific workload known as online analytical processing (OLAP). A number of vendors offer parallel databases, but customers find that they often cannot cost-effectively scale to the crushing amounts of data an organization needs to deal with today. Parallel databases are often quite expensive—

on the order of tens of thousands of dollars per terabyte of user data. Over the past

few years, Hadoop has gained popularity as a platform for data-warehousing. Ham-

merbacher [68], for example, discussed Facebook’s experiences with scaling up business

intelligence applications with Oracle databases, which they ultimately abandoned in

favor of a Hadoop-based solution developed in-house called Hive (which is now an

open-source project). Pig [114] is a platform for massive data analytics built on Hadoop

and capable of handling structured as well as semi-structured data. It was originally

developed by Yahoo, but is now also an open-source project.

Given successful applications of Hadoop to data-warehousing and complex ana-

lytical queries that are prevalent in such an environment, it makes sense to examine

MapReduce algorithms for manipulating relational data. This section focuses specif-

ically on performing relational joins in MapReduce. We should stress here that even

though Hadoop has been applied to process relational data, Hadoop is not a database.

There is an ongoing debate between advocates of parallel databases and proponents

of MapReduce regarding the merits of both approaches for OLAP-type workloads. DeWitt and Stonebraker, two well-known figures in the database community, famously

decried MapReduce as “a major step backwards” in a controversial blog post.^11 With

colleagues, they ran a series of benchmarks that demonstrated the supposed superiority

of column-oriented parallel databases over Hadoop [120, 144]. However, see Dean and

Ghemawat’s counterarguments [47] and recent attempts at hybrid architectures [1].

We shall refrain here from participating in this lively debate, and instead focus on

discussing algorithms. From an application point of view, it is highly unlikely that an

analyst interacting with a data warehouse will ever be called upon to write MapReduce

programs (and indeed, Hadoop-based systems such as Hive and Pig present a much

higher-level language for interacting with large amounts of data). Nevertheless, it is

instructive to understand the algorithms that underlie basic relational operations.

This section presents three diﬀerent strategies for performing relational joins on

two datasets (relations), generically named S and T. Let us suppose that relation S

looks something like the following:

(k_1, s_1, S_1)
(k_2, s_2, S_2)
(k_3, s_3, S_3)
. . .

where k is the key we would like to join on, s_n is a unique id for the tuple, and the S_n after s_n denotes other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation T looks something like this:

(k_1, t_1, T_1)
(k_3, t_2, T_2)
(k_8, t_3, T_3)
. . .

where k is the join key, t_n is a unique id for the tuple, and the T_n after t_n denotes other attributes in the tuple.

^11 http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/

To make this task more concrete, we present one realistic scenario: S might rep-

resent a collection of user proﬁles, in which case k could be interpreted as the primary

key (i.e., user id). The tuples might contain demographic information such as age, gen-

der, income, etc. The other dataset, T, might represent logs of online activity. Each

tuple might correspond to a page view of a particular URL and may contain additional

information such as time spent on the page, ad revenue generated, etc. The k in these

tuples could be interpreted as the foreign key that associates each individual page view

with a user. Joining these two datasets would allow an analyst, for example, to break

down online activity in terms of demographics.

3.5.1 REDUCE-SIDE JOIN

The ﬁrst approach to relational joins is what’s known as a reduce-side join. The idea

is quite simple: we map over both datasets and emit the join key as the intermediate

key, and the tuple itself as the intermediate value. Since MapReduce guarantees that

all values with the same key are brought together, all tuples will be grouped by the

join key—which is exactly what we need to perform the join operation. This approach

is known as a parallel sort-merge join in the database community [134]. In more detail,

there are three diﬀerent cases to consider.

The ﬁrst and simplest is a one-to-one join, where at most one tuple from S and

one tuple from T share the same join key (but it may be the case that no tuple from

S shares the join key with a tuple from T, or vice versa). In this case, the algorithm

sketched above will work ﬁne. The reducer will be presented keys and lists of values

along the lines of the following:

k_23 → [(s_64, S_64), (t_84, T_84)]
k_37 → [(s_68, S_68)]
k_59 → [(t_97, T_97), (s_81, S_81)]
k_61 → [(t_99, T_99)]
. . .

Since we’ve emitted the join key as the intermediate key, we can remove it from the

value to save a bit of space (not very important if the intermediate data is compressed). If there are two values associated with a key, then we know that one must be from S and the other must be from T. However, recall that in the

basic MapReduce programming model, no guarantees are made about value ordering,

so the ﬁrst value might be from S or from T. We can proceed to join the two tuples

and perform additional computations (e.g., ﬁlter by some other attribute, compute

aggregates, etc.). If there is only one value associated with a key, this means that no

tuple in the other dataset shares the join key, so the reducer does nothing.

Let us now consider the one-to-many join. Assume that tuples in S have unique

join keys (i.e., k is the primary key in S), so that S is the “one” and T is the “many”.

The above algorithm will still work, but when processing each key in the reducer, we

have no idea when the value corresponding to the tuple from S will be encountered, since

values are arbitrarily ordered. The easiest solution is to buﬀer all values in memory,

pick out the tuple from S, and then cross it with every tuple from T to perform the

join. However, as we have seen several times already, this creates a scalability bottleneck

since we may not have suﬃcient memory to hold all the tuples with the same join key.

This is a problem that requires a secondary sort, and the solution lies in the

value-to-key conversion design pattern we just presented. In the mapper, instead of

simply emitting the join key as the intermediate key, we instead create a composite key

consisting of the join key and the tuple id (from either S or T). Two additional changes

are required: First, we must deﬁne the sort order of the keys to ﬁrst sort by the join

key, and then sort all tuple ids from S before all tuple ids from T. Second, we must

deﬁne the partitioner to pay attention to only the join key, so that all composite keys

with the same join key arrive at the same reducer.

After applying the value-to-key conversion design pattern, the reducer will be

presented with keys and values along the lines of the following:

(k_82, s_105) → [(S_105)]
(k_82, t_98) → [(T_98)]
(k_82, t_101) → [(T_101)]
(k_82, t_137) → [(T_137)]
. . .

Since both the join key and the tuple id are present in the intermediate key, we can

remove them from the value to save a bit of space (once again, not very important if the intermediate data is compressed). Whenever the reducer encounters a new join key, it is guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new join key is encountered). Since the MapReduce execution framework performs the sorting, there is no need to buffer tuples (other than the single one from S). Thus, we have eliminated the scalability bottleneck.

Finally, let us consider the many-to-many join case. Assuming that S is the smaller dataset, the above algorithm works as well. Consider what happens at the reducer:

(k_82, s_105) → [(S_105)]
(k_82, s_124) → [(S_124)]
. . .
(k_82, t_98) → [(T_98)]
(k_82, t_101) → [(T_101)]
(k_82, t_137) → [(T_137)]
. . .

All the tuples from S with the same join key will be encountered ﬁrst, which the reducer

can buﬀer in memory. As the reducer processes each tuple from T, it is crossed with all

the tuples from S. Of course, we are assuming that the tuples from S (with the same

join key) will ﬁt into memory, which is a limitation of this algorithm (and why we want

to control the sort order so that the smaller dataset comes ﬁrst).
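The reduce-side join with value-to-key conversion can be simulated in a few lines of Python (our own names; a relation tag inside the composite key plays the role of sorting tuple ids from S before those from T, and partitioning on the join key is implicit because a single process sees all pairs):

```python
# Composite keys (join_key, tag, tuple_id): tag 0 sorts S-tuples before
# T-tuples (tag 1) that share the same join key.

S = [("k82", "s105", "user-profile")]
T = [("k82", "t98", "pageview-a"), ("k82", "t101", "pageview-b")]

intermediate = [((k, 0, sid), attrs) for k, sid, attrs in S]
intermediate += [((k, 1, tid), attrs) for k, tid, attrs in T]
intermediate.sort(key=lambda kv: kv[0])  # stands in for the shuffle's sort

joined, s_tuple = [], None
for (k, tag, tuple_id), attrs in intermediate:
    if tag == 0:
        s_tuple = (tuple_id, attrs)      # hold the single S tuple in memory
    else:
        joined.append((k, s_tuple, (tuple_id, attrs)))

assert len(joined) == 2
assert all(s == ("s105", "user-profile") for _, s, _ in joined)
```

Only the one S tuple per join key is buffered; the T tuples stream through, which is what eliminates the bottleneck.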

The basic idea behind the reduce-side join is to repartition the two datasets by

the join key. The approach isn’t particularly eﬃcient since it requires shuﬄing both

datasets across the network. This leads us to the map-side join.

3.5.2 MAP-SIDE JOIN

Suppose we have two datasets that are both sorted by the join key. We can perform a

join by scanning through both datasets simultaneously—this is known as a merge join

in the database community. We can parallelize this by partitioning and sorting both

datasets in the same way. For example, suppose S and T were both divided into ten

ﬁles, partitioned in the same manner by the join key. Further suppose that in each ﬁle,

the tuples were sorted by the join key. In this case, we simply need to merge join the

first file of S with the first file of T, the second file of S with the second file of T, etc.

This can be accomplished in parallel, in the map phase of a MapReduce job—hence, a

map-side join. In practice, we map over one of the datasets (the larger one) and inside

the mapper read the corresponding part of the other dataset to perform the merge join (note that this almost always implies a non-local read). No reducer is required, unless the programmer wishes to repartition the output or perform further processing.

A map-side join is far more efficient than a reduce-side join since there is no need to shuffle the datasets over the network. But is it realistic to expect that the stringent conditions required for map-side joins are satisfied? In many cases, yes. The reason is that relational joins happen within the broader context of a workflow, which may include multiple steps. Therefore, the datasets that are to be joined may be the output of previous processes (either MapReduce jobs or other code). If the workflow is known in advance and relatively static (both reasonable assumptions in a mature workflow), we can engineer the previous processes to generate output sorted and partitioned in a way that makes efficient map-side joins possible (in MapReduce, by using a custom partitioner and controlling the sort order of key-value pairs). For ad hoc data analysis,

reduce-side joins are a more general, albeit less eﬃcient, solution. Consider the case

where datasets have multiple keys that one might wish to join on—then no matter

how the data is organized, map-side joins will require repartitioning of the data. Al-

ternatively, it is always possible to repartition a dataset using an identity mapper and

reducer. But of course, this incurs the cost of shuﬄing data over the network.
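What each map task does in a map-side join is an ordinary merge join over the corresponding sorted partitions. A Python sketch (our illustration, assuming the one-to-many case where join keys in S are unique):

```python
# Merge join over two partitions that are sorted by the join key.
def merge_join(s_part, t_part):
    out, i, j = [], 0, 0
    while i < len(s_part) and j < len(t_part):
        ks, kt = s_part[i][0], t_part[j][0]
        if ks < kt:
            i += 1            # S tuple has no match
        elif ks > kt:
            j += 1            # T tuple has no match
        else:
            out.append((ks, s_part[i][1], t_part[j][1]))
            j += 1            # advance T only: S keys are assumed unique
    return out

S_part = [("k1", "S1"), ("k3", "S3"), ("k8", "S8")]
T_part = [("k1", "T1a"), ("k1", "T1b"), ("k8", "T8")]
assert merge_join(S_part, T_part) == [
    ("k1", "S1", "T1a"), ("k1", "S1", "T1b"), ("k8", "S8", "T8")]
```

Both inputs are consumed in a single sequential scan, which is why no shuffling is needed once the partitions line up.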

There is a final restriction to bear in mind when using map-side joins with the Hadoop implementation of MapReduce. We assume here that the datasets to be joined were produced by previous MapReduce jobs, so this restriction applies to keys the reducers in those jobs may emit. Hadoop permits reducers to emit keys that are different from the input key whose values they are processing (that is, input and output keys need not be the same, nor even the same type).[15] However, if the output key of a reducer is different from the input key, then the output dataset from the reducer will not necessarily be partitioned in a manner consistent with the specified partitioner (because the partitioner applies to the input keys rather than the output keys). Since map-side joins depend on consistent partitioning and sorting of keys, the reducers used to generate data that will participate in a later map-side join must not emit any key but the one they are currently processing.

3.5.3 MEMORY-BACKED JOIN

In addition to the two previous approaches to joining relational data that leverage the

MapReduce framework to bring together tuples that share a common join key, there is a

family of approaches we call memory-backed joins based on random access probes. The

simplest version is applicable when one of the two datasets completely ﬁts in memory

on each node. In this situation, we can load the smaller dataset into memory in every

mapper, populating an associative array to facilitate random access to tuples based on

the join key. The mapper initialization API hook (see Section 3.1.1) can be used for

this purpose. Mappers are then applied to the other (larger) dataset, and for each input

key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with

the same join key. If there is, the join is performed. This is known as a simple hash join

by the database community [51].
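To make this concrete, here is a minimal sketch of a simple hash join in plain Python rather than actual MapReduce code; the (join key, payload) tuple layout is an assumption made for illustration:

```python
# Sketch of a memory-backed ("simple hash") join. The smaller dataset S is
# loaded into an in-memory associative array keyed by the join key (mapper
# initialization); we then stream over the larger dataset T and probe.

def load_smaller(s_tuples):
    """Build join_key -> list of payloads from S."""
    table = {}
    for key, payload in s_tuples:
        table.setdefault(key, []).append(payload)
    return table

def hash_join(s_tuples, t_tuples):
    """Stream over T (the 'map' phase) and probe the in-memory table."""
    table = load_smaller(s_tuples)
    for key, t_payload in t_tuples:
        for s_payload in table.get(key, []):
            yield (key, s_payload, t_payload)

S = [(1, 'a'), (2, 'b')]
T = [(1, 'x'), (3, 'y'), (1, 'z')]
print(list(hash_join(S, T)))  # [(1, 'a', 'x'), (1, 'a', 'z')]
```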

What if neither dataset fits in memory? The simplest solution is to divide the smaller dataset, let's say S, into n partitions, such that S = S1 ∪ S2 ∪ . . . ∪ Sn. We can choose n so that each partition is small enough to fit in memory, and then run n memory-backed hash joins. This, of course, requires streaming through the other dataset n times.
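A rough illustration of this partitioning scheme, again in plain Python with a hypothetical (join key, payload) tuple layout, hashing the join key to assign tuples to partitions:

```python
# Sketch of the n-pass variant: partition S by hashing the join key so that
# each S_i fits in memory, then stream through T once per partition.

def partitioned_hash_join(s_tuples, t_tuples, n):
    results = []
    for i in range(n):
        # Load only the i-th partition of S into memory.
        table = {}
        for key, payload in s_tuples:
            if hash(key) % n == i:
                table.setdefault(key, []).append(payload)
        # One full pass over T per partition.
        for key, t_payload in t_tuples:
            for s_payload in table.get(key, []):
                results.append((key, s_payload, t_payload))
    return results

S = [(1, 'a'), (2, 'b'), (3, 'c')]
T = [(1, 'x'), (3, 'y')]
print(sorted(partitioned_hash_join(S, T, 2)))  # [(1, 'a', 'x'), (3, 'c', 'y')]
```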

15

In contrast, recall from Section 2.2 that in Google’s implementation, reducers’ output keys must be exactly

same as their input keys.

68 CHAPTER 3. MAPREDUCE ALGORITHM DESIGN

There is an alternative approach to memory-backed joins for cases where neither dataset fits into memory. A distributed key-value store can be used to hold one dataset in memory across multiple machines while mapping over the other. The mappers would then query this distributed key-value store in parallel and perform joins if the join keys match.[16] The open-source caching system memcached can be used for exactly this purpose, and therefore we've dubbed this approach memcached join. This approach is detailed in a technical report [95].

3.6 SUMMARY

This chapter provides a guide on the design of MapReduce algorithms. In particular,

we present a number of “design patterns” that capture eﬀective solutions to common

problems. In summary, they are:

• “In-mapper combining”, where the functionality of the combiner is moved into the

mapper. Instead of emitting intermediate output for every input key-value pair,

the mapper aggregates partial results across multiple input records and only emits

intermediate key-value pairs after some amount of local aggregation is performed.

• The related patterns “pairs” and “stripes” for keeping track of joint events from

a large number of observations. In the pairs approach, we keep track of each joint

event separately, whereas in the stripes approach we keep track of all events that

co-occur with the same event. Although the stripes approach is signiﬁcantly more

eﬃcient, it requires memory on the order of the size of the event space, which

presents a scalability bottleneck.

• “Order inversion”, where the main idea is to convert the sequencing of compu-

tations into a sorting problem. Through careful orchestration, we can send the

reducer the result of a computation (e.g., an aggregate statistic) before it encoun-

ters the data necessary to produce that computation.

• “Value-to-key conversion”, which provides a scalable solution for secondary sort-

ing. By moving part of the value into the key, we can exploit the MapReduce

execution framework itself for sorting.
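As a reminder of what the first of these patterns looks like in practice, here is a minimal word-count sketch of in-mapper combining in plain Python; the class methods stand in for the framework's API hooks:

```python
# Sketch of in-mapper combining for word count: partial counts are
# accumulated across all input records and emitted once, in the mapper's
# termination code, rather than once per word occurrence.

class CombiningMapper:
    def __init__(self):
        self.counts = {}          # state preserved across input records

    def map(self, docid, text):
        for term in text.split():
            self.counts[term] = self.counts.get(term, 0) + 1

    def close(self):
        # Emit aggregated (term, count) pairs at task termination.
        return sorted(self.counts.items())

m = CombiningMapper()
m.map(1, "one fish two fish")
m.map(2, "red fish blue fish")
print(m.close())  # [('blue', 1), ('fish', 4), ('one', 1), ('red', 1), ('two', 1)]
```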

Ultimately, controlling synchronization in the MapReduce programming model boils

down to eﬀective use of the following techniques:

1. Constructing complex keys and values that bring together data necessary for a

computation. This is used in all of the above design patterns.

[16] In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests (to amortize latency over many requests) or to rely on asynchronous requests.

2. Executing user-speciﬁed initialization and termination code in either the mapper

or reducer. For example, in-mapper combining depends on emission of intermediate

key-value pairs in the map task termination code.

3. Preserving state across multiple inputs in the mapper and reducer. This is used

in in-mapper combining, order inversion, and value-to-key conversion.

4. Controlling the sort order of intermediate keys. This is used in order inversion and

value-to-key conversion.

5. Controlling the partitioning of the intermediate key space. This is used in order

inversion and value-to-key conversion.

This concludes our overview of MapReduce algorithm design. It should be clear by now

that although the programming model forces one to express algorithms in terms of a

small set of rigidly-deﬁned components, there are many tools at one’s disposal to shape

the ﬂow of computation. In the next few chapters, we will focus on speciﬁc classes

of MapReduce algorithms: for inverted indexing in Chapter 4, for graph processing in

Chapter 5, and for expectation-maximization in Chapter 6.

CHAPTER 4

Inverted Indexing for Text Retrieval

Web search is the quintessential large-data problem. Given an information need ex-

pressed as a short query consisting of a few terms, the system’s task is to retrieve

relevant web objects (web pages, PDF documents, PowerPoint slides, etc.) and present

them to the user. How large is the web? It is diﬃcult to compute exactly, but even a

conservative estimate would place the size at several tens of billions of pages, totaling

hundreds of terabytes (considering text alone). In real-world applications, users demand

results quickly from a search engine—query latencies longer than a few hundred mil-

liseconds will try a user’s patience. Fulﬁlling these requirements is quite an engineering

feat, considering the amounts of data involved!

Nearly all retrieval engines for full-text search today rely on a data structure

called an inverted index, which given a term provides access to the list of documents

that contain the term. In information retrieval parlance, objects to be retrieved are

generically called “documents” even though in actuality they may be web pages, PDFs,

or even fragments of code. Given a user query, the retrieval engine uses the inverted

index to score documents that contain the query terms with respect to some ranking

model, taking into account features such as term matches, term proximity, attributes

of the terms in the document (e.g., bold, appears in title, etc.), as well as the hyperlink

structure of the documents (e.g., PageRank [117], which we’ll discuss in Chapter 5, or

related metrics such as HITS [84] and SALSA [88]).

The web search problem decomposes into three components: gathering web con-

tent (crawling), construction of the inverted index (indexing) and ranking documents

given a query (retrieval). Crawling and indexing share similar characteristics and re-

quirements, but these are very diﬀerent from retrieval. Gathering web content and

building inverted indexes are for the most part oﬄine problems. Both need to be scal-

able and eﬃcient, but they do not need to operate in real time. Indexing is usually a

batch process that runs periodically: the frequency of refreshes and updates is usually

dependent on the design of the crawler. Some sites (e.g., news organizations) update

their content quite frequently and need to be visited often; other sites (e.g., government

regulations) are relatively static. However, even for rapidly changing sites, it is usually

tolerable to have a delay of a few minutes until content is searchable. Furthermore, since

the amount of content that changes rapidly is relatively small, running smaller-scale in-

dex updates at greater frequencies is usually an adequate solution.

1

Retrieval, on the

1

Leaving aside the problem of searching live data streams such a tweets, which requires diﬀerent techniques and

algorithms.

4.1. WEB CRAWLING 71

other hand, is an online problem that demands sub-second response time. Individual

users expect low query latencies, but query throughput is equally important since a

retrieval engine must usually serve many users concurrently. Furthermore, query loads

are highly variable, depending on the time of day, and can exhibit “spikey” behavior

due to special circumstances (e.g., a breaking news event triggers a large number of

searches on the same topic). On the other hand, resource consumption for the indexing

problem is more predictable.

A comprehensive treatment of web search is beyond the scope of this chapter,

and even this entire book. Explicitly recognizing this, we mostly focus on the problem

of inverted indexing, the task most amenable to solutions in MapReduce. This chapter

begins with an overview of web crawling (Section 4.1) and an introduction to the

basic structure of an inverted index (Section 4.2). A baseline inverted indexing algorithm

in MapReduce is presented in Section 4.3. We point out a scalability bottleneck in that

algorithm, which leads to a revised version presented in Section 4.4. Index compression

is discussed in Section 4.5, which ﬁlls in missing details on building compact index

structures. Since MapReduce is primarily designed for batch-oriented processing, it

does not provide an adequate solution for the retrieval problem, an issue we discuss in

Section 4.6. The chapter concludes with a summary and pointers to additional readings.

4.1 WEB CRAWLING

Before building inverted indexes, we must ﬁrst acquire the document collection over

which these indexes are to be built. In academia and for research purposes, this can

be relatively straightforward. Standard collections for information retrieval research are

widely available for a variety of genres ranging from blogs to newswire text. For re-

searchers who wish to explore web-scale retrieval, there is the ClueWeb09 collection

that contains one billion web pages in ten languages (totaling 25 terabytes) crawled by

Carnegie Mellon University in early 2009.[2] Obtaining access to these standard collections is usually as simple as signing an appropriate data license from the distributor of the collection, paying a reasonable fee, and arranging for receipt of the data.[3]

For real-world web search, however, one cannot simply assume that the collection is already available. Acquiring web content requires crawling, which is the process of traversing the web by repeatedly following hyperlinks and storing downloaded pages for subsequent processing. Conceptually, the process is quite simple to understand: we start by populating a queue with a "seed" list of pages. The crawler downloads pages in the queue, extracts links from those pages to add to the queue, stores the pages for further processing, and repeats. In fact, rudimentary web crawlers can be written in a few hundred lines of code.

[2] http://boston.lti.cs.cmu.edu/Data/clueweb09/

[3] As an interesting side note, in the 1990s, research collections were distributed via postal mail on CD-ROMs, and later, on DVDs. Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today are so large that the only practical method of distribution is shipping hard drives via postal mail.
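The crawl loop just described can be sketched in a few lines of Python; here a dictionary mapping pages to their outlinks stands in for actual page fetching:

```python
# Sketch of the crawl loop: a queue seeded with start pages; each
# "download" yields outlinks that are appended to the queue if unvisited.
from collections import deque

def crawl(seed, fetch):
    frontier = deque(seed)
    seen, stored = set(seed), []
    while frontier:
        url = frontier.popleft()
        links = fetch(url)          # download the page
        stored.append(url)          # store for later processing (indexing)
        for link in links:          # extract links, enqueue unvisited ones
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored

web = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': [], 'd': []}
print(crawl(['a'], web.get))  # ['a', 'b', 'c', 'd']
```

A real crawler replaces the dictionary lookup with rate-limited HTTP fetching and persists the frontier, but the queue discipline is the same.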

However, eﬀective and eﬃcient web crawling is far more complex. The following

lists a number of issues that real-world crawlers must contend with:

• A web crawler must practice good “etiquette” and not overload web servers. For

example, it is common practice to wait a ﬁxed amount of time before repeated

requests to the same server. In order to respect these constraints while maintaining

good throughput, a crawler typically keeps many execution threads running in

parallel and maintains many TCP connections (perhaps hundreds) open at the

same time.

• Since a crawler has ﬁnite bandwidth and resources, it must prioritize the order in

which unvisited pages are downloaded. Such decisions must be made online and

in an adversarial environment, in the sense that spammers actively create “link

farms” and “spider traps” full of spam pages to trick a crawler into overrepresent-

ing content from a particular site.

• Most real-world web crawlers are distributed systems that run on clusters of ma-

chines, often geographically distributed. To avoid downloading a page multiple

times and to ensure data consistency, the crawler as a whole needs mechanisms

for coordination and load-balancing. It also needs to be robust with respect to

machine failures, network outages, and errors of various types.

• Web content changes, but with diﬀerent frequency depending on both the site and

the nature of the content. A web crawler needs to learn these update patterns

to ensure that content is reasonably current. Getting the right recrawl frequency

is tricky: too frequent means wasted resources, but not frequent enough leads to

stale content.

• The web is full of duplicate content. Examples include multiple copies of a popu-

lar conference paper, mirrors of frequently-accessed sites such as Wikipedia, and

newswire content that is often duplicated. The problem is compounded by the fact

that most repetitious pages are not exact duplicates but near duplicates (that is,

basically the same page but with different ads, navigation bars, etc.). It is desirable during the crawling process to identify near duplicates and select the best

exemplar to index.

• The web is multilingual. There is no guarantee that pages in one language only

link to pages in the same language. For example, a professor in Asia may maintain

her website in the local language but include links to publications in English.

4.2. INVERTED INDEXES 73

Furthermore, many pages contain a mix of text in diﬀerent languages. Since doc-

ument processing techniques (e.g., tokenization, stemming) diﬀer by language, it

is important to identify the (dominant) language on a page.

The above discussion is not meant to be an exhaustive enumeration of issues, but rather

to give the reader an appreciation of the complexities involved in this intuitively simple

task. For more information, see a recent survey on web crawling [113]. Section 4.7

provides pointers to additional readings.

4.2 INVERTED INDEXES

In its basic form, an inverted index consists of postings lists, one associated with each

term that appears in the collection.[4] The structure of an inverted index is illustrated in Figure 4.1. A postings list is comprised of individual postings, each of which consists of

a document id and a payload—information about occurrences of the term in the doc-

ument. The simplest payload is. . . nothing! For simple boolean retrieval, no additional

information is needed in the posting other than the document id; the existence of the

posting itself indicates the presence of the term in the document. The most common

payload, however, is term frequency (tf), or the number of times the term occurs in the

document. More complex payloads include positions of every occurrence of the term in

the document (to support phrase queries and document scoring based on term proxim-

ity), properties of the term (such as if it occurred in the page title or not, to support

document ranking based on notions of importance), or even the results of additional

linguistic processing (for example, indicating that the term is part of a place name, to

support address searches). In the web context, anchor text information (text associated

with hyperlinks from other pages to the page in question) is useful in enriching the

representation of document content (e.g., [107]); this information is often stored in the

index as well.

In the example shown in Figure 4.1, we see that term1 occurs in {d1, d5, d6, d11, . . .}, term2 occurs in {d11, d23, d59, d84, . . .}, and term3 occurs in {d1, d4, d11, d19, . . .}. In an actual implementation, we assume that documents can be identified by a unique integer ranging from 1 to n, where n is the total number of documents.[5] Generally, postings are

sorted by document id, although other sort orders are possible as well. The document ids

have no inherent semantic meaning, although assignment of numeric ids to documents

need not be arbitrary. For example, pages from the same domain may be consecutively

numbered. Or, alternatively, pages that are higher in quality (based, for example, on

PageRank values) might be assigned smaller numeric values so that they appear toward

[4] In information retrieval parlance, term is preferred over word since documents are processed (e.g., tokenization and stemming) into basic units that are often not words in the linguistic sense.

[5] It is preferable to start numbering the documents at one since it is not possible to code zero with many common compression schemes used in information retrieval; see Section 4.5.

Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides quick access to document ids that contain a term.

the front of a postings list. Either way, an auxiliary data structure is necessary to

maintain the mapping from integer document ids to some other more meaningful handle,

such as a URL.

Given a query, retrieval involves fetching postings lists associated with query terms

and traversing the postings to compute the result set. In the simplest case, boolean

retrieval involves set operations (union for boolean OR and intersection for boolean

AND) on postings lists, which can be accomplished very eﬃciently since the postings

are sorted by document id. In the general case, however, query–document scores must be

computed. Partial document scores are stored in structures called accumulators. At the

end (i.e., once all postings have been processed), the top k documents are then extracted

to yield a ranked list of results for the user. Of course, there are many optimization

strategies for query evaluation (both approximate and exact) that reduce the number

of postings a retrieval engine must examine.
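For instance, intersection of two docid-sorted postings lists can be sketched with the standard two-pointer merge (plain Python for illustration):

```python
# Sketch of boolean AND over two postings lists sorted by document id,
# using a two-pointer merge; runs in time linear in the combined length
# of the two lists.

def intersect(p1, p2):
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# term1 and term3 from Figure 4.1 (document ids only):
print(intersect([1, 5, 6, 11], [1, 4, 11, 19]))  # [1, 11]
```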

The size of an inverted index varies, depending on the payload stored in each

posting. If only term frequency is stored, a well-optimized inverted index can be a tenth

of the size of the original document collection. An inverted index that stores positional

information would easily be several times larger than one that does not. Generally, it

is possible to hold the entire vocabulary (i.e., dictionary of all the terms) in memory,

especially with techniques such as front-coding [156]. However, with the exception of

well-resourced, commercial web search engines,[6] postings lists are usually too large to

store in memory and must be held on disk, usually in compressed form (more details in

Section 4.5). Query evaluation, therefore, necessarily involves random disk access and

“decoding” of the postings. One important aspect of the retrieval problem is to organize

disk operations such that random seeks are minimized.

[6] Google keeps indexes in memory.

1: class Mapper
2:   procedure Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(term t, posting ⟨n, H{t}⟩)

1: class Reducer
2:   procedure Reduce(term t, postings [⟨n1, f1⟩, ⟨n2, f2⟩ . . .])
3:     P ← new List
4:     for all posting ⟨a, f⟩ ∈ postings [⟨n1, f1⟩, ⟨n2, f2⟩ . . .] do
5:       Append(P, ⟨a, f⟩)
6:     Sort(P)
7:     Emit(term t, postings P)

Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Mappers emit postings keyed by terms, the execution framework groups postings by term, and the reducers write postings lists to disk.

Once again, this brief discussion glosses over many complexities and does a huge

injustice to the tremendous amount of research in information retrieval. However, our

goal is to provide the reader with an overview of the important issues; Section 4.7

provides references to additional readings.

4.3 INVERTED INDEXING: BASELINE IMPLEMENTATION

MapReduce was designed from the very beginning to produce the various data struc-

tures involved in web search, including inverted indexes and the web graph. We begin

with the basic inverted indexing algorithm shown in Figure 4.2.

Input to the mapper consists of document ids (keys) paired with the actual con-

tent (values). Individual documents are processed in parallel by the mappers. First,

each document is analyzed and broken down into its component terms. The process-

ing pipeline diﬀers depending on the application and type of document, but for web

pages typically involves stripping out HTML tags and other elements such as JavaScript

code, tokenizing, case folding, removing stopwords (common words such as ‘the’, ‘a’,

‘of’, etc.), and stemming (removing aﬃxes from words so that ‘dogs’ becomes ‘dog’).

Once the document has been analyzed, term frequencies are computed by iterating over

all the terms and keeping track of counts. Lines 4 and 5 in the pseudo-code reﬂect the

process of computing term frequencies, but hide the details of document processing.


After this histogram has been built, the mapper then iterates over all terms. For each

term, a pair consisting of the document id and the term frequency is created. Each pair,

denoted by 'n, H¦t¦` in the pseudo-code, represents an individual posting. The mapper

then emits an intermediate key-value pair with the term as the key and the posting as

the value, in line 7 of the mapper pseudo-code. Although as presented here only the

term frequency is stored in the posting, this algorithm can be easily augmented to store

additional information (e.g., term positions) in the payload.

In the shuﬄe and sort phase, the MapReduce runtime essentially performs a large,

distributed group by of the postings by term. Without any additional eﬀort by the

programmer, the execution framework brings together all the postings that belong in

the same postings list. This tremendously simpliﬁes the task of the reducer, which

simply needs to gather together all the postings and write them to disk. The reducer

begins by initializing an empty list and then appends all postings associated with the

same key (term) to the list. The postings are then sorted by document id, and the entire

postings list is emitted as a value, with the term as the key. Typically, the postings list

is ﬁrst compressed, but we leave this aside for now (see Section 4.4 for more details).

The ﬁnal key-value pairs are written to disk and comprise the inverted index. Since

each reducer writes its output in a separate ﬁle in the distributed ﬁle system, our ﬁnal

index will be split across r ﬁles, where r is the number of reducers. There is no need to

further consolidate these ﬁles. Separately, we must also build an index to the postings

lists themselves for the retrieval engine: this is typically in the form of mappings from

term to (ﬁle, byte oﬀset) pairs, so that given a term, the retrieval engine can fetch

its postings list by opening the appropriate ﬁle and seeking to the correct byte oﬀset

position in that ﬁle.

Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example

consisting of three documents, three mappers, and two reducers. Intermediate key-value

pairs (from the mappers) and the ﬁnal key-value pairs comprising the inverted index

(from the reducers) are shown in the boxes with dotted lines. Postings are shown as

pairs of boxes, with the document id on the left and the term frequency on the right.

The MapReduce programming model provides a very concise expression of the in-

verted indexing algorithm. Its implementation is similarly concise: the basic algorithm

can be implemented in as few as a couple dozen lines of code in Hadoop (with mini-

mal document processing). Such an implementation can be completed as a week-long

programming assignment in a course for advanced undergraduates or ﬁrst-year gradu-

ate students [83, 93]. In a non-MapReduce indexer, a signiﬁcant fraction of the code

is devoted to grouping postings by term, given constraints imposed by memory and

disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the

programmer does not need to worry about any of these issues—most of the heavy lifting

is performed by the execution framework.
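To illustrate just how concise the algorithm is, here is a compact Python simulation of it, with the shuffle and sort phase replaced by an in-memory group-by and document processing reduced to whitespace tokenization:

```python
# Sketch of the baseline indexing algorithm of Figure 4.2. The group-by
# stands in for MapReduce's distributed shuffle and sort.
from collections import defaultdict

def map_phase(docid, text):
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1                 # lines 4-5: build the term histogram
    for term, f in tf.items():
        yield term, (docid, f)        # line 7: posting = (docid, tf)

def index(docs):
    groups = defaultdict(list)        # stands in for shuffle and sort
    for docid, text in docs.items():
        for term, posting in map_phase(docid, text):
            groups[term].append(posting)
    # Reducer: buffer postings, sort by docid, emit the postings list.
    return {term: sorted(postings) for term, postings in groups.items()}

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}
print(index(docs)['fish'])  # [(1, 2), (2, 2)]
```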

Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three mappers and two reducers, over three documents: "one fish, two fish" (doc 1), "red fish, blue fish" (doc 2), and "one red bird" (doc 3). Postings are shown as pairs of boxes (docid, tf).

4.4 INVERTED INDEXING: REVISED IMPLEMENTATION

The inverted indexing algorithm presented in the previous section serves as a reasonable

baseline. However, there is a signiﬁcant scalability bottleneck: the algorithm assumes

that there is suﬃcient memory to hold all postings associated with the same term. Since

the basic MapReduce execution framework makes no guarantees about the ordering of

values associated with the same key, the reducer ﬁrst buﬀers all postings (line 5 of the

reducer pseudo-code in Figure 4.2) and then performs an in-memory sort before writing

the postings to disk.[7] For efficient retrieval, postings need to be sorted by document id.

However, as collections become larger, postings lists grow longer, and at some point in

time, reducers will run out of memory.

There is a simple solution to this problem. Since the execution framework guaran-

tees that keys arrive at each reducer in sorted order, one way to overcome the scalability

[7] See similar discussion in Section 3.4: in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer.

bottleneck is to let the MapReduce runtime do the sorting for us. Instead of emitting key-value pairs of the following type:

(term t, posting ⟨docid, f⟩)

we instead emit intermediate key-value pairs of the type:

(tuple ⟨t, docid⟩, tf f)

In other words, the key is a tuple containing the term and the document id, while the value is the term frequency. This is exactly the value-to-key conversion design pattern introduced in Section 3.4. With this modification, the programming model ensures that the postings arrive in the correct order. This, combined with the fact that reducers can hold state across multiple keys, allows postings lists to be created with minimal memory usage. As a detail, remember that we must define a custom partitioner to ensure that all tuples with the same term are shuffled to the same reducer.
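Such a partitioner can be sketched as follows (plain Python rather than Hadoop's Partitioner API; the CRC32-based hash is an arbitrary stand-in for any stable hash function):

```python
# Sketch of the custom partitioner required by value-to-key conversion:
# the intermediate key is a (term, docid) tuple, but the partition is
# computed from the term alone, so all postings for a given term reach
# the same reducer regardless of docid.
import zlib

def partition(key, num_reducers):
    term, docid = key
    return zlib.crc32(term.encode()) % num_reducers

# Every (term, docid) pair for "fish" lands in the same partition:
parts = {partition(('fish', d), 4) for d in (1, 2, 7, 42)}
assert len(parts) == 1
```

Sorting, meanwhile, is left to the framework: because the full (term, docid) tuple is the key, keys within a partition arrive at the reducer sorted first by term and then by document id.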

The revised MapReduce inverted indexing algorithm is shown in Figure 4.4. The mapper remains unchanged for the most part, other than differences in the intermediate key-value pairs. The Reduce method is called for each key (i.e., ⟨t, n⟩), and by design, there will only be one value associated with each key. For each key-value pair, a posting can be directly added to the postings list. Since the postings are guaranteed to arrive in sorted order by document id, they can be incrementally coded in compressed form—thus ensuring a small memory footprint. Finally, when all postings associated with the same term have been processed (i.e., t ≠ t_prev), the entire postings list is emitted. The final postings list must be written out in the Close method. As with the baseline algorithm, payloads can be easily changed: by simply replacing the intermediate value f (term frequency) with whatever else is desired (e.g., term positional information).
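The reducer logic of Figure 4.4 can be sketched in plain Python as follows, assuming (term, docid) keys arrive in sorted order:

```python
# Sketch of the Figure 4.4 reducer: postings for the current term are
# accumulated, and the completed list is emitted whenever a new term
# appears (and once more in Close for the final term).

class StreamingReducer:
    def __init__(self):
        self.t_prev, self.postings, self.out = None, [], []

    def reduce(self, key, f):
        t, n = key
        if t != self.t_prev and self.t_prev is not None:
            self.out.append((self.t_prev, self.postings))
            self.postings = []
        self.postings.append((n, f))
        self.t_prev = t

    def close(self):
        self.out.append((self.t_prev, self.postings))
        return self.out

r = StreamingReducer()
for key, f in [(('bird', 3), 1), (('fish', 1), 2), (('fish', 2), 2)]:
    r.reduce(key, f)
print(r.close())  # [('bird', [(3, 1)]), ('fish', [(1, 2), (2, 2)])]
```

A production version would compress each posting incrementally as it is appended, rather than keeping the list as plain tuples.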

There is one more detail we must address when building inverted indexes. Since

almost all retrieval models take into account document length when computing query–

document scores, this information must also be extracted. Although it is straightforward

to express this computation as another MapReduce job, this task can actually be folded

into the inverted indexing process. When processing the terms in each document, the

document length is known, and can be written out as “side data” directly to HDFS.

We can take advantage of the ability for a mapper to hold state across the processing of

multiple documents in the following manner: an in-memory associative array is created

to store document lengths, which is populated as each document is processed.

8

When

the mapper ﬁnishes processing input records, document lengths are written out to

HDFS (i.e., in the Close method). This approach is essentially a variant of the in-

mapper combining pattern. Document length data ends up in m diﬀerent ﬁles, where

m is the number of mappers; these ﬁles are then consolidated into a more compact

[8] In general, there is no need to worry about insufficient memory to hold these data.

1: class Mapper
2:   method Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(tuple ⟨t, n⟩, tf H{t})

1: class Reducer
2:   method Initialize
3:     t_prev ← ∅
4:     P ← new PostingsList
5:   method Reduce(tuple ⟨t, n⟩, tf [f])
6:     if t ≠ t_prev ∧ t_prev ≠ ∅ then
7:       Emit(term t_prev, postings P)
8:       P.Reset()
9:     P.Add(⟨n, f⟩)
10:    t_prev ← t
11:  method Close
12:    Emit(term t_prev, postings P)

Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they arrive sorted by document id in the reducer.

representation. Alternatively, document length information can be emitted in special

key-value pairs by the mapper. One must then write a custom partitioner so that these

special key-value pairs are shuﬄed to a single reducer, which will be responsible for

writing out the length data separate from the postings lists.

4.5 INDEX COMPRESSION

We return to the question of how postings are actually compressed and stored on disk. This chapter devotes a substantial amount of space to this topic because index compression is one of the main differences between a "toy" indexer and one that works on real-world collections. Otherwise, MapReduce inverted indexing algorithms are pretty straightforward.

Let us consider the canonical case where each posting consists of a document id and the term frequency. A naïve implementation might represent the first as a 32-bit integer⁹ and the second as a 16-bit integer. Thus, a postings list might be encoded as follows:

[(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), . . .]

where each posting is represented by a pair in parentheses. Note that all brackets, parentheses, and commas are only included to enhance readability; in reality the postings would be represented as a long stream of integers. This naïve implementation would require six bytes per posting. Using this scheme, the entire inverted index would be about as large as the collection itself. Fortunately, we can do significantly better.

The first trick is to encode differences between document ids as opposed to the document ids themselves. Since the postings are sorted by document ids, the differences (called d-gaps) must be positive integers greater than zero. The above postings list, represented with d-gaps, would be:

[(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .]

Of course, we must actually encode the first document id. We haven't lost any information, since the original document ids can be easily reconstructed from the d-gaps. However, it's not obvious that we've reduced the space requirements either, since the largest possible d-gap is one less than the number of documents in the collection.
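Since d-gaps are just running differences, the transformation takes only a few lines in any language; a minimal Python sketch (function names are ours, for illustration):

```python
def to_dgaps(postings):
    # Convert (docid, tf) postings, sorted by docid, into (d-gap, tf)
    # pairs; the first "gap" is the first docid itself.
    gaps, prev = [], 0
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def from_dgaps(gaps):
    # Invert the transformation with a running sum over the gaps.
    postings, docid = [], 0
    for gap, tf in gaps:
        docid += gap
        postings.append((docid, tf))
    return postings
```

Applied to the postings list above, the first function reproduces the d-gap sequence [(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .], and the second recovers the original document ids.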

This is where the second trick comes in, which is to represent the d-gaps in a way such that it takes less space for smaller numbers. Similarly, we want to apply the same techniques to compress the term frequencies, since for the most part they are also small values. But to understand how this is done, we need to take a slight detour into compression techniques, particularly for coding integers.

Compression, in general, can be characterized as either lossless or lossy: it's fairly obvious that lossless compression is required in this context. To start, it is important to understand that all compression techniques represent a time–space tradeoff. That is, we reduce the amount of space on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding and decoding data. Therefore, it is possible that compression reduces size but also slows processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up with disk bandwidth), we can achieve the best of both worlds: smaller and faster.

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and 2³² − 1, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously, encoding d-gaps this way doesn't yield any reductions in size.

⁹ However, note that 2³² − 1 is "only" 4,294,967,295, which is much less than even the most conservative estimate of the size of the web.

A simple approach to compression is to only use as many bytes as necessary to represent the integer. This is known as variable-length integer coding (varInt for short) and accomplished by using the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero elsewhere. As a result, we have 7 bits per byte for coding the value, which means that 0 ≤ n < 2⁷ can be expressed with 1 byte, 2⁷ ≤ n < 2¹⁴ with 2 bytes, 2¹⁴ ≤ n < 2²¹ with 3, and 2²¹ ≤ n < 2²⁸ with 4 bytes. This scheme can be extended to code arbitrarily-large integers (i.e., beyond 4 bytes). As a concrete example, the two numbers:

127, 128

would be coded as such:

1 1111111, 0 0000001 1 0000000

The above code contains two code words, the first consisting of 1 byte, and the second consisting of 2 bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are byte-aligned because the code words always fall along byte boundaries. As a result, there is never any ambiguity about where one code word ends and the next begins. However, the downside of varInt coding is that decoding involves lots of bit operations (masks, shifts). Furthermore, the continuation bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps), which slows down processing.
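As an illustration, the varInt scheme just described can be sketched in a few lines of Python, following the convention used above (high bit set to one only in the last byte of a code word; some real implementations use the opposite convention):

```python
def varint_encode(n):
    # Split n into 7-bit groups, most significant first, and mark the
    # final byte by setting its high (continuation) bit.
    assert n >= 0
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()
    out = bytearray(groups)
    out[-1] |= 0x80          # continuation bit = 1 on the last byte
    return bytes(out)

def varint_decode(stream):
    # Accumulate 7 bits per byte; a set high bit ends the code word.
    values, n = [], 0
    for b in stream:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:
            values.append(n)
            n = 0
    return values
```

Encoding 127 and 128 with this sketch yields the byte patterns 1 1111111 and 0 0000001 1 0000000 shown above.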

A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.¹⁰ The insight is to code groups of four integers at a time. Each group begins with a prefix byte, divided into four 2-bit values that specify the byte length of each of the following integers. For example, the following prefix byte:

00,00,01,10

indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Therefore, each group of four integers would consume anywhere between 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder on how to process subsequent bytes to recover the coded integers. The advantage of this group varInt coding scheme is that values can be decoded with fewer branch mispredicts and bitwise operations. Experiments reported by Dean suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.
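A decoder for one such group might be sketched as follows; note that the exact bit layout of the prefix byte and the byte order of the packed values are our assumptions for illustration, not details specified here:

```python
def group_varint_decode_one(buf):
    # Decode one group of four integers; returns (values, bytes consumed).
    # Assumes the prefix byte stores the four 2-bit lengths with the first
    # integer's length in the most significant pair, values big-endian.
    lengths = [((buf[0] >> shift) & 0b11) + 1 for shift in (6, 4, 2, 0)]
    values, pos = [], 1
    for length in lengths:
        values.append(int.from_bytes(buf[pos:pos + length], "big"))
        pos += length
    return values, pos
```

For a prefix byte of 00,00,01,10 the decoder reads one-, one-, two-, and three-byte integers, consuming 8 bytes in total for the group.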

¹⁰ http://research.google.com/people/jeff/WSDM09-keynote.pdf

In most architectures, accessing entire machine words is more efficient than fetching all its bytes separately. Therefore, it makes sense to store postings in increments

of 16-bit, 32-bit, or 64-bit machine words. Anh and Moffat [8] presented several word-aligned coding methods, one of which is called Simple-9, based on 32-bit words. In this coding scheme, four bits in each 32-bit word are reserved as a selector. The remaining 28 bits are used to code actual integer values. Now, there are a variety of ways these 28 bits can be divided to code one or more integers: 28 bits can be used to code one 28-bit integer, two 14-bit integers, three 9-bit integers (with one bit unused), etc., all the way up to twenty-eight 1-bit integers. In fact, there are nine different ways the 28 bits can be divided into equal parts (hence the name of the technique), some with leftover unused bits. Which division is in use is stored in the selector bits. Therefore, decoding involves reading a 32-bit word, examining the selector to see how the remaining 28 bits are packed, and then appropriately decoding each integer. Coding works in the opposite way: the algorithm scans ahead to see how many integers can be squeezed into 28 bits, packs those integers, and sets the selector bits appropriately.
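The nine divisions and the pack/unpack logic can be sketched as follows (a simplified illustration: a production Simple-9 coder would also handle a trailing group with fewer integers than the greedy scan finds, and would pack words into an output stream):

```python
# The nine ways to split 28 bits into equal-width slots: (count, bits each).
SIMPLE9_MODES = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5),
                 (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_pack_one(values):
    # Greedily pack as many leading integers as possible into one 32-bit
    # word: selector in the top 4 bits, values in the 28-bit payload.
    for selector, (count, bits) in enumerate(SIMPLE9_MODES):
        if len(values) >= count and all(v < (1 << bits) for v in values[:count]):
            word = selector << 28
            for slot, v in enumerate(values[:count]):
                word |= v << (slot * bits)
            return word, count
    raise ValueError("value too large for 28 bits")

def simple9_unpack(word):
    # Read the selector, then extract the equal-width slots.
    count, bits = SIMPLE9_MODES[word >> 28]
    mask = (1 << bits) - 1
    return [(word >> (slot * bits)) & mask for slot in range(count)]
```

For example, a run of seven small values fits the (7, 4) division, so one word carries all seven integers.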

4.5.2 BIT-ALIGNED CODES

The advantage of byte-aligned and word-aligned codes is that they can be coded and decoded quickly. The downside, however, is that they must consume multiples of eight bits, even when fewer bits might suffice (the Simple-9 scheme gets around this by packing multiple integers into a 32-bit word, but even then, bits are often wasted). In bit-aligned codes, on the other hand, code words can occupy any number of bits, meaning that boundaries can fall anywhere. In practice, coding and decoding bit-aligned codes require processing bytes and appropriately shifting or masking bits (usually more involved than varInt and group varInt coding).

One additional challenge with bit-aligned codes is that we need a mechanism to delimit code words, i.e., tell where the last ends and the next begins, since there are no byte boundaries to guide us. To address this issue, most bit-aligned codes are so-called prefix codes (confusingly, they are also called prefix-free codes), in which no valid code word is a prefix of any other valid code word. For example, coding 0 ≤ x < 3 with {0, 1, 01} is not a valid prefix code, since 0 is a prefix of 01, and so we can't tell if 01 is two code words or one. On the other hand, {00, 01, 1} is a valid prefix code, such that a sequence of bits:

0001101001010100

can be unambiguously segmented into:

00 01 1 01 00 1 01 01 00

and decoded without any additional delimiters.

One of the simplest prefix codes is the unary code. An integer x > 0 is coded as x − 1 one bits followed by a zero bit. Note that unary codes do not allow the representation of zero, which is fine since d-gaps and term frequencies should never be zero.¹¹ As an example, 4 in unary code is 1110. With unary code we can code x in x bits, which although economical for small values, becomes inefficient for even moderately large values. Unary codes are rarely used by themselves, but form a component of other coding schemes. Unary codes of the first ten positive integers are shown in Figure 4.5.

                                Golomb
x    unary        γ          b = 5    b = 10
1    0            0          0:00     0:000
2    10           10:0       0:01     0:001
3    110          10:1       0:10     0:010
4    1110         110:00     0:110    0:011
5    11110        110:01     0:111    0:100
6    111110       110:10     10:00    0:101
7    1111110      110:11     10:01    0:1100
8    11111110     1110:000   10:10    0:1101
9    111111110    1110:001   10:110   0:1110
10   1111111110   1110:010   10:111   0:1111

Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.

Elias γ code is an efficient coding scheme that is widely used in practice. An integer x > 0 is broken into two components, 1 + ⌊log₂ x⌋ (= n, the length), which is coded in unary code, and x − 2^⌊log₂ x⌋ (= r, the remainder), which is in binary.¹² The unary component n specifies the number of bits required to code x, and the binary component codes the remainder r in n − 1 bits. As an example, consider x = 10: 1 + ⌊log₂ 10⌋ = 4, which is 1110. The binary component codes x − 2³ = 2 in 4 − 1 = 3 bits, which is 010. Putting both together, we arrive at 1110:010. The extra colon is inserted only for readability; it's not part of the final code, of course.

Working in reverse, it is easy to unambiguously decode a bit stream of γ codes: First, we read a unary code c_u, which is a prefix code. This tells us that the binary portion is written in c_u − 1 bits, which we then read as c_b. We can then reconstruct x as 2^(c_u − 1) + c_b. For x < 16, γ codes occupy less than a full byte, which makes them more compact than variable-length integer codes. Since term frequencies for the most part are relatively small, γ codes make sense for them and can yield substantial space savings. For reference, the γ codes of the first ten positive integers are shown in Figure 4.5. A variation on γ code is δ code, where the n portion of the γ code is coded in γ code itself (as opposed to unary code). For smaller values γ codes are more compact, but for larger values, δ codes take less space.

¹¹ As a note, some sources describe slightly different formulations of the same coding scheme. Here, we adopt the conventions used in the classic IR text Managing Gigabytes [156].

¹² Note that ⌊x⌋ is the floor function, which maps x to the largest integer not greater than x, so, e.g., ⌊3.8⌋ = 3. This is the default behavior in many programming languages when casting from a floating-point type to an integer type.
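The unary and γ procedures just described can be sketched directly; code words are built as strings of '0'/'1' characters for clarity rather than speed:

```python
def unary(x):
    # x > 0 coded as x - 1 one bits followed by a terminating zero bit.
    return "1" * (x - 1) + "0"

def gamma(x):
    # Elias gamma: n = 1 + floor(log2 x) in unary, then the remainder
    # r = x - 2^floor(log2 x) in n - 1 binary bits.
    n = x.bit_length()                  # equals 1 + floor(log2 x)
    r = x - (1 << (n - 1))
    body = format(r, "b").zfill(n - 1) if n > 1 else ""
    return unary(n) + body

def gamma_decode(bits):
    # Read unary c_u, then c_u - 1 binary bits c_b; x = 2^(c_u - 1) + c_b.
    values, i = [], 0
    while i < len(bits):
        n = bits.index("0", i) - i + 1  # length of the unary component
        i += n
        r = int(bits[i:i + n - 1] or "0", 2)
        i += n - 1
        values.append((1 << (n - 1)) + r)
    return values
```

For x = 10 this produces 1110010, matching the worked example (1110:010 without the colon), and the decoder recovers the original values from a concatenated bit stream.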

Unary and γ codes are parameterless, but even better compression can be achieved with parameterized codes. A good example of this is Golomb code. For some parameter b, an integer x > 0 is coded in two parts: first, we compute q = ⌊(x − 1)/b⌋ and code q + 1 in unary; then, we code the remainder r = x − qb − 1 in truncated binary. This is accomplished as follows: if b is a power of two, then truncated binary is exactly the same as normal binary, requiring log₂ b bits. Otherwise, we code the first 2^(⌊log₂ b⌋+1) − b values of r in ⌊log₂ b⌋ bits and code the rest of the values of r by coding r + 2^(⌊log₂ b⌋+1) − b in ordinary binary representation using ⌊log₂ b⌋ + 1 bits. In this case, r is coded in either ⌊log₂ b⌋ or ⌊log₂ b⌋ + 1 bits, and unlike ordinary binary coding, truncated binary codes are prefix codes. As an example, if b = 5, then r can take the values {0, 1, 2, 3, 4}, which would be coded with the following code words: {00, 01, 10, 110, 111}. For reference, Golomb codes of the first ten positive integers are shown in Figure 4.5 for b = 5 and b = 10. A special case of Golomb code is worth noting: if b is a power of two, then coding and decoding can be handled more efficiently (needing only bit shifts and bit masks, as opposed to multiplication and division). These are known as Rice codes.

Researchers have shown that Golomb compression works well for d-gaps, and is optimal with the following parameter setting:

b ≈ 0.69 × (N/df)    (4.1)

where df is the document frequency of the term, and N is the number of documents in the collection.¹³
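Golomb coding with truncated binary can be sketched the same way (the `unary` helper is repeated so the block stands alone; `golomb_b` applies Equation 4.1, rounding to the nearest positive integer):

```python
def unary(x):
    # x > 0 coded as x - 1 one bits followed by a terminating zero bit.
    return "1" * (x - 1) + "0"

def golomb(x, b):
    # q = (x - 1) // b coded as q + 1 in unary, then the remainder
    # r = (x - 1) % b in truncated binary.
    q, r = divmod(x - 1, b)
    k = b.bit_length() - 1              # floor(log2 b)
    cutoff = (1 << (k + 1)) - b         # first `cutoff` remainders use k bits
    if r < cutoff:
        body = format(r, "b").zfill(k) if k else ""
    else:
        body = format(r + cutoff, "b").zfill(k + 1)
    return unary(q + 1) + body

def golomb_b(df, N):
    # Parameter setting from Equation 4.1: b is approximately 0.69 (N / df).
    return max(1, round(0.69 * N / df))
```

With b = 5 this reproduces the Golomb column of Figure 4.5, e.g., x = 9 gives 10110 (10:110 without the colon).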

Putting everything together, one popular approach for postings compression is to represent d-gaps with Golomb codes and term frequencies with γ codes [156, 162]. If positional information is desired, we can use the same trick to code differences between term positions using γ codes.

4.5.3 POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be properly compressed. As we can see from the previous section, there is a wide range of choices that represent different tradeoffs between compression ratio and decoding speed. Actual performance also depends on characteristics of the collection, which, among other factors, determine the distribution of d-gaps. Büttcher et al. [30] recently compared the performance of various compression techniques on coding document ids. In terms of the amount of compression that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best, followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of raw decoding speed, the order was almost the reverse: group varInt was the fastest, followed by varInt.¹⁴ Simple-9 was substantially slower, and the bit-aligned codes were even slower than that. Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes being the slowest (about ten times slower than group varInt).

¹³ For details as to why this is the case, we refer the reader elsewhere [156], but here's the intuition: under reasonable assumptions, the appearance of postings can be modeled as a sequence of independent Bernoulli trials, which implies a certain distribution of d-gaps. From this we can derive an optimal setting of b.

Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that this represents a space-efficient encoding, at the cost of slower decoding compared to alternatives. Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of Golomb codes serves a pedagogical purpose, to illustrate how one might set compression parameters.

Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the document collection and the number of postings for a particular postings list (i.e., the document frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The df of a term, however, is not known until all the postings have been processed; unfortunately, the parameter must be known before any posting is coded. At first glance, this seems like a chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in memory) would suffer from the memory bottleneck we've been trying to avoid in the first place.

To get around this problem, we need to somehow inform the reducer of a term's df before any of its postings arrive. This can be solved with the order inversion design pattern introduced in Section 3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition to emitting intermediate key-value pairs of the following form:

(tuple ⟨t, docid⟩, tf f)

we also emit special intermediate key-value pairs like this:

(tuple ⟨t, ∗⟩, df e)

to keep track of document frequencies associated with each term. In practice, we can accomplish this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an in-memory associative array that keeps track of how many documents a term has been observed in (i.e., the local document frequency of the term for

¹⁴ However, this study found less speed difference between group varInt and basic varInt than Dean's analysis, presumably due to the different distribution of d-gaps in the collections they were examining.


the subset of documents processed by the mapper). Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the partial df as the value.

To ensure that these special keys arrive first, we define the sort order of the tuple so that the special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial df values originating from each mapper. Summing all these partial contributions will yield the term's df, which can then be used to set the Golomb compression parameter b. This allows the postings to be incrementally compressed as they are encountered in the reducer; memory bottlenecks are eliminated since we do not need to buffer postings in memory.
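A sketch of this reducer-side logic, with the grouped key-value stream and the emitted tuples simplified for illustration (a real reducer would emit Golomb-coded bits rather than (docid, tf, b) triples):

```python
def reduce_term_stream(grouped_pairs, N):
    # `grouped_pairs` is the sorted stream for one term, special key first,
    # e.g., [(("fish", "*"), [2, 1]), (("fish", 1), [2]), (("fish", 3), [1])].
    b = None
    compressed = []
    for (term, docid), values in grouped_pairs:
        if docid == "*":
            df = sum(values)                  # total df from partial counts
            b = max(1, round(0.69 * N / df))  # Golomb parameter, Eq. 4.1
        else:
            # b is already known here, so each posting can be coded
            # incrementally instead of being buffered.
            compressed.append((docid, values[0], b))
    return compressed
```

Because the ⟨t, ∗⟩ key sorts first, b is set before the first real posting for the term is seen.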

Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation. For computing relative frequencies, that bit of information was the marginal. In this case, it's the document frequency.

4.6 WHAT ABOUT RETRIEVAL?

Thus far, we have briefly discussed web crawling and focused mostly on MapReduce algorithms for inverted indexing. What about retrieval? It should be fairly obvious that MapReduce, which was designed for large batch operations, is a poor solution for retrieval. Since users demand sub-second response times, every aspect of retrieval must be optimized for low latency, which is exactly the opposite tradeoff made in MapReduce. Recall the basic retrieval problem: we must look up postings lists corresponding to query terms, systematically traverse those postings lists to compute query–document scores, and then return the top k results to the user. Looking up postings implies random disk seeks, since for the most part postings are too large to fit into memory (leaving aside caching and other special cases for now). Unfortunately, random access is not a forte of the distributed file system underlying MapReduce; such operations require multiple round-trip network exchanges (and associated latencies). In HDFS, a client must first obtain the location of the desired data block from the namenode before the appropriate datanode can be contacted for the actual data. Of course, access will typically require a random disk seek on the datanode itself.

It should be fairly obvious that serving the search needs of a large number of users, each of whom demands sub-second response times, is beyond the capabilities of any single machine. The only solution is to distribute retrieval across a large number of machines, which necessitates breaking up the index in some manner. There are two main partitioning strategies for distributed retrieval: document partitioning and term partitioning. Under document partitioning, the entire collection is broken up into multiple smaller sub-collections, each of which is assigned to a server. In other words, each

Figure 4.6: Term–document matrix for a toy collection (nine documents, nine terms) illustrating different partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document partitioning, whereas partitioning horizontally (a, b, c) corresponds to term partitioning.

server holds the complete index for a subset of the entire collection. This corresponds to partitioning vertically in Figure 4.6. With term partitioning, on the other hand, each server is responsible for a subset of the terms for the entire collection. That is, a server holds the postings for all documents in the collection for a subset of terms. This corresponds to partitioning horizontally in Figure 4.6.

Document and term partitioning require different retrieval strategies and represent different tradeoffs. Retrieval under document partitioning involves a query broker, which forwards the user's query to all partition servers, merges partial results from each, and then returns the final results to the user. With this architecture, searching the entire collection requires that the query be processed by every partition server. However, since each partition operates independently and traverses postings in parallel, document partitioning typically yields shorter query latencies (compared to a single monolithic index with much longer postings lists).
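The broker's merge step under document partitioning is essentially a k-way merge of per-partition result lists; a minimal sketch, assuming each partition server returns its local top results as (score, docid) pairs:

```python
import heapq

def broker_merge(partial_results, k):
    # Merge the lists returned by each partition server into a single
    # global top-k, highest scores first.
    all_hits = (hit for partial in partial_results for hit in partial)
    return heapq.nlargest(k, all_hits)
```

Note that a partition server only needs to send its local top k, since no document outside a local top k can appear in the global top k.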

Retrieval under term partitioning, on the other hand, requires a very different strategy. Suppose the user's query Q contains three terms, q₁, q₂, and q₃. Under the pipelined query evaluation strategy, the broker begins by forwarding the query to the server that holds the postings for q₁ (usually the least frequent term). The server traverses the appropriate postings list and computes partial query–document scores, stored in the accumulators. The accumulators are then passed to the server that holds the postings associated with q₂ for additional processing, and then to the server for q₃, before final results are passed back to the broker and returned to the user. Although this query evaluation strategy may not substantially reduce the latency of any particular query, it can theoretically increase a system's throughput due to the far smaller number of total disk seeks required for each user query (compared to document partitioning). However, load balancing is tricky in a pipelined term-partitioned architecture due to skew in the distribution of query terms, which can create "hot spots" on servers that hold the postings for frequently-occurring query terms.

In general, studies have shown that document partitioning is a better strategy overall [109], and this is the strategy adopted by Google [16]. Furthermore, it is known that Google maintains its indexes in memory (although this is certainly not the common case for search engines in general). One key advantage of document partitioning is that result quality degrades gracefully with machine failures. Partition servers that are offline will simply fail to deliver results for their subsets of the collection. With sufficient partitions, users might not even be aware that documents are missing. For most queries, the web contains more relevant documents than any user has time to digest: users of course care about getting relevant documents (sometimes, they are happy with a single relevant document), but they are generally less discriminating when it comes to which relevant documents appear in their results (out of the set of all relevant documents). Note that partitions may be unavailable due to reasons other than machine failure: cycling through different partitions is a very simple and non-disruptive strategy for index updates.

Working in a document-partitioned architecture, there are a variety of approaches to dividing up the web into smaller pieces. Proper partitioning of the collection can address one major weakness of this architecture, which is that every partition server is involved in every user query. Along one dimension, it is desirable to partition by document quality using one or more classifiers; see [124] for a recent survey on web page classification. Partitioning by document quality supports a multi-phase search strategy: the system examines partitions containing high quality documents first, and only backs off to partitions containing lower quality documents if necessary. This reduces the number of servers that need to be contacted for a user query. Along an orthogonal dimension, it is desirable to partition documents by content (perhaps also guided by the distribution of user queries from logs), so that each partition is "well separated" from the others in terms of topical coverage. This also reduces the number of machines that need to be involved in serving a user's query: the broker can direct queries only to the partitions that are likely to contain relevant documents, as opposed to forwarding the user query to all the partitions.

On a large scale, reliability of service is provided by replication, both in terms of multiple machines serving the same partition within a single datacenter, and in terms of replication across geographically-distributed datacenters. This creates at least two query routing problems: since it makes sense to serve clients from the closest datacenter, a service must route queries to the appropriate location. Within a single datacenter, the system needs to properly balance load across replicas.

There are two final components of real-world search engines that are worth discussing. First, recall that postings only store document ids. Therefore, raw retrieval results consist of a ranked list of semantically meaningless document ids. It is typically the responsibility of document servers, functionally distinct from the partition servers holding the indexes, to generate meaningful output for user presentation. Abstractly, a document server takes as input a query and a document id, and computes an appropriate result entry, typically comprising the title and URL of the page, a snippet of the source document showing the user's query terms in context, and additional metadata about the document. Second, query evaluation can benefit immensely from caching, of individual postings (assuming that the index is not already in memory) and even results of entire queries [13]. This is made possible by the Zipfian distribution of queries, with very frequent queries at the head of the distribution dominating the total number of queries. Search engines take advantage of this with cache servers, which are functionally distinct from all of the components discussed above.

4.7 SUMMARY AND ADDITIONAL READINGS

Web search is a complex problem that breaks down into three conceptually-distinct components. First, the document collection must be gathered (by crawling the web). Next, inverted indexes and other auxiliary data structures must be built from the documents. Both of these can be considered offline problems. Finally, index structures must be accessed and processed in response to user queries to generate search results. This last task is an online problem that demands both low latency and high throughput.

This chapter primarily focused on building inverted indexes, the problem most suitable for MapReduce. After all, inverted indexing is nothing but a very large distributed sort and group by operation! We began with a baseline implementation of an inverted indexing algorithm, but quickly noticed a scalability bottleneck that stemmed from having to buffer postings in memory. Application of the value-to-key conversion design pattern (Section 3.4) addressed the issue by offloading the task of sorting postings by document id to the MapReduce execution framework. We also surveyed various techniques for integer compression, which yield postings lists that are both more compact and faster to process. As a specific example, one could use Golomb codes for compressing d-gaps and γ codes for term frequencies. We showed how the order inversion design pattern introduced in Section 3.3 for computing relative frequencies can be used to properly set compression parameters.

Additional Readings. Our brief discussion of web search glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval. Here, however, we provide a few entry points into the literature. A survey article by Zobel and Moffat [162] is an excellent starting point on indexing and retrieval algorithms. Another by Baeza-Yates et al. [11] overviews many important issues in distributed retrieval. A keynote talk at the WSDM 2009 conference by Jeff Dean revealed a lot of information about the evolution of the Google search architecture.¹⁵ Finally, a number of general information retrieval textbooks have been recently published [101, 42, 30]. Of these three, the one by Büttcher et al. [30] is noteworthy in having detailed experimental evaluations that compare the performance (both effectiveness and efficiency) of a wide range of algorithms and techniques. While outdated in many other respects, the textbook Managing Gigabytes [156] remains an excellent source for index compression techniques. Finally, ACM SIGIR is an annual conference and the most prestigious venue for academic information retrieval research; proceedings from those events are perhaps the best starting point for those wishing to keep abreast of publicly-documented developments in the field.

¹⁵ http://research.google.com/people/jeff/WSDM09-keynote.pdf

CHAPTER 5

Graph Algorithms

Graphs are ubiquitous in modern society: examples encountered by almost everyone

on a daily basis include the hyperlink structure of the web (simply known as the web

graph), social networks (manifest in the ﬂow of email, phone call patterns, connections

on social networking sites, etc.), and transportation networks (roads, bus routes, ﬂights,

etc.). Our very own existence is dependent on an intricate metabolic and regulatory

network, which can be characterized as a large, complex graph involving interactions

between genes, proteins, and other cellular products. This chapter focuses on graph

algorithms in MapReduce. Although most of the content has nothing to do with text

processing per se, documents frequently exist in the context of some underlying network,

making graph analysis an important component of many text processing applications.

Perhaps the best known example is PageRank, a measure of web page quality based

on the structure of hyperlinks, which is used in ranking results for web search. As one

of the ﬁrst applications of MapReduce, PageRank exempliﬁes a large class of graph

algorithms that can be concisely captured in the programming model. We will discuss

PageRank in detail later in this chapter.

In general, graphs can be characterized by nodes (or vertices) and links (or edges)

that connect pairs of nodes.¹ These connections can be directed or undirected. In some

graphs, there may be an edge from a node to itself, resulting in a self loop; in others,

such edges are disallowed. We assume that both nodes and links may be annotated with

additional metadata: as a simple example, in a social network where nodes represent

individuals, there might be demographic information (e.g., age, gender, location) at-

tached to the nodes and type information attached to the links (e.g., indicating type of

relationship such as “friend” or “spouse”).

Mathematicians have always been fascinated with graphs, dating back to Euler’s

paper on the Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs

have been extensively studied, and today much is known about their properties. Far

more than theoretical curiosities, theorems and algorithms on graphs can be applied to

solve many real-world problems:

• Graph search and path planning. Search algorithms on graphs are invoked millions

of times a day, whenever anyone searches for directions on the web. Similar algo-

rithms are also involved in friend recommendations and expert-ﬁnding in social

networks. Path planning problems involving everything from network packets to

delivery trucks represent another large class of graph search problems.

¹ Throughout this chapter, we use node interchangeably with vertex and similarly with link and edge.


• Graph clustering. Can a large graph be divided into components that are relatively

disjoint (for example, as measured by inter-component links [59])? Among other

applications, this task is useful for identifying communities in social networks (of

interest to sociologists who wish to understand how human relationships form and

evolve) and for partitioning large graphs (of interest to computer scientists who

seek to better parallelize graph processing). See [158] for a survey.

• Minimum spanning trees. A minimum spanning tree for a graph G with weighted

edges is a tree that contains all vertices of the graph and a subset of edges that

minimizes the sum of edge weights. A real-world example of this problem is a

telecommunications company that wishes to lay optical ﬁber to span a number

of destinations at the lowest possible cost (where weights denote costs). This approach has also been applied to a wide variety of problems, including social networks

and the migration of Polynesian islanders [64].

• Bipartite graph matching. A bipartite graph is one whose vertices can be divided

into two disjoint sets. Matching problems on such graphs can be used to model

job seekers looking for employment or singles looking for dates.

• Maximum ﬂow. In a weighted directed graph with two special nodes called the

source and the sink, the max ﬂow problem involves computing the amount of

“traﬃc” that can be sent from source to sink given various ﬂow capacities deﬁned

by edge weights. Transportation companies (airlines, shipping, etc.) and network

operators grapple with complex versions of these problems on a daily basis.

• Identifying “special” nodes. There are many ways to deﬁne what special means,

including metrics based on node in-degree, average distance to other nodes, and

relationship to cluster structure. These special nodes are important to investigators

attempting to break up terrorist cells, epidemiologists modeling the spread of

diseases, advertisers trying to promote products, and many others.

A common feature of these problems is the scale of the datasets on which the algorithms

must operate: for example, the hyperlink structure of the web, which contains billions

of pages, or social networks that contain hundreds of millions of individuals. Clearly,

algorithms that run on a single machine and depend on the entire graph residing in

memory are not scalable. We’d like to put MapReduce to work on these challenges.²

This chapter is organized as follows: we begin in Section 5.1 with an introduction

to graph representations, and then explore two classic graph algorithms in MapReduce:

² As a side note, Google recently published a short description of a system called Pregel [98], based on Valiant’s Bulk Synchronous Parallel model [148], for large-scale graph algorithms; a longer description is anticipated in a forthcoming paper [99].


       n1  n2  n3  n4  n5          adjacency lists
  n1    0   1   0   1   0          n1: [n2, n4]
  n2    0   0   1   0   1          n2: [n3, n5]
  n3    0   0   0   1   0          n3: [n4]
  n4    0   0   0   0   1          n4: [n5]
  n5    1   1   1   0   0          n5: [n1, n2, n3]

Figure 5.1: A simple directed graph (left) represented as an adjacency matrix (middle) and with adjacency lists (right).

parallel breadth-ﬁrst search (Section 5.2) and PageRank (Section 5.3). Before conclud-

ing with a summary and pointing out additional readings, Section 5.4 discusses a number

of general issues that affect graph processing with MapReduce.

5.1 GRAPH REPRESENTATIONS

One common way to represent graphs is with an adjacency matrix. A graph with n nodes

can be represented as an n × n square matrix M, where a value in cell m_ij indicates an edge from node n_i to node n_j. In the case of graphs with weighted edges, the matrix cells

contain edge weights; otherwise, each cell contains either a one (indicating an edge),

or a zero (indicating none). With undirected graphs, only half the matrix is used (e.g.,

cells above the diagonal). For graphs that allow self loops (a directed edge from a node

to itself), the diagonal might be populated; otherwise, the diagonal remains empty.

Figure 5.1 provides an example of a simple directed graph (left) and its adjacency

matrix representation (middle).

Although mathematicians prefer the adjacency matrix representation of graphs

for easy manipulation with linear algebra, such a representation is far from ideal for

computer scientists concerned with eﬃcient algorithmic implementations. Most of the

applications discussed in the chapter introduction involve sparse graphs, where the

number of actual edges is far smaller than the number of possible edges.³ For example,

in a social network of n individuals, there are n(n −1) possible “friendships” (where n

may be on the order of hundreds of millions). However, even the most gregarious will

have relatively few friends compared to the size of the network (thousands, perhaps, but

still far smaller than hundreds of millions). The same is true for the hyperlink structure

of the web: each individual web page links to a minuscule portion of all the pages on the

³ Unfortunately, there is no precise definition of sparseness agreed upon by all, but one common definition is that a sparse graph has O(n) edges, where n is the number of vertices.


web. In this chapter, we assume processing of sparse graphs, although we will return to

this issue in Section 5.4.

The major problem with an adjacency matrix representation for sparse graphs

is its O(n²) space requirement. Furthermore, most of the cells are zero, by definition.

As a result, most computational implementations of graph algorithms operate over

adjacency lists, in which a node is associated with neighbors that can be reached via

outgoing edges. Figure 5.1 also shows the adjacency list representation of the graph

under consideration (on the right). For example, since n1 is connected by directed edges to n2 and n4, those two nodes will be on the adjacency list of n1. There are two options for encoding undirected graphs: one could simply encode each edge twice (if n_i and n_j are connected, each appears on the other’s adjacency list). Alternatively, one could order the nodes (arbitrarily or otherwise) and encode edges only on the adjacency list of the node that comes first in the ordering (i.e., if i < j, then n_j is on the adjacency list of n_i, but not the other way around).
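To make the two representations concrete, the directed graph of Figure 5.1 can be encoded as a dictionary of adjacency lists, from which the equivalent adjacency matrix is easily derived. This is an illustrative sketch in Python (node ids are strings for readability; the names are ours, not from any particular library):

```python
# Adjacency lists for the directed graph of Figure 5.1.
graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

def to_adjacency_matrix(adj):
    """Build the equivalent n x n 0/1 adjacency matrix (row = source node)."""
    nodes = sorted(adj)
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, neighbors in adj.items():
        for dst in neighbors:
            matrix[index[src]][index[dst]] = 1
    return matrix
```

For this graph, the row for n1 comes out as [0, 1, 0, 1, 0], matching the middle panel of Figure 5.1; the adjacency-list encoding stores only the five lists, while the matrix stores all 25 cells.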

Note that certain graph operations are easier on adjacency matrices than on ad-

jacency lists. In the ﬁrst, operations on incoming links for each node translate into a

column scan on the matrix, whereas operations on outgoing links for each node trans-

late into a row scan. With adjacency lists, it is natural to operate on outgoing links, but

computing anything that requires knowledge of the incoming links of a node is diﬃcult.

However, as we shall see, the shuﬄe and sort mechanism in MapReduce provides an

easy way to group edges by their destination nodes, thus allowing us to compute over

incoming edges within the reducer. This property of the execution framework can also

be used to invert the edges of a directed graph, by mapping over the nodes’ adjacency

lists and emitting key–value pairs with the destination node id as the key and the source

node id as the value.⁴
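The edge inversion just described can be simulated in a few lines of plain Python (this is a sketch of the idea, not Hadoop code): the “map” step emits (destination, source) pairs from each adjacency list, and grouping by key plays the role of shuffle and sort.

```python
from collections import defaultdict

def invert_edges(adj):
    """Invert a directed graph: map over adjacency lists, emitting
    (destination, source) pairs, then group by destination node --
    mimicking the MapReduce shuffle-and-sort phase."""
    emitted = []
    for src, neighbors in adj.items():      # the "map" phase
        for dst in neighbors:
            emitted.append((dst, src))
    inverted = defaultdict(list)            # grouping = shuffle and sort
    for dst, src in emitted:
        inverted[dst].append(src)
    return {dst: sorted(srcs) for dst, srcs in inverted.items()}

# The graph of Figure 5.1: n5's incoming edges come from n2 and n4.
graph = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
```

Applied to the Figure 5.1 graph, the inverted structure gives, for each node, the list of nodes that link to it, which is exactly what anchor text inversion requires.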

5.2 PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source

shortest path problem, where the task is to ﬁnd shortest paths from a source node to

all other nodes in the graph (or alternatively, edges can be associated with costs or

weights, in which case the task is to compute lowest-cost or lowest-weight paths). Such

problems are a staple in undergraduate algorithm courses, where students are taught the

solution using Dijkstra’s algorithm. However, this famous algorithm assumes sequential

processing—how would we solve this problem in parallel, and more speciﬁcally, with

MapReduce?

⁴ This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a particular page. It is common practice to enrich a web page’s standard textual representation with all of the anchor text associated with its incoming hyperlinks (e.g., [107]).


1: Dijkstra(G, w, s)
2:   for all vertex v ∈ V do
3:     d[v] ← ∞
4:   d[s] ← 0
5:   Q ← V
6:   while Q ≠ ∅ do
7:     u ← ExtractMin(Q)
8:     for all vertex v ∈ u.AdjacencyList do
9:       if d[v] > d[u] + w(u, v) then
10:        d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra’s algorithm, which is based on maintaining a global

priority queue of nodes with priorities equal to their distances from the source node. At each

iteration, the algorithm expands the node with the shortest distance and updates distances to

all reachable nodes.

As a refresher and also to serve as a point of comparison, Dijkstra’s algorithm is

shown in Figure 5.2, adapted from Cormen, Leiserson, and Rivest’s classic algorithms

textbook [41] (often simply known as CLR). The input to the algorithm is a directed,

connected graph G = (V, E) represented with adjacency lists, w containing edge dis-

tances such that w(u, v) ≥ 0, and the source node s. The algorithm begins by ﬁrst

setting distances to all vertices d[v], v ∈ V to ∞, except for the source node, whose

distance to itself is zero. The algorithm maintains Q, a global priority queue of vertices

with priorities equal to their distance values d.

Dijkstra’s algorithm operates by iteratively selecting the node with the lowest

current distance from the priority queue (initially, this is the source node). At each

iteration, the algorithm “expands” that node by traversing the adjacency list of the

selected node to see if any of those nodes can be reached with a path of a shorter

distance. The algorithm terminates when the priority queue Q is empty, or equivalently,

when all nodes have been considered. Note that the algorithm as presented in Figure 5.2

only computes the shortest distances. The actual paths can be recovered by storing

“backpointers” for every node indicating a fragment of the shortest path.

A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also adapted from CLR). We start out in (a) with n1 having a distance of zero (since it’s the source) and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected for expansion. Nodes we have already considered for expansion are shown in black. Expanding n3, we see in


[Figure 5.3 appears here: six panels (a)–(f) of the five-node graph, showing the current distance inside each node at each iteration.]

Figure 5.3: Example of Dijkstra’s algorithm applied to a simple graph with five nodes, with n1 as the source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm at each iteration, with the current distance inside the node. Nodes with thicker borders are those being expanded; nodes that have already been expanded are shown in black.

(c) that the distance to n2 has decreased because we’ve found a shorter path. The nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with

the end state shown in (f), where we’ve discovered the shortest distance to all nodes.
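For later comparison with the MapReduce approach, the sequential algorithm of Figure 5.2 can be implemented directly, with a binary heap standing in for the priority queue. This is a sketch (the adjacency-list encoding, with weighted edges as (neighbor, weight) pairs, is our own convention for illustration):

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest distances per Figure 5.2.
    adj maps node -> list of (neighbor, weight) pairs."""
    d = {v: float("inf") for v in adj}
    d[source] = 0
    queue = [(0, source)]                 # priority queue keyed on distance
    while queue:
        dist_u, u = heapq.heappop(queue)
        if dist_u > d[u]:                 # stale queue entry; skip it
            continue
        for v, w in adj[u]:
            if d[v] > d[u] + w:           # found a shorter path to v
                d[v] = d[u] + w
                heapq.heappush(queue, (d[v], v))
    return d

# The five-node example of Figure 5.3, with edge distances as indicated there.
example = {
    "n1": [("n2", 10), ("n3", 5)],
    "n2": [("n3", 2), ("n4", 1)],
    "n3": [("n2", 3), ("n4", 9), ("n5", 2)],
    "n4": [("n5", 4)],
    "n5": [("n1", 7), ("n4", 6)],
}
```

Running dijkstra(example, "n1") reproduces the final distances of panel (f) in Figure 5.3: 0, 8, 5, 9, and 7 for n1 through n5. Note one small departure from the pseudo-code: instead of a decrease-key operation, the heap simply accumulates stale entries, which are skipped on extraction.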

The key to Dijkstra’s algorithm is the priority queue that maintains a globally-

sorted list of nodes by current distance. This is not possible in MapReduce, as the

programming model does not provide a mechanism for exchanging global data. Instead,

we adopt a brute force approach known as parallel breadth-ﬁrst search. First, as a

simpliﬁcation let us assume that all edges have unit distance (modeling, for example,

hyperlinks on the web). This makes the algorithm easier to understand, but we’ll relax

this restriction later.

The intuition behind the algorithm is this: the distance of all nodes connected

directly to the source node is one; the distance of all nodes directly connected to those

is two; and so on. Imagine water rippling away from a rock dropped into a pond—

that’s a good image of how parallel breadth-ﬁrst search works. However, what if there

are multiple paths to the same node? Suppose we wish to compute the shortest distance


to node n. The shortest path must go through one of the nodes in M that contains an outgoing edge to n: we need to examine all m ∈ M to find m_s, the node with the shortest distance. The shortest distance to n is the distance to m_s plus one.

Pseudo-code for the implementation of the parallel breadth-ﬁrst search algorithm

is provided in Figure 5.4. As with Dijkstra’s algorithm, we assume a connected, directed

graph represented as adjacency lists. Distance to each node is directly stored alongside

the adjacency list of that node, and initialized to ∞ for all nodes except for the source

node. In the pseudo-code, we use n to denote the node id (an integer) and N to denote

the node’s corresponding data structure (adjacency list and current distance). The

algorithm works by mapping over all nodes and emitting a key-value pair for each

neighbor on the node’s adjacency list. The key contains the node id of the neighbor,

and the value is the current distance to the node plus one. This says: if we can reach node

n with a distance d, then we must be able to reach all the nodes that are connected to

n with distance d + 1. After shuﬄe and sort, reducers will receive keys corresponding to

the destination node ids and distances corresponding to all paths leading to that node.

The reducer will select the shortest of these distances and then update the distance in

the node data structure.

It is apparent that parallel breadth-ﬁrst search is an iterative algorithm, where

each iteration corresponds to a MapReduce job. The ﬁrst time we run the algorithm, we

“discover” all nodes that are connected to the source. The second iteration, we discover

all nodes connected to those, and so on. Each iteration of the algorithm expands the

“search frontier” by one hop, and, eventually, all nodes will be discovered with their

shortest distances (assuming a fully-connected graph). Before we discuss termination

of the algorithm, there is one more detail required to make the parallel breadth-ﬁrst

search algorithm work. We need to “pass along” the graph structure from one iteration

to the next. This is accomplished by emitting the node data structure itself, with the

node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish

the node data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and

update the minimum distance in the node data structure before emitting it as the ﬁnal

value. The final output is now ready to serve as input to the next iteration.⁵

So how many iterations are necessary to compute the shortest distance to all

nodes? The answer is the diameter of the graph, or the greatest distance between any

pair of nodes. This number is surprisingly small for many real-world problems: the

saying “six degrees of separation” suggests that everyone on the planet is connected to

everyone else by at most six steps (the people a person knows are one step away, people

that they know are two steps away, etc.). If this is indeed true, then parallel breadth-

ﬁrst search on the global social network would take at most six MapReduce iterations.

⁵ Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a complex data structure representing a node. The best way to achieve this in Hadoop is to create a wrapper object with an indicator variable specifying what the content is.


1: class Mapper
2:   method Map(nid n, node N)
3:     d ← N.Distance
4:     Emit(nid n, N)                        ▷ Pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, d + 1)                  ▷ Emit distances to reachable nodes

1: class Reducer
2:   method Reduce(nid m, [d1, d2, . . .])
3:     dmin ← ∞
4:     M ← ∅
5:     for all d ∈ [d1, d2, . . .] do
6:       if IsNode(d) then
7:         M ← d                             ▷ Recover graph structure
8:       else if d < dmin then               ▷ Look for shorter distance
9:         dmin ← d
10:    M.Distance ← dmin                     ▷ Update shortest distance
11:    Emit(nid m, node M)

Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit distances to reachable nodes, while the reducers select the minimum of those distances for each destination node. Each iteration (one MapReduce job) of the algorithm expands the “search frontier” by one hop.
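A minimal in-memory simulation of one such iteration follows (a sketch, not Hadoop code): tagged tuples stand in for the wrapper object of footnote 5, and a dictionary of lists stands in for shuffle and sort. One guard is added relative to the pseudo-code: the reducer keeps the smaller of the incoming minimum and the node’s current distance, so an already-settled node (notably the source) is never overwritten by a longer path arriving later.

```python
from collections import defaultdict

INF = float("inf")

def bfs_iteration(nodes):
    """One MapReduce iteration of parallel breadth-first search.
    nodes maps nid -> {"dist": d, "adj": [...]}; returns the updated graph."""
    emitted = []
    for n, N in nodes.items():                    # mapper
        emitted.append((n, ("node", N)))          # pass along graph structure
        for m in N["adj"]:
            emitted.append((m, ("dist", N["dist"] + 1)))
    grouped = defaultdict(list)                   # shuffle and sort
    for key, value in emitted:
        grouped[key].append(value)
    result = {}
    for m, values in grouped.items():             # reducer
        d_min, M = INF, None
        for tag, v in values:
            if tag == "node":
                M = dict(v)                       # recover graph structure
            elif v < d_min:                       # look for shorter distance
                d_min = v
        M["dist"] = min(M["dist"], d_min)         # update shortest distance
        result[m] = M
    return result

# The graph of Figure 5.1, with n1 as the source.
graph = {
    "n1": {"dist": 0,   "adj": ["n2", "n4"]},
    "n2": {"dist": INF, "adj": ["n3", "n5"]},
    "n3": {"dist": INF, "adj": ["n4"]},
    "n4": {"dist": INF, "adj": ["n5"]},
    "n5": {"dist": INF, "adj": ["n1", "n2", "n3"]},
}
```

One application of bfs_iteration discovers n2 and n4 at distance 1; a second application discovers n3 and n5 at distance 2, illustrating the frontier expanding one hop per job.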

For more serious academic studies of “small world” phenomena in networks, we refer

the reader to a number of publications [61, 62, 152, 2]. In practical terms, we iterate

the algorithm until there are no more node distances that are ∞. Since the graph is

connected, all nodes are reachable, and since all edge distances are one, all discovered

nodes are guaranteed to have the shortest distances (i.e., there is not a shorter path

that goes through a node that hasn’t been discovered).

The actual checking of the termination condition must occur outside of Map-

Reduce. Typically, execution of an iterative MapReduce algorithm requires a non-

MapReduce “driver” program, which submits a MapReduce job to iterate the algorithm,

checks to see if a termination condition has been met, and if not, repeats. Hadoop pro-

vides a lightweight API for constructs called “counters”, which, as the name suggests,

can be used for counting events that occur during execution, e.g., number of corrupt

records, number of times a certain condition is met, or anything that the programmer

desires. Counters can be deﬁned to count the number of nodes that have distances of

∞: at the end of the job, the driver program can access the ﬁnal counter value and

check to see if another iteration is necessary.


[Figure 5.5 diagram appears here: source s, nodes p, q, and r, and the current search frontier.]

Figure 5.5: In the single source shortest path problem with arbitrary edge distances, the

shortest path from source s to node r may go outside the current search frontier, in which case

we will not ﬁnd the shortest distance to r until the search frontier expands to cover q.

Finally, as with Dijkstra’s algorithm in the form presented earlier, the parallel

breadth-ﬁrst search algorithm only ﬁnds the shortest distances, not the actual shortest

paths. However, the path can be straightforwardly recovered. Storing “backpointers”

at each node, as with Dijkstra’s algorithm, will work, but may not be eﬃcient since

the graph needs to be traversed again to reconstruct the path segments. A simpler

approach is to emit paths along with distances in the mapper, so that each node will

have its shortest path easily accessible at all times. The additional space requirements

for shuﬄing these data from mappers to reducers are relatively modest, since for the

most part paths (i.e., sequence of node ids) are relatively short.

Up until now, we have been assuming that all edges are unit distance. Let us relax

that restriction and see what changes are required in the parallel breadth-ﬁrst search

algorithm. The adjacency lists, which were previously lists of node ids, must now encode

the edge distances as well. In line 6 of the mapper code in Figure 5.4, instead of emitting

d + 1 as the value, we must now emit d + w where w is the edge distance. No other

changes to the algorithm are required, but the termination behavior is very diﬀerent.

To illustrate, consider the graph fragment in Figure 5.5, where s is the source node,

and in this iteration, we just “discovered” node r for the very ﬁrst time. Assume for

the sake of argument that we’ve already discovered the shortest distance to node p, and

that the shortest distance to r so far goes through p. This, however, does not guarantee

that we’ve discovered the shortest distance to r, since there may exist a path going

through q that we haven’t encountered yet (because it lies outside the search frontier).⁶

However, as the search frontier expands, we’ll eventually cover q and all other nodes

along the path from p to q to r—which means that with suﬃcient iterations, we will

discover the shortest distance to r. But how do we know that we’ve found the shortest

distance to p? Well, if the shortest path to p lies within the search frontier, we would

⁶ Note that the same argument does not apply to the unit edge distance case: the shortest path cannot lie outside the search frontier, since any such path would necessarily be longer.


[Figure 5.6 appears here: a nine-node graph n1–n9 in which most edges have distance 1 and one edge has distance 10.]

Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search. Eight iterations are required to discover shortest distances to all nodes from n1.

have already discovered it. And if it doesn’t, the above argument applies. Similarly, we

can repeat the same argument for all nodes on the path from s to p. The conclusion is

that, with suﬃcient iterations, we’ll eventually discover all the shortest distances.

So exactly how many iterations does “eventually” mean? In the worst case, we

might need as many iterations as there are nodes in the graph minus one. In fact, it

is not diﬃcult to construct graphs that will elicit this worse-case behavior: Figure 5.6

provides an example, with n

1

as the source. The parallel breadth-ﬁrst search algorithm

would not discover that the shortest path from n

1

to n

6

goes through n

3

, n

4

, and n

5

until the ﬁfth iteration. Three more iterations are necessary to cover the rest of the

graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the

number of iterations necessary to discover all shortest distances is quite close to the

diameter of the graph, as in the unit edge distance case.

In practical terms, how do we know when to stop iterating in the case of arbitrary

edge distances? The algorithm can terminate when shortest distances at every node no

longer change. Once again, we can use counters to keep track of such events. Every time

we encounter a shorter distance in the reducer, we increment a counter. At the end of

each MapReduce iteration, the driver program reads the counter value and determines

if another iteration is necessary.
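Putting the pieces together, the weighted case and its driver loop can be simulated in ordinary Python. In the sketch below (not Hadoop’s actual Counter API), each pass over the nodes plays the role of one MapReduce job, and a counter of updated distances decides whether to iterate again:

```python
INF = float("inf")

def weighted_bfs(adj, source):
    """Iterative 'driver' for parallel breadth-first search with edge weights.
    adj maps node -> list of (neighbor, weight) pairs."""
    d = {v: (0 if v == source else INF) for v in adj}
    while True:
        updates = 0                      # plays the role of a Hadoop counter
        new_d = dict(d)
        for u in adj:                    # one pass = one "MapReduce job"
            if d[u] == INF:
                continue                 # node not yet discovered
            for v, w in adj[u]:          # mapper emits d[u] + w to neighbors
                if d[u] + w < new_d[v]:  # reducer keeps the minimum
                    new_d[v] = d[u] + w
                    updates += 1
        d = new_d
        if updates == 0:                 # counter unchanged: terminate
            return d

# The five-node weighted example of Figure 5.3.
example = {
    "n1": [("n2", 10), ("n3", 5)],
    "n2": [("n3", 2), ("n4", 1)],
    "n3": [("n2", 3), ("n4", 9), ("n5", 2)],
    "n4": [("n5", 4)],
    "n5": [("n1", 7), ("n4", 6)],
}
```

On the Figure 5.3 example this converges to the same distances Dijkstra’s algorithm finds, but only after several passes, most of whose work repeats earlier computations, which is exactly the inefficiency discussed above.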

Compared to Dijkstra’s algorithm on a single processor, parallel breadth-ﬁrst

search in MapReduce can be characterized as a brute force approach that “wastes” a

lot of time performing computations whose results are discarded. At each iteration, the

algorithm attempts to recompute distances to all nodes, but in reality only useful work

is done along the search frontier: inside the search frontier, the algorithm is simply

repeating previous computations.⁷ Outside the search frontier, the algorithm hasn’t

⁷ Unless the algorithm discovers an instance of the situation described in Figure 5.5, in which case updated distances will propagate inside the search frontier.


discovered any paths to nodes there yet, so no meaningful work is done. Dijkstra’s

algorithm, on the other hand, is far more eﬃcient. Every time a node is explored, we’re

guaranteed to have already found the shortest path to it. However, this is made possible

by maintaining a global data structure (a priority queue) that holds nodes sorted by

distance—this is not possible in MapReduce because the programming model does not

provide support for global data that is mutable and accessible by the mappers and

reducers. These ineﬃciencies represent the cost of parallelization.

The parallel breadth-ﬁrst search algorithm is instructive in that it represents the

prototypical structure of a large class of graph algorithms in MapReduce. They share

the following characteristics:

• The graph structure is represented with adjacency lists, which are part of some larger

node data structure that may contain additional information (variables to store

intermediate output, features of the nodes). In many cases, features are attached

to edges as well (e.g., edge weights).

• The MapReduce algorithm maps over the node data structures and performs a

computation that is a function of features of the node, intermediate output at-

tached to each node, and features of the adjacency list (outgoing edges and their

features). In other words, computations can only involve a node’s internal state

and its local graph structure. The results of these computations are emitted as val-

ues, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency

lists). Conceptually, we can think of this as “passing” the results of the computa-

tion along outgoing edges. In the reducer, the algorithm receives all partial results

that have the same destination node, and performs another computation (usually,

some form of aggregation).

• In addition to computations, the graph itself is also passed from the mapper to the

reducer. In the reducer, the data structure corresponding to each node is updated

and written back to disk.

• Graph algorithms in MapReduce are generally iterative, where the output of the

previous iteration serves as input to the next iteration. The process is controlled

by a non-MapReduce driver program that checks for termination.

For parallel breadth-ﬁrst search, the mapper computation is the current distance plus

edge distance (emitting distances to neighbors), while the reducer computation is the

Min function (selecting the shortest path). As we will see in the next section, the

MapReduce algorithm for PageRank works in much the same way.


5.3 PAGERANK

PageRank [117] is a measure of web page quality based on the structure of the hyperlink

graph. Although it is only one of thousands of features that are taken into account in

Google’s search algorithm, it is perhaps one of the best known and most studied.

A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer

visits a page, randomly clicks a link on that page, and repeats ad inﬁnitum. Page-

Rank is a measure of how frequently a page would be encountered by our tireless web

surfer. More precisely, PageRank is a probability distribution over nodes in the graph

representing the likelihood that a random walk over the link structure will arrive at a

particular node. Nodes that have high in-degrees tend to have high PageRank values,

as do nodes that are linked to by other nodes with high PageRank values. This

behavior makes intuitive sense: if PageRank is a measure of page quality, we would ex-

pect high-quality pages to contain “endorsements” from many other pages in the form

of hyperlinks. Similarly, if a high-quality page links to another page, then the second

page is likely to be high quality also. PageRank represents one particular approach to

inferring the quality of a web page based on hyperlink structure; two other popular

algorithms, not covered here, are SALSA [88] and HITS [84] (also known as “hubs and

authorities”).

The complete formulation of PageRank includes an additional component. As it

turns out, our web surfer doesn’t just randomly click links. Before the surfer decides

where to go next, a biased coin is ﬂipped—heads, the surfer clicks on a random link on

the page as usual. Tails, however, the surfer ignores the links on the page and randomly

“jumps” or “teleports” to a completely diﬀerent page.

But enough about random web surﬁng. Formally, the PageRank P of a page n is

deﬁned as follows:

P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m)/C(m)        (5.1)

where |G| is the total number of nodes (pages) in the graph, α is the random jump

factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of node m

(the number of links on page m). The random jump factor α is sometimes called the

“teleportation” factor; alternatively, (1 −α) is referred to as the “damping” factor.

Let us break down each component of the formula in detail. First, note that

PageRank is deﬁned recursively—this gives rise to an iterative algorithm we will detail

in a bit. A web page n receives PageRank “contributions” from all pages that link to

it, L(n). Let us consider a page m from the set of pages L(n): a random surfer at

m will arrive at n with probability 1/C(m) since a link is selected at random from all

outgoing links. Since the PageRank value of m is the probability that the random surfer

will be at m, the probability of arriving at n from m is P(m)/C(m). To compute the


PageRank of n, we need to sum contributions from all pages that link to n. This is

the summation in the second half of the equation. However, we also need to take into

account the random jump: there is a 1/|G| chance of landing at any particular page, where |G| is the number of nodes in the graph. Of course, the two contributions need to

be combined: with probability α the random surfer executes a random jump, and with

probability 1 −α the random surfer follows a hyperlink.

Note that PageRank assumes a community of honest users who are not trying to

“game” the measure. This is, of course, not true in the real world, where an adversarial

relationship exists between search engine companies and a host of other organizations

and individuals (marketers, spammers, activists, etc.) who are trying to manipulate

search results—to promote a cause, product, or service, or in some cases, to trap and

intentionally deceive users (see, for example, [12, 63]). A simple example is a so-called

“spider trap”, an infinite chain of pages (e.g., generated by CGI) that all link to a single

page (thereby artiﬁcially inﬂating its PageRank). For this reason, PageRank is only one

of thousands of features used in ranking web pages.

The fact that PageRank is recursively deﬁned translates into an iterative algo-

rithm which is quite similar in basic structure to parallel breadth-ﬁrst search. We start

by presenting an informal sketch. At the beginning of each iteration, a node passes its

PageRank contributions to other nodes that it is connected to. Since PageRank is a

probability distribution, we can think of this as spreading probability mass to neigh-

bors via outgoing links. To conclude the iteration, each node sums up all PageRank

contributions that have been passed to it and computes an updated PageRank score.

We can think of this as gathering probability mass passed to a node via its incoming

links. This algorithm iterates until PageRank values don’t change anymore.

Figure 5.7 shows a toy example that illustrates two iterations of the algorithm. As a simplification, we ignore the random jump factor for now (i.e., α = 0) and further assume that there are no dangling nodes (i.e., nodes with no outgoing edges). The algorithm begins by initializing a uniform distribution of PageRank values across nodes. In the beginning of the first iteration (top, left), partial PageRank contributions are sent from each node to its neighbors connected via outgoing links. For example, n1 sends 0.1 PageRank mass to n2 and 0.1 PageRank mass to n4. This makes sense in terms of the random surfer model: if the surfer is at n1 with a probability of 0.2, then the surfer could end up either in n2 or n4 with a probability of 0.1 each. The same occurs for all the other nodes in the graph: note that n5 must split its PageRank mass three ways, since it has three neighbors, and n4 receives all the mass belonging to n3 because n3 isn't connected to any other node. The end of the first iteration is shown in the top right: each node sums up PageRank contributions from its neighbors. Note that since n1 has only one incoming link, from n5, its updated PageRank value is smaller than before, i.e., it “passed along” more PageRank mass than it received. The exact same


Figure 5.7: PageRank toy example showing two iterations, top and bottom. Left graphs show

PageRank values at the beginning of each iteration and how much PageRank mass is passed to

each neighbor. Right graphs show updated PageRank values at the end of each iteration.

process repeats, and the second iteration in our toy example is illustrated by the bottom

two graphs. At the beginning of each iteration, the PageRank values of all nodes sum

to one. PageRank mass is preserved by the algorithm, guaranteeing that we continue

to have a valid probability distribution at the end of each iteration.

Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is

simpliﬁed in that we continue to ignore the random jump factor and assume no dangling

nodes (complications that we will return to later). An illustration of the running algo-

rithm is shown in Figure 5.9 for the ﬁrst iteration of the toy graph in Figure 5.7. The

algorithm maps over the nodes, and for each node computes how much PageRank mass

needs to be distributed to its neighbors (i.e., nodes on the adjacency list). Each piece

of the PageRank mass is emitted as the value, keyed by the node ids of the neighbors.

Conceptually, we can think of this as passing PageRank mass along outgoing edges.

In the shuﬄe and sort phase, the MapReduce execution framework groups values

(pieces of PageRank mass) passed along the graph edges by destination node (i.e., all

edges that point to the same node). In the reducer, PageRank mass contributions from

all incoming edges are summed to arrive at the updated PageRank value for each node.


1: class Mapper
2: method Map(nid n, node N)
3: p ← N.PageRank/|N.AdjacencyList|
4: Emit(nid n, N) ▷ Pass along graph structure
5: for all nodeid m ∈ N.AdjacencyList do
6: Emit(nid m, p) ▷ Pass PageRank mass to neighbors

1: class Reducer
2: method Reduce(nid m, [p1, p2, . . .])
3: M ← ∅; s ← 0
4: for all p ∈ [p1, p2, . . .] do
5: if IsNode(p) then
6: M ← p ▷ Recover graph structure
7: else
8: s ← s + p ▷ Sum incoming PageRank contributions
9: M.PageRank ← s
10: Emit(nid m, node M)

Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the

random jump factor). In the map phase we evenly divide up each node’s PageRank mass and

pass each piece along outgoing edges to neighbors. In the reduce phase PageRank contributions

are summed up at each destination node. Each MapReduce job corresponds to one iteration of

the algorithm.
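To make the data flow concrete, the pseudo-code above can be simulated in a few lines of plain Python; the grouping dictionary below stands in for MapReduce's shuffle and sort, and the tuple-based node representation is an illustrative assumption rather than anything Hadoop-specific:

```python
from collections import defaultdict

def map_fn(nid, node):
    pagerank, adj = node
    yield nid, ('node', node)        # pass along graph structure
    p = pagerank / len(adj)          # divide mass evenly among neighbors
    for m in adj:
        yield m, ('mass', p)         # pass PageRank mass to neighbors

def reduce_fn(nid, values):
    s, adj = 0.0, None
    for kind, v in values:
        if kind == 'node':
            adj = v[1]               # recover graph structure
        else:
            s += v                   # sum incoming PageRank contributions
    return nid, (s, adj)

# Toy graph from Figure 5.7: nid -> (PageRank value, adjacency list)
graph = {1: (0.2, [2, 4]), 2: (0.2, [3, 5]), 3: (0.2, [4]),
         4: (0.2, [5]), 5: (0.2, [1, 2, 3])}

# One iteration: map, then shuffle/sort (grouping by key), then reduce.
groups = defaultdict(list)
for nid, node in graph.items():
    for key, value in map_fn(nid, node):
        groups[key].append(value)
graph = dict(reduce_fn(nid, vals) for nid, vals in groups.items())
```

After one iteration the values match the top-right graph of Figure 5.7: n1 receives about 0.066, n2 and n3 about 0.166, and n4 and n5 receive 0.3 each.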


Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the ﬁrst

iteration in Figure 5.7. The size of each box is proportional to its PageRank value. During the

map phase, PageRank mass is distributed evenly to nodes on each node’s adjacency list (shown

at the very top). Intermediate values are keyed by node (shown inside the boxes). In the reduce

phase, all partial PageRank contributions are summed together to arrive at updated values.

As with the parallel breadth-ﬁrst search algorithm, the graph structure itself must be

passed from iteration to iteration. Each node data structure is emitted in the mapper

and written back out to disk in the reducer. All PageRank mass emitted by the mappers

is accounted for in the reducer: since we begin with the sum of PageRank values across

all nodes equal to one, the sum of all the updated PageRank values should remain a

valid probability distribution.

Having discussed the simpliﬁed PageRank algorithm in MapReduce, let us now

take into account the random jump factor and dangling nodes: as it turns out both are

treated similarly. Dangling nodes are nodes in the graph that have no outgoing edges,

i.e., their adjacency lists are empty. In the hyperlink graph of the web, these might

correspond to pages in a crawl that have not been downloaded yet. If we simply run

the algorithm in Figure 5.8 on graphs with dangling nodes, the total PageRank mass

will not be conserved, since no key-value pairs will be emitted when a dangling node is

encountered in the mappers.

The proper treatment of PageRank mass “lost” at the dangling nodes is to re-

distribute it across all nodes in the graph evenly (cf. [22]). There are many ways to

determine the missing PageRank mass. One simple approach is by instrumenting the

algorithm in Figure 5.8 with counters: whenever the mapper processes a node with an

empty adjacency list, it keeps track of the node’s PageRank value in the counter. At

the end of the iteration, we can access the counter to ﬁnd out how much PageRank


mass was lost at the dangling nodes.^8 Another approach is to reserve a special key for

storing PageRank mass from dangling nodes. When the mapper encounters a dangling

node, its PageRank mass is emitted with the special key; the reducer must be modiﬁed

to contain special logic for handling the missing PageRank mass. Yet another approach

is to write out the missing PageRank mass as “side data” for each map task (using

the in-mapper combining technique for aggregation); a ﬁnal pass in the driver program

is needed to sum the mass across all map tasks. Either way, we arrive at the amount

of PageRank mass lost at the dangling nodes—this then must be redistributed evenly

across all nodes.

This redistribution process can be accomplished by mapping over all nodes again.

At the same time, we can take into account the random jump factor. For each node, its

current PageRank value p is updated to the final PageRank value p′ according to the following formula:

p′ = α (1/|G|) + (1 − α) (m/|G| + p)    (5.2)

where m is the missing PageRank mass, and |G| is the number of nodes in the entire graph. We add the PageRank mass from link traversal (p, computed from before) to the share of the lost PageRank mass that is distributed to each node (m/|G|). Finally,

we take into account the random jump factor: with probability α the random surfer

arrives via jumping, and with probability 1 −α the random surfer arrives via incoming

links. Note that this MapReduce job requires no reducers.
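A sketch of this second, reducer-less pass, with Equation 5.2 applied as a pure map over nodes (the concrete numbers, α, and the function name are made up for illustration):

```python
# Second pass: fold the missing mass m (lost at dangling nodes) and the
# random jump factor alpha back into every node's PageRank (Equation 5.2).
def apply_jump_and_missing_mass(pr, m, alpha):
    G = len(pr)  # |G|: total number of nodes in the graph
    return {n: alpha * (1.0 / G) + (1 - alpha) * (m / G + p)
            for n, p in pr.items()}

# Suppose the first job conserved only 0.9 of the mass (0.1 lost at danglers):
pr = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.2}
pr = apply_jump_and_missing_mass(pr, m=0.1, alpha=0.15)
assert abs(sum(pr.values()) - 1.0) < 1e-9  # total mass restored to one
```

Note that the update touches every node independently, which is why this job needs only mappers.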

Putting everything together, one iteration of PageRank requires two MapReduce

jobs: the ﬁrst to distribute PageRank mass along graph edges, and the second to take

care of dangling nodes and the random jump factor. At the end of each iteration, we end

up with exactly the same data structure as the beginning, which is a requirement for

the iterative algorithm to work. Also, the PageRank values of all nodes sum up to one,

which ensures a valid probability distribution.

Typically, PageRank is iterated until convergence, i.e., when the PageRank values

of nodes no longer change (within some tolerance, to take into account, for example,

ﬂoating point precision errors). Therefore, at the end of each iteration, the PageRank

driver program must check to see if convergence has been reached. Alternative stopping

criteria include running a ﬁxed number of iterations (useful if one wishes to bound

algorithm running time) or stopping when the ranks of PageRank values no longer

change. The latter is useful for some applications that only care about comparing the

PageRank of two arbitrary pages and do not need the actual PageRank values. Rank

stability is obtained faster than the actual convergence of values.
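A driver loop of this kind might look as follows (a sketch; `run_iteration` stands for the pair of MapReduce jobs described above, and the tolerance and iteration cap are arbitrary choices):

```python
def pagerank_driver(graph, run_iteration, tol=1e-6, max_iterations=50):
    """Iterate until values stabilize, or until an iteration budget runs out."""
    pr = {n: 1.0 / len(graph) for n in graph}  # uniform initialization
    for _ in range(max_iterations):
        new_pr = run_iteration(graph, pr)      # one full PageRank iteration
        delta = max(abs(new_pr[n] - pr[n]) for n in graph)
        pr = new_pr
        if delta < tol:                        # convergence check
            break
    return pr
```

Swapping the `delta` computation for a comparison of node rankings would implement the rank-stability stopping criterion instead.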

^8 In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast as an integer.


In absolute terms, how many iterations are necessary for PageRank to converge?

This is a diﬃcult question to precisely answer since it depends on many factors, but

generally, fewer than one might expect. In the original PageRank paper [117], conver-

gence on a graph with 322 million edges was reached in 52 iterations (see also Bianchini

et al. [22] for additional discussion). On today’s web, the answer is not very meaningful

due to the adversarial nature of web search as previously discussed—the web is full

of spam and populated with sites that are actively trying to “game” PageRank and

related hyperlink-based metrics. As a result, running PageRank in its unmodiﬁed form

presented here would yield unexpected and undesirable results. Of course, strategies

developed by web search companies to combat link spam are proprietary (and closely-

guarded secrets, for obvious reasons)—but undoubtedly these algorithmic modiﬁcations

impact convergence behavior. A full discussion of the escalating “arms race” between

search engine companies and those that seek to promote their sites is beyond the scope of this book.^9

5.4 ISSUES WITH GRAPH PROCESSING

The biggest diﬀerence between MapReduce graph algorithms and single-machine graph

algorithms is that with the latter, it is usually possible to maintain global data structures

in memory for fast, random access. For example, Dijkstra’s algorithm uses a global

priority queue that guides the expansion of nodes. This, of course, is not possible with

MapReduce—the programming model does not provide any built-in mechanism for

communicating global state. Since the most natural representation of large sparse graphs

is with adjacency lists, communication can only occur from a node to the nodes it links

to, or to a node from nodes linked to it—in other words, passing information is only possible within the local graph structure.^10

This restriction gives rise to the structure of many graph algorithms in Map-

Reduce: local computation is performed on each node, the results of which are “passed”

to its neighbors. With multiple iterations, convergence on the global graph is possible.

The passing of partial results along a graph edge is accomplished by the shuﬄing and

sorting provided by the MapReduce execution framework. The amount of intermediate

data generated is on the order of the number of edges, which explains why all the algo-

rithms we have discussed assume sparse graphs. For dense graphs, MapReduce running

time would be dominated by copying intermediate data across the network, which in the

worst case is O(n²) in the number of nodes in the graph. Since MapReduce clusters are

^9 For the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great starting points into the literature.

^10 Of course, it is perfectly reasonable to compute derived graph structures in a pre-processing step. For example, if one wishes to propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G′, where there would exist a link from node n_i to n_j if n_j was reachable within two link traversals of n_i in the original graph G.


designed around commodity networks (e.g., gigabit Ethernet), MapReduce algorithms

are often impractical on large, dense graphs.

Combiners and the in-mapper combining pattern described in Section 3.1 can

be used to decrease the running time of graph iterations. It is straightforward to use

combiners for both parallel breadth-ﬁrst search and PageRank since Min and sum,

used in the two algorithms, respectively, are both associative and commutative. How-

ever, combiners are only eﬀective to the extent that there are opportunities for partial

aggregation—unless there are nodes pointed to by multiple nodes being processed by

an individual map task, combiners are not very useful. This implies that it would be

desirable to partition large graphs into smaller components where there are many intra-

component links and fewer inter-component links. This way, we can arrange the data

such that nodes in the same component are handled by the same map task—thus max-

imizing opportunities for combiners to perform local aggregation.
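In-mapper combining for PageRank mass can be sketched as follows; the `partition` argument models the set of nodes handled by one map task, and all names are illustrative:

```python
from collections import defaultdict

def map_task_with_combining(partition, pagerank, adjacency):
    """Emit one combined value per destination node for a whole map task.

    Because addition is associative and commutative, partial sums can be
    accumulated locally before anything is shuffled across the network.
    """
    totals = defaultdict(float)
    for n in partition:
        share = pagerank[n] / len(adjacency[n])
        for m in adjacency[n]:
            totals[m] += share          # local aggregation (the "combine")
    return dict(totals)                 # one (destination, mass) pair each
```

If two nodes in the same partition both link to some node n, their two contributions leave the map task as a single value, which is exactly the partial aggregation opportunity discussed above.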

Unfortunately, this sometimes creates a chicken-and-egg problem. It would be desirable to partition a large graph to facilitate efficient processing by MapReduce. But the

graph may be so large that we can’t partition it except with MapReduce algorithms!

Fortunately, in many cases there are simple solutions around this problem in the form

of “cheap” partitioning heuristics based on reordering the data [106]. For example, in

a social network, we might sort nodes representing users by zip code, as opposed to

by last name—based on the observation that friends tend to live close to each other.

Sorting by an even more cohesive property such as school would be even better (if

available): the probability of any two random students from the same school knowing

each other is much higher than two random students from diﬀerent schools. Another

good example is to partition the web graph by the language of the page (since pages in

one language tend to link mostly to other pages in that language) or by domain name

(since inter-domain links are typically much denser than intra-domain links). Resorting

records using MapReduce is both easy to do and a relatively cheap operation—however,

whether the eﬃciencies gained by this crude form of partitioning are worth the extra

time taken in performing the resort is an empirical question that will depend on the

actual graph structure and algorithm.

Finally, there is a practical consideration to keep in mind when implementing

graph algorithms that estimate probability distributions over nodes (such as Page-

Rank). For large graphs, the probability of any particular node is often so small that

it underﬂows standard ﬂoating point representations. A very common solution to this

problem is to represent probabilities using their logarithms. When probabilities are

stored as logs, the product of two values is simply their sum. However, addition of

probabilities is also necessary, for example, when summing PageRank contribution for

a node. This can be implemented with reasonable precision as follows:


a ⊕ b = b + log(1 + e^(a−b))   if a < b
a ⊕ b = a + log(1 + e^(b−a))   if a ≥ b

Furthermore, many math libraries include a log1p function which computes log(1 + x)

with higher precision than the naïve implementation would have when x is very small

(as is often the case when working with probabilities). Its use may further improve the

accuracy of implementations that use log probabilities.
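The ⊕ operation and the log1p refinement can be written directly (a minimal sketch):

```python
import math

def log_add(a, b):
    """Given a = log p and b = log q, return log(p + q) without underflow."""
    if a < b:
        a, b = b, a                         # ensure the exponent b - a <= 0
    return a + math.log1p(math.exp(b - a))  # a + log(1 + e^(b - a))

# log(0.5) "plus" log(0.5) should equal log(1.0) = 0.0
assert abs(log_add(math.log(0.5), math.log(0.5))) < 1e-12
```

Factoring out the larger argument keeps the exponent non-positive, so `exp` never overflows, and `log1p` preserves precision when the smaller probability is tiny.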

5.5 SUMMARY AND ADDITIONAL READINGS

This chapter covers graph algorithms in MapReduce, discussing in detail parallel

breadth-ﬁrst search and PageRank. Both are instances of a large class of iterative algo-

rithms that share the following characteristics:

• The graph structure is represented with adjacency lists.

• Algorithms map over nodes and pass partial results to nodes on their adjacency

lists. Partial results are aggregated for each node in the reducer.

• The graph structure itself is passed from the mapper to the reducer, such that the

output is in the same form as the input.

• Algorithms are iterative and under the control of a non-MapReduce driver pro-

gram, which checks for termination at the end of each iteration.

The MapReduce programming model does not provide a mechanism to maintain global

data structures accessible and mutable by all the mappers and reducers.^11 One implication of this is that communication between pairs of arbitrary nodes is difficult to

accomplish. Instead, information typically propagates along graph edges—which gives

rise to the structure of algorithms discussed above.

Additional Readings. The ubiquity of large graphs translates into substantial inter-

est in scalable graph algorithms using MapReduce in industry, academia, and beyond.

There is, of course, much beyond what has been covered in this chapter. For additional

material, we refer readers to the following: Kang et al. [80] presented an approach to

estimating the diameter of large graphs using MapReduce and a library for graph min-

ing [81]; Cohen [39] discussed a number of algorithms for processing undirected graphs,

with social network analysis in mind; Rao and Yarowsky [128] described an implemen-

tation of label propagation, a standard algorithm for semi-supervised machine learning,

on graphs derived from textual data; Schatz [132] tackled the problem of DNA sequence

^11 However, maintaining globally-synchronized state may be possible with the assistance of other tools (e.g., a distributed database).


alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to for-

get that parallel graph algorithms have been studied by computer scientists for several

decades, particularly in the PRAM model [77, 60]. It is not clear, however, to what extent

well-known PRAM algorithms translate naturally into the MapReduce framework.


CHAPTER 6

EM Algorithms for Text Processing

Until the end of the 1980s, text processing systems tended to rely on large numbers

of manually written rules to analyze, annotate, and transform text input, usually in

a deterministic way. This rule-based approach can be appealing: a system’s behavior

can generally be understood and predicted precisely, and, when errors surface, they can

be corrected by writing new rules or reﬁning old ones. However, rule-based systems

suﬀer from a number of serious problems. They are brittle with respect to the natural

variation found in language, and developing systems that can deal with inputs from

diverse domains is very labor intensive. Furthermore, when these systems fail, they

often do so catastrophically, unable to oﬀer even a “best guess” as to what the desired

analysis of the input might be.

In the last 20 years, the rule-based approach has largely been abandoned in favor

of more data-driven methods, where the “rules” for processing the input are inferred

automatically from large corpora of examples, called training data. The basic strategy of

the data-driven approach is to start with a processing algorithm capable of capturing

how any instance of the kinds of inputs (e.g., sentences or emails) can relate to any

instance of the kinds of outputs that the ﬁnal system should produce (e.g., the syntactic

structure of the sentence or a classiﬁcation of the email as spam). At this stage, the

system can be thought of as having the potential to produce any output for any input,

but they are not distinguished in any way. Next, a learning algorithm is applied which

reﬁnes this process based on the training data—generally attempting to make the model

perform as well as possible at predicting the examples in the training data. The learning

process, which often involves iterative algorithms, typically consists of activities like

ranking rules, instantiating the content of rule templates, or determining parameter

settings for a given model. This is known as machine learning, an active area of research.

Data-driven approaches have turned out to have several beneﬁts over rule-based

approaches to system development. Since data-driven systems can be trained using

examples of the kind that they will eventually be used to process, they tend to deal

with the complexities found in real data more robustly than rule-based systems do.

Second, developing training data tends to be far less expensive than developing rules. For

some applications, signiﬁcant quantities of training data may even exist for independent

reasons (e.g., translations of text into multiple languages are created by authors wishing

to reach an audience speaking diﬀerent languages, not because they are generating

training data for a data-driven machine translation system). These advantages come

at the cost of systems that often behave internally quite diﬀerently than a human-


engineered system. As a result, correcting errors that the trained system makes can be

quite challenging.

Data-driven information processing systems can be constructed using a variety

of mathematical techniques, but in this chapter we focus on statistical models, which

probabilistically relate inputs from an input set X (e.g., sentences, documents, etc.), which are always observable, to annotations from a set Y, which is the space of possible annotations or analyses that the system should predict. This model may take the form of either a joint model Pr(x, y), which assigns a probability to every pair ⟨x, y⟩ ∈ X × Y, or a conditional model Pr(y|x), which assigns a probability to every y ∈ Y, given an x ∈ X. For example, to create a statistical spam detection system, we might have Y = {Spam, NotSpam} and X be the set of all possible email messages. For machine translation, X might be the set of Arabic sentences and Y the set of English sentences.^1

There are three closely related, but distinct challenges in statistical text-

processing. The ﬁrst is model selection. This entails selecting a representation of a

joint or conditional distribution over the desired X and Y. For a problem where X and Y are very small, one could imagine representing these probabilities in look-up tables.

However, for something like email classiﬁcation or machine translation, where the model

space is inﬁnite, the probabilities cannot be represented directly, and must be computed

algorithmically. As an example of such models, we introduce hidden Markov models

(HMMs), which deﬁne a joint distribution over sequences of inputs and sequences of

annotations. The second challenge is parameter estimation or learning, which involves

the application of an optimization algorithm and training criterion to select the param-

eters of the model to optimize the model’s performance (with respect to the given

training criterion) on the training data.^2

The parameters of a statistical model are

the values used to compute the probability of some event described by the model. In

this chapter we will focus on one particularly simple training criterion for parameter

estimation, maximum likelihood estimation, which says to select the parameters that

make the training data most probable under the model, and one learning algorithm

that attempts to meet this criterion, called expectation maximization (EM). The ﬁnal

challenge for statistical modeling is the problem of decoding, or, given some x, using the

model to select an annotation y. One very common strategy is to select y according to

the following criterion:

y∗ = arg max_{y ∈ Y} Pr(y|x)

^1 In this chapter, we will consider discrete models only. They tend to be sufficient for text processing, and their presentation is simpler than models with continuous densities. It should be kept in mind that the sets X and Y may still be countably infinite.

^2 We restrict our discussion in this chapter to models with finite numbers of parameters and where the learning process refers to setting those parameters. Inference in and learning of so-called nonparametric models, which have an infinite number of parameters and have become important statistical models for text processing in recent years, is beyond the scope of this chapter.


In a conditional (or direct) model, this is a straightforward search for the best y un-

der the model. In a joint model, the search is also straightforward, on account of the

deﬁnition of conditional probability:

y∗ = arg max_{y ∈ Y} Pr(y|x) = arg max_{y ∈ Y} [ Pr(x, y) / ∑_{y′} Pr(x, y′) ] = arg max_{y ∈ Y} Pr(x, y)

The speciﬁc form that the search takes will depend on how the model is represented.

Our focus in this chapter will primarily be on the second problem: learning parameters

for models, but we will touch on the third problem as well.
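With a small discrete joint model, the equivalence above can be checked directly (the toy distribution below is invented purely for illustration):

```python
# A toy joint model Pr(x, y) over two emails and two labels; since the
# denominator sum over y' does not depend on y, maximizing Pr(y|x) is the
# same as maximizing Pr(x, y).
joint = {('cheap meds', 'Spam'): 0.30, ('cheap meds', 'NotSpam'): 0.05,
         ('meeting at 3', 'Spam'): 0.05, ('meeting at 3', 'NotSpam'): 0.60}

def decode(x):
    """Return y* = arg max_y Pr(x, y) for the observed input x."""
    candidates = {y: p for (xi, y), p in joint.items() if xi == x}
    return max(candidates, key=candidates.get)

assert decode('cheap meds') == 'Spam'
assert decode('meeting at 3') == 'NotSpam'
```

For real models the candidate set is far too large to enumerate, which is why the form of the search depends on how the model is represented.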

Machine learning is often categorized as either supervised or unsupervised. Su-

pervised learning of statistical models simply means that the model parameters are

estimated from training data consisting of pairs of inputs and annotations, that is

Z = ⟨⟨x1, y1⟩, ⟨x2, y2⟩, . . .⟩ where ⟨xi, yi⟩ ∈ X × Y and yi is the gold standard (i.e., correct) annotation of xi. While supervised models often attain quite good performance, they are often uneconomical to use, since the training data requires each object that is to be classified (to pick a specific task), xi, to be paired with its correct label, yi. In many cases, these gold standard training labels must be generated by a process of expert annotation, meaning that each xi must be manually labeled by a trained individual. Even when the annotation task is quite simple for people to carry out (e.g., in

the case of spam detection), the number of potential examples that could be classiﬁed

(representing a subset of X, which may of course be infinite in size) will far exceed

the amount of data that can be annotated. As the annotation task becomes more com-

plicated (e.g., when predicting more complex structures such as sequences of labels or

when the annotation task requires specialized expertise), annotation becomes far more

challenging.

Unsupervised learning, on the other hand, requires only that the training data

consist of a representative collection of objects that should be annotated, that is

Z = ⟨x1, x2, . . .⟩ where xi ∈ X, but without any example annotations. While it may

at ﬁrst seem counterintuitive that meaningful annotations can be learned without any

examples of the desired annotations being given, the learning criteria and model struc-

ture (which crucially define the space of possible annotations Y and the process by

which annotations relate to observable inputs) make it possible to induce annotations

by relying on regularities in the unclassiﬁed training instances. While a thorough discus-

sion of unsupervised learning is beyond the scope of this book, we focus on a particular

class of algorithms—expectation maximization (EM) algorithms—that can be used to

learn the parameters of a joint model Pr(x, y) from incomplete data (i.e., data where

some of the variables in the model cannot be observed; in the case of unsupervised

learning, the yi’s are unobserved). Expectation maximization algorithms fit naturally

into the MapReduce paradigm, and are used to solve a number of problems of interest

in text processing. Furthermore, these algorithms can be quite computationally expen-


sive, since they generally require repeated evaluations of the training data. MapReduce

therefore provides an opportunity not only to scale to larger amounts of data, but also

to improve eﬃciency bottlenecks at scales where non-parallel solutions could be utilized.

This chapter is organized as follows. In Section 6.1, we describe maximum likeli-

hood estimation for statistical models, show how this is generalized to models where not

all variables are observable, and then introduce expectation maximization (EM). We

describe hidden Markov models (HMMs) in Section 6.2, a very versatile class of models

that uses EM for parameter estimation. Section 6.3 discusses how EM algorithms can be

expressed in MapReduce, and then in Section 6.4 we look at a case study of word align-

ment for statistical machine translation. Section 6.5 examines similar algorithms that

are appropriate for supervised learning tasks. This chapter concludes with a summary

and pointers to additional readings.

6.1 EXPECTATION MAXIMIZATION

Expectation maximization (EM) algorithms [49] are a family of iterative optimization

algorithms for learning probability distributions from incomplete data. They are ex-

tensively used in statistical natural language processing where one seeks to infer latent

linguistic structure from unannotated text. To name just a few applications, EM algo-

rithms have been used to ﬁnd part-of-speech sequences, constituency and dependency

trees, alignments between texts in diﬀerent languages, alignments between acoustic sig-

nals and their transcriptions, as well as for numerous other clustering and structure

discovery problems.

Expectation maximization generalizes the principle of maximum likelihood esti-

mation to the case where the values of some variables are unobserved (speciﬁcally, those

characterizing the latent structure that is sought).

6.1.1 MAXIMUM LIKELIHOOD ESTIMATION

Maximum likelihood estimation (MLE) is a criterion for fitting the parameters θ of
a statistical model to some given data x. Specifically, it says to select the parameter
settings θ* such that the likelihood of observing the training data given the model is
maximized:

    θ* = arg max_θ Pr(X = x; θ)    (6.1)

To illustrate, consider the simple marble game shown in Figure 6.1. In this game,

a marble is released at the position indicated by the black dot, and it bounces down

into one of the cups at the bottom of the board, being diverted to the left or right by

the peg (indicated by a triangle) in the center. Our task is to construct a model that

predicts which cup the ball will drop into. A “rule-based” approach might be to take

116 CHAPTER 6. EM ALGORITHMS FOR TEXT PROCESSING

Figure 6.1: A simple marble game where a released marble takes one of two possible paths.
This game can be modeled using a Bernoulli random variable with parameter p, which indicates
the probability that the marble will go to the right when it hits the peg.

exact measurements of the board and construct a physical model that we can use to

predict the behavior of the ball. Given sophisticated enough measurements, this could

certainly lead to a very accurate model. However, the construction of this model would

be quite time consuming and diﬃcult.

A statistical approach, on the other hand, might be to assume that the behavior

of the marble in this game can be modeled using a Bernoulli random variable Y with

parameter p. That is, the value of the random variable indicates whether path 0 or 1 is

taken. We also deﬁne a random variable X whose value is the label of the cup that the

marble ends up in; note that X is deterministically related to Y , so an observation of

X is equivalent to an observation of Y .

To estimate the parameter p of the statistical model of our game, we need

some training data, so we drop 10 marbles into the game which end up in cups

x = ⟨b, b, b, a, b, b, b, b, b, a⟩.

What is the maximum likelihood estimate of p given this data? By assuming

that our samples are independent and identically distributed (i.i.d.), we can write the

likelihood of our data as follows:³

    Pr(x; p) = ∏_{j=1}^{10} p^δ(x_j, a) (1 − p)^δ(x_j, b) = p^2 (1 − p)^8

Since log is a monotonically increasing function, maximizing log Pr(x; p) will give us
the desired result. We can do this by differentiating with respect to p and finding where
the resulting expression equals 0:

    d log Pr(x; p)/dp = 0
    d[2 log p + 8 log(1 − p)]/dp = 0
    2/p − 8/(1 − p) = 0

³In this equation, δ is the Kronecker delta function, which evaluates to 1 when its arguments are equal and 0
otherwise.

Solving for p yields 0.2, which is the intuitive result. Furthermore, it is straightforward
to show that in N trials where N_0 marbles followed path 0 to cup a, and N_1 marbles
followed path 1 to cup b, the maximum likelihood estimate of p is N_0/(N_0 + N_1).
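The count-based estimate can be checked numerically. The sketch below (our own illustrative check, not part of the original derivation) computes the relative-frequency estimate from the ten samples above and confirms that it coincides with the maximizer of the log likelihood found by a grid search:

```python
from math import log

# The ten observed cups from the text: two a's and eight b's.
x = list("bbbabbbbba")

# Closed-form MLE: the relative frequency of cup a (p is the probability
# attached to delta(x_j, a) in the likelihood above).
n_a, n_b = x.count("a"), x.count("b")
p_mle = n_a / (n_a + n_b)

# Numerical check: maximize log Pr(x; p) = N_a log p + N_b log(1 - p)
# over a fine grid of candidate parameter values.
def log_likelihood(p, data):
    return sum(log(p) if c == "a" else log(1.0 - p) for c in data)

p_grid = max((i / 1000 for i in range(1, 1000)),
             key=lambda p: log_likelihood(p, x))

assert abs(p_mle - 0.2) < 1e-12 and abs(p_grid - 0.2) < 1e-9
```

The grid search is redundant here, of course; its only purpose is to confirm the calculus.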

While this model only makes use of an approximation of the true physical process

at work when the marble interacts with the game board, it is an empirical question

whether the model works well enough in practice to be useful. Additionally, while a

Bernoulli trial is an extreme approximation of the physical process, if insuﬃcient re-

sources were invested in building a physical model, the approximation may perform

better than the more complicated “rule-based” model. This sort of dynamic is found

often in text processing problems: given enough data, astonishingly simple models can

outperform complex knowledge-intensive models that attempt to simulate complicated

processes.

6.1.2 A LATENT VARIABLE MARBLE GAME

To see where latent variables might come into play in modeling, consider a more com-

plicated variant of our marble game shown in Figure 6.2. This version consists of three

pegs that inﬂuence the marble’s path, and the marble may end up in one of three cups.

Note that both paths 1 and 2 lead to cup b.

To construct a statistical model of this game, we again assume that the behavior of
a marble interacting with a peg can be modeled with a Bernoulli random variable. Since
there are three pegs, we have three random variables with parameters θ = ⟨p_0, p_1, p_2⟩,
corresponding to the probabilities that the marble will go to the right at the top,
left, and right pegs. We further define a random variable X taking on values from
{a, b, c} indicating what cup the marble ends in, and Y, taking on values from {0, 1, 2, 3}
indicating which path was taken. Note that the full joint distribution Pr(X = x, Y = y)
is determined by θ.

How should the parameters θ be estimated? If it were possible to observe the

paths taken by marbles as they were dropped into the game, it would be trivial to

estimate the parameters for our model using the maximum likelihood estimator—we

would simply need to count the number of times the marble bounced left or right at

each peg. If N_x counts the number of times a marble took path x in N trials, this is:

Figure 6.2: A more complicated marble game where the released marble takes one of four
possible paths. We assume that we can only observe which cup the marble ends up in, not the
specific path taken.

    p_0 = (N_2 + N_3)/N        p_1 = N_1/(N_0 + N_1)        p_2 = N_3/(N_2 + N_3)
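With fully observed paths, these estimators are just ratios of counts. A minimal sketch, assuming the path numbering of Figure 6.2 (paths 2 and 3 go right at the top peg, path 1 goes right at the left peg, path 3 goes right at the right peg); the sample path list is invented for illustration:

```python
from collections import Counter

def observed_mle(paths):
    """Fully observed MLE for the Figure 6.2 game from a list of
    path labels in {0, 1, 2, 3}."""
    n = Counter(paths)
    N = len(paths)
    p0 = (n[2] + n[3]) / N          # went right at the top peg
    p1 = n[1] / (n[0] + n[1])       # went right at the left peg
    p2 = n[3] / (n[2] + n[3])       # went right at the right peg
    return p0, p1, p2

# Hypothetical observed paths for eight marbles.
p0, p1, p2 = observed_mle([0, 1, 1, 2, 3, 3, 3, 0])
```

Note that `Counter` returns 0 for unseen path labels, which matches the count semantics above.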

However, we wish to consider the case where the paths taken are unobservable (imagine

an opaque sheet covering the center of the game board), but where we can see what cup a

marble ends in. In other words, we want to consider the case where we have partial data.

This is exactly the problem encountered in unsupervised learning: there is a statistical

model describing the relationship between two sets of variables (X’s and Y ’s), and there

is data available from just one of them. Furthermore, such algorithms are quite useful in

text processing, where latent variables may describe latent linguistic structures of the

observed variables, such as parse trees or part-of-speech tags, or alignment structures

relating sets of observed variables (see Section 6.4).

6.1.3 MLE WITH LATENT VARIABLES

Formally, we consider the problem of estimating parameters for statistical models of

the form Pr(X, Y ; θ) which describe not only an observable variable X but a latent, or

hidden, variable Y .

In these models, since only the values of the random variable X are observable,
we define our optimization criterion to be the maximization of the marginal likelihood,
that is, summing over all settings of the latent variable Y, which takes on values from
a set designated 𝒴.⁴ Again, we assume that samples in the training data x are i.i.d.:

    Pr(X = x) = Σ_{y∈𝒴} Pr(X = x, Y = y; θ)

⁴For this description, we assume that the variables in our model take on discrete values. Not only does this
simplify exposition, but discrete models are widely used in text processing.

For a vector of training observations x = ⟨x_1, x_2, . . . , x_{|x|}⟩, if we assume the samples are
i.i.d.:

    Pr(x; θ) = ∏_{j=1}^{|x|} Σ_{y∈𝒴} Pr(X = x_j, Y = y; θ)

Thus, the maximum (marginal) likelihood estimate of the model parameters θ* given a
vector of i.i.d. observations x becomes:

    θ* = arg max_θ ∏_{j=1}^{|x|} Σ_{y∈𝒴} Pr(X = x_j, Y = y; θ)

Unfortunately, in many cases, this maximum cannot be computed analytically, but the

iterative hill-climbing approach of expectation maximization can be used instead.
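For a model as small as the marble game, the marginal likelihood itself can be computed by brute-force summation over the latent paths. In this sketch the joint probabilities are the ones implied by the game's structure (path 0: left, left; path 1: left, right; path 2: right, left; path 3: right, right); the observation list and parameter values are illustrative:

```python
from math import prod

def joint(x, y, theta):
    """Pr(X = x, Y = y; theta) for the Figure 6.2 game."""
    p0, p1, p2 = theta
    table = {
        ("a", 0): (1 - p0) * (1 - p1),  # left at top peg, left at left peg
        ("b", 1): (1 - p0) * p1,        # left at top peg, right at left peg
        ("b", 2): p0 * (1 - p2),        # right at top peg, left at right peg
        ("c", 3): p0 * p2,              # right at top peg, right at right peg
    }
    return table.get((x, y), 0.0)       # all other (x, y) pairs are impossible

def marginal_likelihood(xs, theta):
    # Pr(x; theta) = prod_j sum_y Pr(X = x_j, Y = y; theta)
    return prod(sum(joint(x, y, theta) for y in range(4)) for x in xs)

p = marginal_likelihood(["a", "b", "c"], (0.5, 0.5, 0.5))
```

The difficulty is not evaluating this quantity but maximizing it over θ, which is what motivates EM.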

6.1.4 EXPECTATION MAXIMIZATION

Expectation maximization (EM) is an iterative algorithm that finds a successive series
of parameter estimates θ^(0), θ^(1), . . . that improve the marginal likelihood of the training
data. That is, EM guarantees:

    ∏_{j=1}^{|x|} Σ_{y∈𝒴} Pr(X = x_j, Y = y; θ^(i+1)) ≥ ∏_{j=1}^{|x|} Σ_{y∈𝒴} Pr(X = x_j, Y = y; θ^(i))

The algorithm starts with some initial set of parameters θ^(0) and then updates them
using two steps: expectation (E-step), which computes the posterior distribution over
the latent variables given the observable data x and a set of parameters θ^(i),⁵ and
maximization (M-step), which computes new parameters θ^(i+1) maximizing the expected
log likelihood of the joint distribution with respect to the distribution computed in the
E-step. The process then repeats with these new parameters. The algorithm terminates
when the likelihood remains unchanged.⁶ In more detail, the steps are as follows:

⁵The term ‘expectation’ is used since the values computed in terms of the posterior distribution Pr(y|x; θ^(i))
that are required to solve the M-step have the form of an expectation (with respect to this distribution).
⁶The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be
the global maximum.


E-step. Compute the posterior probability of each possible hidden variable assignment
y ∈ 𝒴 for each x ∈ 𝒳 and the current parameter settings, weighted by the relative
frequency with which x occurs in x. Call this q(X = x, Y = y; θ^(i)) and note that
it defines a joint probability distribution over 𝒳 × 𝒴 in that Σ_{(x,y)∈𝒳×𝒴} q(x, y) = 1.

    q(x, y; θ^(i)) = f(x|x) · Pr(Y = y | X = x; θ^(i)) = f(x|x) · Pr(x, y; θ^(i)) / Σ_{y′} Pr(x, y′; θ^(i))

M-step. Compute new parameter settings that maximize the expected log of the
probability of the joint distribution under the q-distribution that was computed in the
E-step:

    θ^(i+1) = arg max_{θ′} E_{q(X=x, Y=y; θ^(i))} [ log Pr(X = x, Y = y; θ′) ]
            = arg max_{θ′} Σ_{(x,y)∈𝒳×𝒴} q(X = x, Y = y; θ^(i)) · log Pr(X = x, Y = y; θ′)

We omit the proof that the model with parameters θ^(i+1) will have equal or greater
marginal likelihood on the training data than the model with parameters θ^(i), but this
is provably true [78].

Before continuing, we note that the eﬀective application of expectation maximiza-

tion requires that both the E-step and the M-step consist of tractable computations.

Speciﬁcally, summing over the space of hidden variable assignments must not be in-

tractable. Depending on the independence assumptions made in the model, this may

be achieved through techniques such as dynamic programming. However, some models

may require intractable computations.

6.1.5 AN EM EXAMPLE

Let’s look at how to estimate the parameters from our latent variable marble game from
Section 6.1.2 using EM. We assume training data x consisting of N = |x| observations of
X, with N_a, N_b, and N_c indicating the number of marbles ending in cups a, b, and c. We
start with some parameters θ^(0) = ⟨p_0^(0), p_1^(0), p_2^(0)⟩ that have been randomly initialized
to values between 0 and 1.

E-step. We need to compute the distribution q(X = x, Y = y; θ^(i)), as defined above.
We first note that the relative frequency f(x|x) is:

    f(x|x) = N_x / N


Next, we observe that Pr(Y = 0 | X = a) = 1 and Pr(Y = 3 | X = c) = 1, since cups a and
c fully determine the value of the path variable Y. The posterior probabilities of paths 1
and 2 are only non-zero when X is b:

    Pr(1 | b; θ^(i)) = (1 − p_0^(i)) p_1^(i) / [ (1 − p_0^(i)) p_1^(i) + p_0^(i) (1 − p_2^(i)) ]

    Pr(2 | b; θ^(i)) = p_0^(i) (1 − p_2^(i)) / [ (1 − p_0^(i)) p_1^(i) + p_0^(i) (1 − p_2^(i)) ]

Except for the four cases just described, Pr(Y = y | X = x) is zero for all other values of
x and y (regardless of the value of the parameters).

M-step. We now need to maximize the expectation of log Pr(X, Y; θ′) (which will
be a function in terms of the three parameter variables) under the q-distribution we
computed in the E-step. The non-zero terms in the expectation are as follows:

    x   y   q(X = x, Y = y; θ^(i))         log Pr(X = x, Y = y; θ′)
    a   0   N_a/N                          log(1 − p′_0) + log(1 − p′_1)
    b   1   N_b/N · Pr(1|b; θ^(i))         log(1 − p′_0) + log p′_1
    b   2   N_b/N · Pr(2|b; θ^(i))         log p′_0 + log(1 − p′_2)
    c   3   N_c/N                          log p′_0 + log p′_2

Multiplying across each row and adding from top to bottom yields the expectation we
wish to maximize. Each parameter can be optimized independently using differentiation.
The resulting optimal values are expressed in terms of the counts in x and θ^(i):

    p_0 = ( Pr(2|b; θ^(i)) · N_b + N_c ) / N

    p_1 = Pr(1|b; θ^(i)) · N_b / ( N_a + Pr(1|b; θ^(i)) · N_b )

    p_2 = N_c / ( Pr(2|b; θ^(i)) · N_b + N_c )

It is worth noting that the form of these expressions is quite similar to the fully observed
maximum likelihood estimate. However, rather than depending on exact path counts,
the statistics used are the expected path counts, given x and parameters θ^(i).

Typically, the values computed at the end of the M-step would serve as new
parameters for another iteration of EM. However, the example we have presented here
is quite simple, and the model converges to a global optimum after a single iteration. For
most models, EM requires several iterations to converge, and it may not find a global
optimum. And since EM only finds a locally optimal solution, the final parameter values
depend on the values chosen for θ^(0).
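The derivation above translates directly into code. This is our own illustrative implementation (the cup counts and starting parameters are invented); each call performs one E-step/M-step pair for the Figure 6.2 game:

```python
def em_step(theta, n_a, n_b, n_c):
    """One EM update for the latent marble game; theta = (p0, p1, p2)."""
    p0, p1, p2 = theta
    N = n_a + n_b + n_c
    # E-step: posterior over the two hidden paths that reach cup b.
    pr1 = (1 - p0) * p1        # path 1, unnormalized
    pr2 = p0 * (1 - p2)        # path 2, unnormalized
    q1, q2 = pr1 / (pr1 + pr2), pr2 / (pr1 + pr2)
    # M-step: the closed-form maximizers derived above, with the exact
    # path counts replaced by the expected counts q1*n_b and q2*n_b.
    return ((q2 * n_b + n_c) / N,
            (q1 * n_b) / (n_a + q1 * n_b),
            n_c / (q2 * n_b + n_c))

theta = em_step((0.5, 0.5, 0.5), 2, 5, 3)
```

For a general model one would iterate `em_step` until the likelihood stops changing; here a single update already reaches the optimum, as noted above.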

6.2 HIDDEN MARKOV MODELS

To give a more substantial and useful example of models whose parameters may be

estimated using EM, we turn to hidden Markov models (HMMs). HMMs are models of


data that are ordered sequentially (temporally, from left to right, etc.), such as words

in a sentence, base pairs in a gene, or letters in a word. These simple but powerful

models have been used in applications as diverse as speech recognition [78], information

extraction [139], gene ﬁnding [143], part of speech tagging [44], stock market forecasting

[70], text retrieval [108], and word alignment of parallel (translated) texts [150] (more

in Section 6.4).

In an HMM, the data being modeled is posited to have been generated from an

underlying Markov process, which is a stochastic process consisting of a ﬁnite set of

states where the probability of entering a state at time t + 1 depends only on the state

of the process at time t [130]. Alternatively, one can view a Markov process as a prob-

abilistic variant of a ﬁnite state machine, where transitions are taken probabilistically.

As another point of comparison, the PageRank algorithm considered in the previous

chapter (Section 5.3) can be understood as a Markov process: the probability of follow-

ing any link on a particular page is independent of the path taken to reach that page.

The states of this Markov process are, however, not directly observable (i.e., hidden).

Instead, at each time step, an observable token (e.g., a word, base pair, or letter) is

emitted according to a probability distribution conditioned on the identity of the state

that the underlying process is in.

A hidden Markov model ℳ is defined as a tuple ⟨S, O, θ⟩. S is a finite set of states,
which generate symbols from a finite observation vocabulary O. Following convention,
we assume that variables q, r, and s refer to states in S, and o refers to symbols in
the observation vocabulary O. This model is parameterized by the tuple θ = ⟨A, B, π⟩,
consisting of an |S| × |S| matrix A of transition probabilities, where A_q(r) gives the
probability of transitioning from state q to state r; an |S| × |O| matrix B of emission
probabilities, where B_q(o) gives the probability that symbol o will be emitted from
state q; and an |S|-dimensional vector π, where π_q is the probability that the process
starts in state q.⁷ These matrices may be dense, but for many applications sparse
parameterizations are useful. We further stipulate that A_q(r) ≥ 0, B_q(o) ≥ 0, and π_q ≥ 0
for all q, r, and o, as well as that:

    Σ_{r∈S} A_q(r) = 1 ∀q        Σ_{o∈O} B_q(o) = 1 ∀q        Σ_{q∈S} π_q = 1

A sequence of observations of length τ is generated as follows:

Step 0, let t = 1 and select an initial state q according to the distribution π.
Step 1, an observation symbol from O is emitted according to the distribution B_q.
Step 2, a new q is drawn according to the distribution A_q.
Step 3, t is incremented, and if t ≤ τ, the process repeats from Step 1.

⁷This is only one possible definition of an HMM, but it is one that is useful for many text processing problems. In
alternative definitions, initial and final states may be handled differently, observations may be emitted during
the transition between states, or continuous-valued observations may be emitted (for example, from a Gaussian
distribution).

Since all events generated by this process are conditionally independent, the joint prob-

ability of this sequence of observations and the state sequence used to generate it is the

product of the individual event probabilities.
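The generative process in Steps 0–3 is easy to simulate. A minimal sketch with an invented two-state model (the state and symbol names are placeholders, not the Figure 6.3 tagger):

```python
import random

def sample_hmm(pi, A, B, tau, rng):
    """Draw an observation sequence of length tau from an HMM given as
    dicts: pi[q], A[q][r], B[q][o]."""
    def draw(dist):
        # Inverse-CDF sampling from a dict of outcome -> probability.
        r, acc = rng.random(), 0.0
        for item, p in dist.items():
            acc += p
            if r < acc:
                return item
        return item  # guard against floating-point rounding

    q = draw(pi)                 # Step 0: initial state from pi
    out = []
    for _ in range(tau):
        out.append(draw(B[q]))   # Step 1: emit a symbol from state q
        q = draw(A[q])           # Step 2: transition to the next state
    return out                   # Step 3 is the loop itself

pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}
obs = sample_hmm(pi, A, B, 5, random.Random(0))
```

The joint probability of the emitted sequence and the (discarded) state sequence is exactly the product of the individual draw probabilities, as stated above.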

Figure 6.3 shows a simple example of a hidden Markov model for part-of-
speech tagging, which is the task of assigning to each word in an input sentence
its grammatical category (one of the first steps in analyzing textual content). States
S = {det, adj, nn, v} correspond to the parts of speech (determiner, adjective, noun,
and verb), and observations O = {the, a, green, . . .} are a subset of English words. This
example illustrates a key intuition behind many applications of HMMs: states corre-
spond to equivalence classes or clusterings of observations, and a single observation type
may be associated with several clusters (in this example, the word wash can be generated
by an nn or a v, since wash can be either a noun or a verb).

6.2.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:⁸

1. Given a model ℳ = ⟨S, O, θ⟩ and an observation sequence of symbols from O,
x = ⟨x_1, x_2, . . . , x_τ⟩, what is the probability that ℳ generated the data (summing
over all possible state sequences, 𝒴)?

    Pr(x) = Σ_{y∈𝒴} Pr(x, y; θ)

2. Given a model ℳ = ⟨S, O, θ⟩ and an observation sequence x, what is the most
likely sequence of states that generated the data?

    y* = arg max_{y∈𝒴} Pr(x, y; θ)

3. Given a set of states S, an observation vocabulary O, and a series of i.i.d.
observation sequences ⟨x_1, x_2, . . .⟩, what are the parameters θ = ⟨A, B, π⟩ that
maximize the likelihood of the training data?

    θ* = arg max_θ ∏_i Σ_{y∈𝒴} Pr(x_i, y; θ)

Using our definition of an HMM, the answers to the first two questions are in principle
quite trivial to compute: by iterating over all state sequences 𝒴, the probability that

⁸The organization of this section is based in part on ideas from Lawrence Rabiner’s HMM tutorial [125].

Figure 6.3: An example HMM that relates part-of-speech tags to vocabulary items in an
English-like language. Possible (probability > 0) transitions for the Markov process are shown
graphically. In the example outputs, the state sequences corresponding to the emissions are
written beneath the emitted symbols. [The figure tabulates the initial probabilities (det 0.5,
adj 0.3, nn 0.1, v 0.1), the transition probabilities, and the per-state emission probabilities;
the example outputs include “John might wash” tagged nn v v.]


each generated x can be computed by looking up and multiplying the relevant prob-
abilities in A, B, and π, and then summing the result or taking the maximum. And,
as we hinted at in the previous section, the third question can be answered using EM.
Unfortunately, even with all the distributed computing power MapReduce makes avail-
able, we will quickly run into trouble if we try to use this naïve strategy, since there are
|S|^τ distinct state sequences of length τ, making exhaustive enumeration computation-
ally intractable. Fortunately, because the underlying model behaves exactly the same
whenever it is in some state, regardless of how it got to that state, we can use dynamic
programming algorithms to answer all of the above questions without summing over
exponentially many sequences.

6.2.2 THE FORWARD ALGORITHM

Given some observation sequence, for example x = ⟨John, might, wash⟩, Question 1 asks
what is the probability that this sequence was generated by an HMM ℳ = ⟨S, O, θ⟩.
For the purposes of illustration, we assume that ℳ is defined as shown in Figure 6.3.

There are two ways to compute the probability of x having been generated by
ℳ. The first is to compute the sum over the joint probability of x and every possible
labeling y′ ∈ {⟨det, det, det⟩, ⟨det, det, nn⟩, ⟨det, det, v⟩, . . .}. As indicated above,
this is not feasible for most sequences, since the set of possible labels is exponential in
the length of x. The second, fortunately, is much more efficient.

We can make use of what is known as the forward algorithm to compute the
desired probability in polynomial time. We assume a model ℳ = ⟨S, O, θ⟩ as defined
above. This algorithm works by recursively computing the answer to a related question:
what is the probability that the process is in state q at time t and has generated
⟨x_1, x_2, . . . , x_t⟩? Call this probability α_t(q). Thus, α_t(q) is a two-dimensional matrix (of
size |x| × |S|), called a trellis. It is easy to see that the values of α_1(q) can be computed
as the product of two independent probabilities: the probability of starting in state q
and the probability of state q generating x_1:

    α_1(q) = π_q · B_q(x_1)

From this, it’s not hard to see that the values of α_2(r) for every r can be computed in
terms of the |S| values in α_1(·) and the observation x_2:

    α_2(r) = B_r(x_2) · Σ_{q∈S} α_1(q) · A_q(r)

This works because there are |S| different ways to get to state r at time t = 2: starting
from state 1, 2, . . . , |S| and transitioning to state r. Furthermore, because the behavior


of a Markov process is determined only by the state it is in at some time (not by how
it got to that state), α_t(r) can always be computed in terms of the |S| values in α_{t−1}(·)
and the observation x_t:

    α_t(r) = B_r(x_t) · Σ_{q∈S} α_{t−1}(q) · A_q(r)

We have now shown how to compute the probability of being in any state q at any time
t, having generated ⟨x_1, x_2, . . . , x_t⟩, with the forward algorithm. The probability of the
full sequence is the probability of being at time |x| in any state, so the answer to
Question 1 can be computed simply by summing over α values at time |x| for all states:

    Pr(x; θ) = Σ_{q∈S} α_{|x|}(q)

In summary, there are two ways of computing the probability that a sequence of obser-
vations x was generated by ℳ: exhaustive enumeration with summing, and the forward
algorithm. Figure 6.4 illustrates the two possibilities. The upper panel shows the naïve
exhaustive approach, enumerating all 4^3 possible labels y′ of x and computing their joint
probability Pr(x, y′). Summing over all y′, the marginal probability of x is found to be
0.00018. The lower panel shows the forward trellis, consisting of 4 × 3 cells. Summing
over the final column also yields 0.00018, the same result.
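The forward recursion translates almost line-for-line into code. The sketch below uses an invented two-state model (not the Figure 6.3 HMM) and, mirroring the comparison in Figure 6.4, checks the forward result against exhaustive enumeration:

```python
import itertools

def forward(x, states, pi, A, B):
    """Pr(x; theta) via the forward trellis alpha[t][q]."""
    alpha = [{q: pi[q] * B[q][x[0]] for q in states}]      # alpha_1
    for t in range(1, len(x)):
        alpha.append({r: B[r][x[t]] * sum(alpha[t - 1][q] * A[q][r]
                                          for q in states)
                      for r in states})
    return sum(alpha[-1][q] for q in states)               # sum final column

states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}
x = ["x", "y", "x"]

# Exhaustive enumeration over all |S|^tau state sequences, for comparison.
def joint(y):
    p = pi[y[0]] * B[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1]][y[t]] * B[y[t]][x[t]]
    return p

brute = sum(joint(y) for y in itertools.product(states, repeat=len(x)))
assert abs(forward(x, states, pi, A, B) - brute) < 1e-12
```

The forward pass touches O(|x| · |S|^2) cells, while the enumeration touches |S|^|x| sequences, which is the efficiency gap described above.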

6.2.3 THE VITERBI ALGORITHM

Given an observation sequence x, the second question we might want to ask of ℳ is:
what is the most likely sequence of states that generated the observations? As with the
previous question, the naïve approach to solving this problem is to enumerate all possible
labels and find the one with the highest joint probability. Continuing with the example
observation sequence x = ⟨John, might, wash⟩, examining the chart of probabilities in
the upper panel of Figure 6.4 shows that y* = ⟨nn, v, v⟩ is the most likely sequence of
states under our example HMM.

However, a more efficient answer to Question 2 can be computed using the same
intuition as in the forward algorithm: determine the best state sequence for a short se-
quence and extend this to easily compute the best sequence for longer ones. This is
known as the Viterbi algorithm. We define γ_t(q), the Viterbi probability, to be the probability
of the most probable sequence of states ending in state q at time t and generating observations
⟨x_1, x_2, . . . , x_t⟩. Since we wish to be able to reconstruct the sequence of states, we define
bp_t(q), the “backpointer”, to be the state used in this sequence at time t − 1. The base
case for the recursion is as follows (the state index of −1 is used as a placeholder since
there is no previous best state at time t = 1):

    γ_1(q) = π_q · B_q(x_1)        bp_1(q) = −1

[Figure 6.4: the two ways of computing Pr(x) for x = ⟨John, might, wash⟩. The upper panel
enumerates all 4^3 labelings with their joint probabilities p(x, y); only four are non-zero
(0.000021, 0.000009, 0.00006, and 0.00009), summing to 0.00018. The lower panel shows the
4 × 3 forward trellis over states det, adj, nn, v; its final column (0.000081 and 0.000099)
also sums to 0.00018.]
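A corresponding sketch of the Viterbi recursion with backpointers, again on an invented two-state model (here `None` stands in for the −1 placeholder at t = 1):

```python
def viterbi(x, states, pi, A, B):
    """Most likely state sequence for x; gamma[t][q] is the Viterbi
    probability and bp[t][q] the backpointer."""
    gamma = [{q: pi[q] * B[q][x[0]] for q in states}]
    bp = [{q: None for q in states}]          # no predecessor at t = 1
    for t in range(1, len(x)):
        g, b = {}, {}
        for r in states:
            # Best predecessor: the max replaces the forward sum.
            prev = max(states, key=lambda q: gamma[t - 1][q] * A[q][r])
            g[r] = gamma[t - 1][prev] * A[prev][r] * B[r][x[t]]
            b[r] = prev
        gamma.append(g)
        bp.append(b)
    # Reconstruct the path by following backpointers from the best end state.
    q = max(states, key=lambda s: gamma[-1][s])
    path = [q]
    for t in range(len(x) - 1, 0, -1):
        q = bp[t][q]
        path.append(q)
    return path[::-1]

states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}
best = viterbi(["x", "y", "x"], states, pi, A, B)
```

The only change from the forward algorithm is that the sum over predecessors becomes a max, plus the bookkeeping needed to recover the argmax path.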

38 CHAPTER 2. EM ALGORITHMS FOR TEXT PROCESSING

A sequence of observations of length τ is generated as follows. Step 0, let t = 0 and select

an initial state i according to the distribution π. Step 1, an observation symbol from O

is emitted according to the distribution B

i

. Step 2, a new i is drawn according to A

i

.

Step 3, t is incremented, if t < τ, the process repeats from Step 1. Since all events used

in this process are conditionally independent, the joint probability of this sequence of

observations and the state sequence used to generate it is the product of the individual

event probabilities.

Figure 2.3 shows an example of a hidden Markov model for part of speech tagging.

States S = {det, adj, nn, v} correspond to the parts of speech, and observations O =

{the, a, green, . . .} are a subset of the English words. This example illustrates a key

intuition behind many applications of HMMs: states correspond to equivalence classes

or clustering of observations, and a single observation type may associated with several

clusters.

2.4.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:

7

1. Given a model M = S, O, θ, and an observation sequence of symbols from O,

x = x

1

, x

2

, . . . , x

τ

, what is the probability that M generated the data, summing

over all possible state sequences, Y?

Pr(x) =

y∈Y

Pr(x, y; θ) = 0.00018

2. Given a model M = S, O, θ and an observation sequence x, what is the most

likely sequence of states that generated the data?

y

∗

= arg max

y∈Y

Pr(x, y; θ)

3. Given a set of states S, an observation vocabulary O, and a series of i.i.d.

observation sequences x

1

, x

2

, . . . , x

**, what are the parameters θ = A, B, π that
**

maximize the likelihood of the training data?

θ

∗

= arg max

θ

i=1

y∈Y

Pr(x

i

, y; θ)

Given our deﬁnition of an HMM, the answers to the ﬁrst two questions are in principle

quite trivial to compute: by iterating over all state sequences Y, the probability that each

generated x can be computed by looking up and multiplying the relevant probabilities

7

The organization of this section is based in part on ideas from Lawrence Rabiner’s HMM tutorial [64].

38 CHAPTER 2. EM ALGORITHMS FOR TEXT PROCESSING

A sequence of observations of length τ is generated as follows. Step 0, let t = 0 and select

an initial state i according to the distribution π. Step 1, an observation symbol from O

is emitted according to the distribution B

i

. Step 2, a new i is drawn according to A

i

.

Step 3, t is incremented, if t < τ, the process repeats from Step 1. Since all events used

in this process are conditionally independent, the joint probability of this sequence of

observations and the state sequence used to generate it is the product of the individual

event probabilities.

Figure 2.3 shows an example of a hidden Markov model for part of speech tagging.

States S = {det, adj, nn, v} correspond to the parts of speech, and observations O =

{the, a, green, . . .} are a subset of the English words. This example illustrates a key

intuition behind many applications of HMMs: states correspond to equivalence classes

or clustering of observations, and a single observation type may associated with several

clusters.

2.4.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:

7

1. Given a model M = S, O, θ, and an observation sequence of symbols from O,

x = x

1

, x

2

, . . . , x

τ

, what is the probability that M generated the data, summing

over all possible state sequences, Y?

Pr(x) =

y∈Y

Pr(x, y; θ)

Pr(x) =

q∈S

α

3

(q) = 0.00018

2. Given a model M = S, O, θ and an observation sequence x, what is the most

likely sequence of states that generated the data?

y

∗

= arg max

y∈Y

Pr(x, y; θ)

3. Given a set of states S, an observation vocabulary O, and a series of i.i.d.

observation sequences x

1

, x

2

, . . . , x

**, what are the parameters θ = A, B, π that
**

maximize the likelihood of the training data?

θ

∗

= arg max

θ

i=1

y∈Y

Pr(x

i

, y; θ)

7

The organization of this section is based in part on ideas from Lawrence Rabiner’s HMM tutorial [64].

38 CHAPTER 2. EM ALGORITHMS FOR TEXT PROCESSING

A sequence of observations of length τ is generated as follows. Step 0, let t = 0 and select

an initial state i according to the distribution π. Step 1, an observation symbol from O

is emitted according to the distribution B

i

. Step 2, a new i is drawn according to A

i

.

Step 3, t is incremented, if t < τ, the process repeats from Step 1. Since all events used

in this process are conditionally independent, the joint probability of this sequence of

observations and the state sequence used to generate it is the product of the individual

event probabilities.

Figure 2.3 shows an example of a hidden Markov model for part of speech tagging.

States S = {det, adj, nn, v} correspond to the parts of speech, and observations O =

{the, a, green, . . .} are a subset of the English words. This example illustrates a key

intuition behind many applications of HMMs: states correspond to equivalence classes

or clustering of observations, and a single observation type may associated with several

clusters.

2.4.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:

7

1. Given a model M = ⟨S, O, θ⟩ and an observation sequence of symbols from O, x = ⟨x_1, x_2, . . . , x_τ⟩, what is the probability that M generated the data, summing over all possible state sequences, Y?

Pr(x) = Σ_{y ∈ Y} Pr(x, y; θ)


2. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the most likely sequence of states that generated the data?

y* = arg max_{y ∈ Y} Pr(x, y; θ)

3. Given a set of states S, an observation vocabulary O, and a series of i.i.d. observation sequences ⟨x_1, x_2, . . . , x_ℓ⟩, what are the parameters θ = ⟨A, B, π⟩ that maximize the likelihood of the training data?

θ* = arg max_θ ∏_{i=1}^{ℓ} Σ_{y ∈ Y} Pr(x_i, y; θ)

Given our definition of an HMM, the answers to the first two questions are in principle quite trivial to compute: by iterating over all state sequences Y, the probability that each sequence generated the observed data can be computed, and the results can then be summed (for Question 1) or maximized over (for Question 2). Since the number of state sequences grows exponentially in the length of x, however, this brute-force strategy is only practical for very short sequences.

[7] The organization of this section is based in part on ideas from Lawrence Rabiner's HMM tutorial [64].
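To make Questions 1 and 2 concrete, the brute-force approach can be sketched in a few lines of Python. The two-state model below is invented for illustration (it is not the HMM of Figure 6.3): the sketch enumerates every state sequence, sums the joint probabilities to answer Question 1, and maximizes to answer Question 2.

```python
from itertools import product

# Hypothetical two-state HMM; all parameters are invented for illustration.
S = ["nn", "v"]                                        # states
pi = {"nn": 0.8, "v": 0.2}                             # initial distribution
A = {"nn": {"nn": 0.3, "v": 0.7},                      # transition distributions A_q(r)
     "v":  {"nn": 0.4, "v": 0.6}}
B = {"nn": {"John": 0.7, "might": 0.2, "wash": 0.1},   # emission distributions B_q(o)
     "v":  {"John": 0.1, "might": 0.5, "wash": 0.4}}

def joint(x, y):
    """Pr(x, y; theta): the product of the individual event probabilities."""
    p = pi[y[0]] * B[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1]][y[t]] * B[y[t]][x[t]]
    return p

def enumerate_answers(x):
    """Answer Questions 1 and 2 by brute force over all |S|^|x| state sequences."""
    seqs = list(product(S, repeat=len(x)))
    marginal = sum(joint(x, y) for y in seqs)          # Question 1
    best = max(seqs, key=lambda y: joint(x, y))        # Question 2
    return marginal, best

marginal, best = enumerate_answers(("John", "might", "wash"))
```

Since the model is generative, summing the marginal over every possible observation sequence of a fixed length yields exactly 1, which is a useful sanity check on any implementation.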

Figure 6.4: Computing the probability of the sequence ⟨John, might, wash⟩ under the HMM given in Figure 6.3 by explicitly summing over all possible sequence labels (upper panel) and using the forward algorithm (lower panel).

128 CHAPTER 6. EM ALGORITHMS FOR TEXT PROCESSING


We have now shown how to compute the probability of being in any state q at any time t using the forward algorithm. The probability of the full sequence is the probability of having generated all of x and ending in any state, so the answer to Question 1 can be computed simply by summing over the α values at time |x| for all states:

Pr(x; θ) = Σ_{q=1}^{n} α_{|x|}(q)

Figure 6.4 illustrates the two techniques for computing the forward probability. The upper panel shows the naive approach, enumerating all 4³ possible labels y of x and computing their joint probability Pr(x, y). Summing over all y, the marginal probability of x is found to be 0.00018. The lower panel shows the forward trellis, consisting of 4 × 3 cells. Summing over the final column also yields 0.00018.
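The forward recursion itself can be sketched in a few lines of Python. The model below is a hypothetical two-state HMM (invented parameters, not the four-state tagger of Figure 6.3), but the invariant is the same one illustrated in Figure 6.4: summing the α values in the final column gives the same marginal as brute-force enumeration.

```python
from itertools import product

# Hypothetical two-state HMM (parameters invented for illustration).
S = ["nn", "v"]
pi = {"nn": 0.8, "v": 0.2}
A = {"nn": {"nn": 0.3, "v": 0.7}, "v": {"nn": 0.4, "v": 0.6}}
B = {"nn": {"John": 0.7, "might": 0.2, "wash": 0.1},
     "v":  {"John": 0.1, "might": 0.5, "wash": 0.4}}

def forward(x):
    """alpha[t][q]: probability of emitting x[0..t] and being in state q at step t."""
    alpha = [{q: pi[q] * B[q][x[0]] for q in S}]        # base case
    for t in range(1, len(x)):
        alpha.append({r: sum(alpha[t - 1][q] * A[q][r] for q in S) * B[r][x[t]]
                      for r in S})
    return alpha

def marginal(x):
    """Question 1: Pr(x; theta) = sum over the final-column alpha values."""
    return sum(forward(x)[-1].values())
```

The trellis has |S| × |x| cells and each cell sums over |S| predecessors, so the cost is O(|S|² |x|) instead of the O(|S|^|x|) of enumeration.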

6.2.3 THE VITERBI ALGORITHM

If we have an observation sequence and wish to find the most probable label under a model, the naive approach to solving this problem is to enumerate all possible labels and find the one with the highest joint probability. For example, examining the chart of probabilities in Figure 6.4 shows that ⟨nn, v, v⟩ is the most likely sequence of states for the observation sequence ⟨John, might, wash⟩ under our example HMM.

However, a more efficient answer to Question 2 can be computed using the same intuition as we used in the forward algorithm: determine the best state sequence for a short sequence and extend this to easily compute the best sequence for longer ones. This is known as the Viterbi algorithm. We define γ_t(q), the Viterbi probability, to be the probability of the most probable sequence of states ending in state q at time t and generating observations x_1, x_2, . . . , x_t. Since we wish to be able to reconstruct the sequence of states, we define bp_t(q), the "back pointer", to be the state used in this sequence at time t − 1. The base case for the recursion is as follows (the state index of −1 is used as a place-holder since there is no previous best state at time 1):

γ_1(q) = π_q · B_q(x_1)
bp_1(q) = −1

The recursion is similar to that of the forward algorithm, except rather than summing

over previous states, the maximum value of all possible trajectories into state r at time

t is computed. Note that the back-pointer just records the index of the originating state

– a separate computation is not necessary.


Figure 6.5: Computing the most likely state sequence that generated ⟨John, might, wash⟩ under the HMM given in Figure 6.3 using the Viterbi algorithm. The most likely state sequence is highlighted in bold and could be recovered programmatically by following backpointers from the maximal probability cell in the last column to the first column (thicker arrows).


γ_t(r) = max_{q ∈ S} γ_{t−1}(q) · A_q(r) · B_r(x_t)
bp_t(r) = arg max_{q ∈ S} γ_{t−1}(q) · A_q(r) · B_r(x_t)

To compute the best sequence of states, y*, the state with the highest probability path at time |x| is selected, and then the backpointers are followed, recursively, to construct the rest of the sequence:

y*_{|x|} = arg max_{q ∈ S} γ_{|x|}(q)
y*_{t−1} = bp_t(y*_t)

Figure 6.5 illustrates a Viterbi trellis, including backpointers that have been used to

compute the most likely state sequence.
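The recursion and the backpointer bookkeeping can be sketched in Python as follows. The two-state model is invented for illustration (it is not the Figure 6.3 tagger), and the recovered sequence agrees with brute-force maximization over all state sequences.

```python
from itertools import product

# Hypothetical two-state HMM (parameters invented for illustration).
S = ["nn", "v"]
pi = {"nn": 0.8, "v": 0.2}
A = {"nn": {"nn": 0.3, "v": 0.7}, "v": {"nn": 0.4, "v": 0.6}}
B = {"nn": {"John": 0.7, "might": 0.2, "wash": 0.1},
     "v":  {"John": 0.1, "might": 0.5, "wash": 0.4}}

def viterbi(x):
    """Most probable state sequence via gamma_t(q) and backpointers bp_t(q)."""
    gamma = [{q: pi[q] * B[q][x[0]] for q in S}]   # base case: gamma_1(q)
    bp = [{q: None for q in S}]                    # bp_1(q): placeholder, no predecessor
    for t in range(1, len(x)):
        g, b = {}, {}
        for r in S:
            # maximize (and record the argmax) over trajectories into state r
            best_q = max(S, key=lambda q: gamma[t - 1][q] * A[q][r])
            g[r] = gamma[t - 1][best_q] * A[best_q][r] * B[r][x[t]]
            b[r] = best_q
        gamma.append(g)
        bp.append(b)
    # select the best final state, then follow backpointers right to left
    y = [max(S, key=lambda q: gamma[-1][q])]
    for t in range(len(x) - 1, 0, -1):
        y.append(bp[t][y[-1]])
    return list(reversed(y))
```

With these invented parameters the recovered sequence for ⟨John, might, wash⟩ is ⟨nn, v, v⟩, which happens to match the answer read off the chart in Figure 6.4 for the book's example HMM.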

6.2. HIDDEN MARKOV MODELS 129


Figure 6.6: A “fully observable” HMM training instance. The output sequence is at the top

of the ﬁgure, and the corresponding states and transitions are shown in the trellis below.

6.2.4 PARAMETER ESTIMATION FOR HMMS

We now turn to Question 3: given a set of states S and an observation vocabulary O, what are the parameters θ* = ⟨A, B, π⟩ that maximize the likelihood of a set of training examples, ⟨x_1, x_2, . . . , x_ℓ⟩?[9] Since our model is constructed in terms of variables

whose values we cannot observe (the state sequence) in the training data, we may train

it to optimize the marginal likelihood (summing over all state sequences) of x using

EM. Deriving the EM update equations requires only the application of the techniques

presented earlier in this chapter and some differential calculus. However, since the formalism is cumbersome, we will skip a detailed derivation; readers interested in more information can find it in the relevant citations [78, 125].

In order to make the update equations as intuitive as possible, consider a fully

observable HMM, that is, one where both the emissions and the state sequence are

observable in all training instances. In this case, a training instance can be depicted

as shown in Figure 6.6. When this is the case, such as when we have a corpus of sentences

in which all words have already been tagged with their parts of speech, the maximum

likelihood estimate for the parameters can be computed in terms of the counts of the

number of times the process transitions from state q to state r in all training instances,

T(q → r); the number of times that state q emits symbol o, O(q ↑ o); and the number

of times the process starts in state q, I(q). In this example, the process starts in state

nn; there is one nn → v transition and one v → v transition. The nn state emits John

in the first time step, and the v state emits might and wash in the second and third time

steps, respectively. We also deﬁne N(q) to be the number of times the process enters

state q. The maximum likelihood estimates of the parameters in the fully observable

case are:

[9] Since an HMM models sequences, its training data consists of a collection of example sequences.


π_q = I(q) / Σ_r I(r)

A_q(r) = T(q → r) / N(q) = T(q → r) / Σ_{r′} T(q → r′)

B_q(o) = O(q ↑ o) / N(q) = O(q ↑ o) / Σ_{o′} O(q ↑ o′)    (6.2)

For example, to compute the emission parameters from state nn, we simply need to keep

track of the number of times the process is in state nn and what symbol it generates

at each of these times. Transition probabilities are computed similarly: to compute, for

example, the distribution A

det

(), that is, the probabilities of transitioning away from

state det, we count the number of times the process is in state det, and keep track

of what state the process transitioned into at the next time step. This counting and normalizing can be accomplished using the exact same counting and relative frequency algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter

estimation is not a new algorithm at all, but one we have seen before.
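The counting view of Equation 6.2 can be sketched directly in Python. The tiny tagged corpus below is invented for illustration; transitions, emissions, and initial states are simply counted and normalized, exactly as in the relative frequency estimation of Section 3.3.

```python
from collections import Counter

# A tiny, hypothetical fully observed corpus: each training instance is a
# sequence of (state, emission) pairs, as in Figure 6.6.
data = [
    [("nn", "John"), ("v", "might"), ("v", "wash")],
    [("det", "the"), ("adj", "green"), ("nn", "dog")],
]

I, T, O, N = Counter(), Counter(), Counter(), Counter()
for inst in data:
    I[inst[0][0]] += 1                      # I(q): process starts in state q
    for q, o in inst:
        O[(q, o)] += 1                      # O(q ↑ o): state q emits symbol o
        N[q] += 1                           # N(q): process enters state q
    for (q, _), (r, _) in zip(inst, inst[1:]):
        T[(q, r)] += 1                      # T(q → r): transition counts

states = sorted(N)
pi = {q: I[q] / sum(I.values()) for q in states}
# Normalize transition rows by total outgoing transitions from q,
# skipping states that never transition out (e.g., sequence-final states only).
A = {q: {r: T[(q, r)] / sum(T[(q, r2)] for r2 in states) for r in states}
     for q in states if any(T[(q, r)] for r in states)}
B = {q: {o: c / N[q] for (q2, o), c in O.items() if q2 == q} for q in states}
```

Each resulting row of A and B, and the distribution π, sums to one, so the counts really do define proper probability distributions.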

How should the model parameters be estimated when the state sequence is not

provided? It turns out that the update equations have the satisfying form where the

optimal parameter values for iteration i + 1 are expressed in terms of the expectations of

the counts referenced in the fully observed case, according to the posterior distribution

over the latent variables given the observations x and the parameters θ^(i):

π_q = E[I(q)] / Σ_r E[I(r)]

A_q(r) = E[T(q → r)] / E[N(q)]

B_q(o) = E[O(q ↑ o)] / E[N(q)]    (6.3)

Because of the independence assumptions made in the HMM, the update equations consist of 2|S| + 1 independent optimization problems, just as was the case with the 'observable' HMM. Solving for the initial state distribution, π, is one problem; there are |S| problems solving for the transition distributions A_q(·) from each state q; and |S| problems solving for the emission distributions B_q(·) from each state q. Furthermore, we note that the following must hold:

E[N(q)] = Σ_{r ∈ S} E[T(q → r)] = Σ_{o ∈ O} E[O(q ↑ o)]

As a result, the optimization problems (i.e., Equations 6.2) require completely independent sets of statistics, which we will utilize later to facilitate efficient parallelization in MapReduce.

How can the expectations in Equation 6.3 be understood? In the fully observed

training case, between every time step, there is exactly one transition taken and the

source and destination states are observable. By progressing through the Markov chain,

we can let each transition count as ‘1’, and we can accumulate the total number of times

each kind of transition was taken (by each kind, we simply mean the number of times

that one state follows another, for example, the number of times nn follows det). These

statistics can then in turn be used to compute the MLE for an ‘observable’ HMM, as

6.2. HIDDEN MARKOV MODELS 131

described above. However, when the transition sequence is not observable (as is most

often the case), we can instead imagine that at each time step, every possible transition

(there are |S|² of them, and typically |S| is quite small) is taken, with a particular

probability. The probability used is the posterior probability of the transition, given the

model and an observation sequence (we describe how to compute this value below).

By summing over all the time steps in the training data, and using this probability as

the ‘count’ (rather than ‘1’ as in the observable case), we compute the expected count

of the number of times a particular transition was taken, given the training sequence.

Furthermore, since the training instances are statistically independent, the value of the

expectations can be computed by processing each training instance independently and

summing the results.

Similarly for the necessary emission counts (the number of times each symbol in

O was generated by each state in S), we assume that any state could have generated

the observation. We must therefore compute the probability of being in every state at

each time point, which is then the size of the emission ‘count’. By summing over all

time steps we compute the expected count of the number of times that a particular

state generated a particular symbol. These two sets of expectations, which are written

formally here, are suﬃcient to execute the M-step.

E[O(q ↑ o)] = Σ_{i=1}^{|x|} Pr(y_i = q | x; θ) · δ(x_i, o)    (6.4)

E[T(q → r)] = Σ_{i=1}^{|x|−1} Pr(y_i = q, y_{i+1} = r | x; θ)    (6.5)

Posterior probabilities. The expectations necessary for computing the M-step in HMM training are sums of probabilities that a particular transition is taken, given an observation sequence, and that some state emits some observation symbol, given an observation sequence. These are referred to as posterior probabilities, indicating that they are the probability of some event whose distribution we have a prior belief about, after additional evidence has been taken into consideration (here, the model parameters characterize our prior beliefs, and the observation sequence is the evidence). Both posterior probabilities can be computed by combining the forward probabilities, α_t(·), which give the probability of reaching some state at time t, by any path, and generating the observations ⟨x_1, x_2, . . . , x_t⟩, with backward probabilities, β_t(·), which give the probability of starting in some state at time t and generating the rest of the sequence ⟨x_{t+1}, x_{t+2}, . . . , x_{|x|}⟩, using any sequence of states to do so. The algorithm for computing the backward probabilities is given a bit later. Once the forward and backward probabilities have been computed, the state transition posterior probabilities and the emission posterior probabilities can be written as follows:



Figure 6.7: Using forward and backward probabilities to compute the posterior probability of the dashed transition, given the observation sequence a b b c b. The shaded area on the left corresponds to the forward probability α_2(s_2), and the shaded area on the right corresponds to the backward probability β_3(s_2).

Pr(y_i = q | x; θ) = α_i(q) · β_i(q)    (6.6)

Pr(y_i = q, y_{i+1} = r | x; θ) = α_i(q) · A_q(r) · B_r(x_{i+1}) · β_{i+1}(r)    (6.7)

Equation 6.6 is the probability of being in state q at time i, given x, and the correctness

of the expression should be clear from the deﬁnitions of forward and backward proba-

bilities. The intuition for Equation 6.7, the probability of taking a particular transition

at a particular time, is also not complicated: it is the product of four conditionally

independent probabilities: the probability of getting to state q at time i (having gener-

ated the ﬁrst part of the sequence), the probability of taking transition q → r (which

is speciﬁed in the parameters, θ), the probability of generating observation x

i+1

from

state r (also speciﬁed in θ), and the probability of generating the rest of the sequence,

along any path. A visualization of the quantities used in computing this probability is

shown in Figure 6.7. In this illustration, we assume an HMM with S = {s_1, s_2, s_3} and O = {a, b, c}.

The backward algorithm. Like the forward and Viterbi algorithms introduced above to answer Questions 1 and 2, the backward algorithm uses dynamic programming to incrementally compute β_t(·). Its base case starts at time |x|, and is defined as follows:


β_{|x|}(q) = 1

To understand the intuition for this base case, keep in mind that since the backward probabilities β_t(·) are the probability of generating the remainder of the sequence after time t (as well as being in some state), and since there is nothing left to generate after time |x|, the probability must be 1. The recursion is defined as follows:

β_t(q) = Σ_{r ∈ S} β_{t+1}(r) · A_q(r) · B_r(x_{t+1})

Unlike the forward and Viterbi algorithms, the backward algorithm is computed from

right to left and makes no reference to the start probabilities, π.
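A Python sketch of the backward recursion, paired with the forward algorithm, makes the key identity checkable. The two-state model is invented for illustration; for every time t, the quantity Σ_q α_t(q) β_t(q) equals Pr(x; θ), which is what ties the forward and backward passes to the posterior computations of Equations 6.6 and 6.7.

```python
# Hypothetical two-state HMM (parameters invented for illustration).
S = ["nn", "v"]
pi = {"nn": 0.8, "v": 0.2}
A = {"nn": {"nn": 0.3, "v": 0.7}, "v": {"nn": 0.4, "v": 0.6}}
B = {"nn": {"John": 0.7, "might": 0.2, "wash": 0.1},
     "v":  {"John": 0.1, "might": 0.5, "wash": 0.4}}

def forward(x):
    """alpha[t][q]: probability of emitting x[0..t] and being in state q at step t."""
    alpha = [{q: pi[q] * B[q][x[0]] for q in S}]
    for t in range(1, len(x)):
        alpha.append({r: sum(alpha[t - 1][q] * A[q][r] for q in S) * B[r][x[t]]
                      for r in S})
    return alpha

def backward(x):
    """beta[t][q]: probability of generating x[t+1:], given state q at position t."""
    beta = [None] * len(x)
    beta[-1] = {q: 1.0 for q in S}                # base case: beta_{|x|}(q) = 1
    for t in range(len(x) - 2, -1, -1):           # computed right to left
        beta[t] = {q: sum(beta[t + 1][r] * A[q][r] * B[r][x[t + 1]] for r in S)
                   for q in S}
    return beta

x = ("John", "might", "wash")
alpha, beta = forward(x), backward(x)
p_x = sum(alpha[-1].values())                     # Pr(x) from the forward pass
```

Seeding the backward pass with π and the first emission recovers the same marginal, and α_t(q) · β_t(q) summed over q reproduces Pr(x) at every time step.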

6.2.5 FORWARD-BACKWARD TRAINING: SUMMARY

In the preceding section, we have shown how to compute all quantities needed to find the parameter settings θ^(i+1) using EM training with a hidden Markov model M = ⟨S, O, θ^(i)⟩. To recap: each training instance x is processed independently, using the parameter settings of the current iteration, θ^(i). For each x in the training data, the

forward and backward probabilities are computed using the algorithms given above

(for this reason, this training algorithm is often referred to as the forward-backward

algorithm). The forward and backward probabilities are in turn used to compute the

expected number of times the underlying Markov process enters into each state, the

number of times each state generates each output symbol type, and the number of times

each state transitions into each other state. These expectations are summed over all

training instances, completing the E-step. The M-step involves normalizing the expected

counts computed in the E-step using the calculations in Equation 6.3, which yields θ^(i+1).

The process then repeats from the E-step using the new parameters. The number of

iterations required for convergence depends on the quality of the initial parameters,

and the complexity of the model. For some applications, only a handful of iterations

are necessary, whereas for others, hundreds may be required.

Finally, a few practical considerations: HMMs have a non-convex likelihood surface (meaning that the surface has the equivalent of many hills and valleys in a number of dimensions corresponding to the number of parameters in the model). As a result, EM training is

only guaranteed to ﬁnd a local maximum, and the quality of the learned model may vary

considerably, depending on the initial parameters that are used. Strategies for optimal

selection of initial parameters depend on the phenomena being modeled. Additionally,

if some parameter is assigned a probability of 0 (either as an initial value or during

one of the M-step parameter updates), EM will never change this in future iterations.


This can be useful, since it provides a way of constraining the structures of the Markov

model; however, one must be aware of this behavior.

Another pitfall to avoid when implementing HMMs is arithmetic underﬂow.

HMMs typically deﬁne a massive number of sequences, and so the probability of any one

of them is often vanishingly small—so small that they often underﬂow standard ﬂoating

point representations. A very common solution to this problem is to represent prob-

abilities using their logarithms. Note that expected counts do not typically have this

problem and can be represented using normal ﬂoating point numbers. See Section 5.4

for additional discussion on working with log probabilities.

6.3 EM IN MAPREDUCE

Expectation maximization algorithms ﬁt quite naturally into the MapReduce program-

ming model. Although the model being optimized determines the details of the required

computations, MapReduce implementations of EM algorithms share a number of char-

acteristics:

• Each iteration of EM is one MapReduce job.

• A controlling process (i.e., driver program) spawns the MapReduce jobs and keeps track of the number of iterations and convergence criteria.

• Model parameters θ^(i), which are static for the duration of the MapReduce job, are loaded by each mapper from HDFS or other data provider (e.g., a distributed key-value store).

• Mappers map over independent training instances, computing partial latent vari-

able posteriors (or summary statistics, such as expected counts).

• Reducers sum together the required training statistics and solve one or more of

the M-step optimization problems.

• Combiners, which sum together the training statistics, are often quite eﬀective at

reducing the amount of data that must be written to disk.

The degree of parallelization that can be attained depends on the statistical indepen-

dence assumed in the model and in the derived quantities required to solve the opti-

mization problems in the M-step. Since parameters are estimated from a collection of

samples that are assumed to be i.i.d., the E-step can generally be parallelized eﬀectively

since every training instance can be processed independently of the others. In the limit,

in fact, each independent training instance could be processed by a separate mapper![10]

[10] Although the wisdom of doing this is questionable, given that the startup costs associated with individual map tasks in Hadoop may be considerable.
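The characteristics above can be illustrated with a single-machine Python sketch of one EM iteration in the MapReduce style. Everything here is a hypothetical stand-in: the mapper emits one dummy stripe of "expected counts" per training instance (a real HMM mapper would emit the forward-backward expectations), the shuffle is simulated with a dictionary, and the reducer sums the stripes and normalizes, as in Equation 6.3.

```python
from collections import defaultdict

# Plumbing sketch of one EM iteration in the MapReduce style described above.
# The statistics are dummy stripes; in HMM training they would be expected
# initial/emission/transition counts computed from forward-backward posteriors.

def mapper(instance_id, instance, theta):
    """E-step: emit (key, stripe) pairs of partial expected counts."""
    # Hypothetical stand-in: pretend each symbol was emitted by a single state
    # 'q0' with expectation 1.0 (a real mapper would use posteriors under theta).
    stripe = defaultdict(float)
    for symbol in instance:
        stripe[symbol] += 1.0
    yield ("emit from q0", dict(stripe))

def reducer(key, stripes):
    """M-step: sum the stripes, then normalize into a probability distribution."""
    total = defaultdict(float)
    for stripe in stripes:
        for k, v in stripe.items():
            total[k] += v
    z = sum(total.values())
    return key, {k: v / z for k, v in total.items()}

def run_iteration(data, theta):
    """Driver: one MapReduce job = one EM iteration (shuffle simulated in memory)."""
    groups = defaultdict(list)
    for i, inst in enumerate(data):
        for key, stripe in mapper(i, inst, theta):
            groups[key].append(stripe)
    return dict(reducer(k, v) for k, v in groups.items())

new_theta = run_iteration([["a", "b", "b"], ["b", "c"]], theta={})
```

Each key corresponds to one independent M-step optimization problem, so distinct keys could be handled by distinct reducers running in parallel.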


Reducers, however, must aggregate the statistics necessary to solve the optimization problems as required by the model. The degree to which these may be solved independently depends on the structure of the model, and this constrains the number of reducers that may be used. Fortunately, many common models (such as HMMs) require solving several independent optimization problems in the M-step. In this situation,

a number of reducers may be run in parallel. Still, it is possible that in the worst case,

the M-step optimization problem will not decompose into independent subproblems,

making it necessary to use a single reducer.

6.3.1 HMM TRAINING IN MAPREDUCE

As we would expect, the training of hidden Markov models parallelizes well in MapReduce. The process can be summarized as follows: in each iteration, mappers process training instances, emitting expected event counts computed using the forward-backward algorithm introduced in Section 6.2.4. Reducers aggregate the expected counts, completing the E-step, and then generate parameter estimates for the next iteration using the updates given in Equation 6.3.

This parallelization strategy is eﬀective for several reasons. First, the majority of

the computational eﬀort in HMM training is the running of the forward and backward

algorithms. Since there is no limit on the number of mappers that may be run, the

full computational resources of a cluster may be brought to bear to solve this problem.

Second, since the M-step of an HMM training iteration with |S| states in the model consists of 2|S| + 1 independent optimization problems that require non-overlapping sets of statistics, this may be exploited with as many as 2|S| + 1 reducers running

in parallel. While the optimization problem is computationally trivial, being able to

reduce in parallel helps avoid the data bottleneck that would limit performance if only

a single reducer is used.

The quantities that are required to solve the M-step optimization problem are

quite similar to the relative frequency estimation example discussed in Section 3.3;

however, rather than counts of observed events, we aggregate expected counts of events.

As a result of the similarity, we can employ the stripes representation for aggregating

sets of related values, as described in Section 3.2. A pairs approach that requires less

memory at the cost of slower performance is also feasible.

HMM training mapper. The pseudo-code for the HMM training mapper is given

in Figure 6.8. The input consists of key-value pairs with a unique id as the key and

a training instance (e.g., a sentence) as the value. For each training instance, 2n + 1

stripes are emitted with unique keys, and every training instance emits the same set

of keys. Each unique key corresponds to one of the independent optimization problems

that will be solved in the M-step. The outputs are:

136 CHAPTER 6. EM ALGORITHMS FOR TEXT PROCESSING

1: class Mapper
2:   method Initialize(integer iteration)
3:     ⟨S, O⟩ ← ReadModel
4:     θ ← ⟨A, B, π⟩ ← ReadModelParams(iteration)
5:   method Map(sample id, sequence x)
6:     α ← Forward(x, θ)                                cf. Section 6.2.2
7:     β ← Backward(x, θ)                               cf. Section 6.2.4
8:     I ← new AssociativeArray                         Initial state expectations
9:     for all q ∈ S do                                 Loop over states
10:      I{q} ← α_1(q) · β_1(q)
11:    O ← new AssociativeArray of AssociativeArray     Emissions
12:    for t = 1 to |x| do                              Loop over observations
13:      for all q ∈ S do                               Loop over states
14:        O{q}{x_t} ← O{q}{x_t} + α_t(q) · β_t(q)
15:      t ← t + 1
16:    T ← new AssociativeArray of AssociativeArray     Transitions
17:    for t = 1 to |x| − 1 do                          Loop over observations
18:      for all q ∈ S do                               Loop over states
19:        for all r ∈ S do                             Loop over states
20:          T{q}{r} ← T{q}{r} + α_t(q) · A_q(r) · B_r(x_{t+1}) · β_{t+1}(r)
21:      t ← t + 1
22:    Emit(string 'initial ', stripe I)
23:    for all q ∈ S do                                 Loop over states
24:      Emit(string 'emit from ' + q, stripe O{q})
25:      Emit(string 'transit from ' + q, stripe T{q})

Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers map over training instances (i.e., sequences of observations x_i) and generate the expected counts of initial states, emissions, and transitions taken to generate the sequence.

1: class Combiner
2:   method Combine(string t, stripes [C_1, C_2, …])
3:     C_f ← new AssociativeArray
4:     for all stripe C ∈ stripes [C_1, C_2, …] do
5:       Sum(C_f, C)
6:     Emit(string t, stripe C_f)

1: class Reducer
2:   method Reduce(string t, stripes [C_1, C_2, …])
3:     C_f ← new AssociativeArray
4:     for all stripe C ∈ stripes [C_1, C_2, …] do
5:       Sum(C_f, C)
6:     z ← 0
7:     for all ⟨k, v⟩ ∈ C_f do
8:       z ← z + v
9:     P_f ← new AssociativeArray                       Final parameters vector
10:    for all ⟨k, v⟩ ∈ C_f do
11:      P_f{k} ← v/z
12:    Emit(string t, stripe P_f)

Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM. The HMMs considered in this book are fully parameterized by multinomial distributions, so reducers do not require special logic to handle different types of model parameters (since they are all of the same type).

1. the probabilities that the unobserved Markov process begins in each state q, with a unique key designating that the values are initial state counts;

2. the expected number of times that state q generated each emission symbol o (the set of emission symbols included will be just those found in each training instance x), with a key indicating that the associated value is a set of emission counts from state q; and

3. the expected number of times state q transitions to each state r, with a key indicating that the associated value is a set of transition counts from state q.

HMM training reducer. The reducer for one iteration of HMM training, shown together with an optional combiner in Figure 6.9, aggregates the count collections associated with each key by summing them. When the values for each key have been completely aggregated, the associative array contains all of the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameter settings for the following iteration are computed simply by taking the relative frequency of each event with respect to its expected count at the current iteration. The newly computed parameters are emitted from the reducer and written to HDFS. Note that they will be spread across 2|S| + 1 keys, representing the initial state probabilities π, the transition probabilities A_q for each state q, and the emission probabilities B_q for each state q.
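The combiner/reducer logic just described (sum the stripes for a key, then normalize into a probability distribution) can be sketched on a single machine. The function name and stripe layout below are illustrative, not part of the book's pseudo-code:

```python
def reduce_stripe(key, stripes):
    """Sum expected-count stripes for one key, then normalize to probabilities.

    Each stripe is a dict mapping events (states or emission symbols) to
    expected counts; the returned dict is the parameter vector (a multinomial
    distribution) for the next EM iteration.
    """
    totals = {}
    for stripe in stripes:
        for event, count in stripe.items():
            totals[event] = totals.get(event, 0.0) + count
    z = sum(totals.values())  # normalizer: total expected count under this key
    return key, {event: count / z for event, count in totals.items()}
```

For example, `reduce_stripe('transit from q1', [{'q1': 1.0, 'q2': 3.0}, {'q2': 4.0}])` sums the two stripes and returns the normalized distribution {q1: 0.125, q2: 0.875}.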

6.4 CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION

To illustrate the real-world beneﬁts of expectation maximization algorithms using Map-

Reduce, we turn to the problem of word alignment, which is an important task in sta-

tistical machine translation that is typically solved using models whose parameters are

learned with EM.

We begin by giving a brief introduction to statistical machine translation and

the phrase-based translation approach; for a more comprehensive introduction, refer to

[85, 97]. Fully-automated translation has been studied since the earliest days of elec-

tronic computers. After successes with code-breaking during World War II, there was

considerable optimism that translation of human languages would be another soluble

problem. In the early years, work on translation was dominated by manual attempts to

encode linguistic knowledge into computers—another instance of the ‘rule-based’ ap-

proach we described in the introduction to this chapter. These early attempts failed to

live up to the admittedly optimistic expectations. For a number of years, the idea of

fully automated translation was viewed with skepticism. Not only was constructing a

translation system labor intensive, but translation pairs had to be developed independently, meaning that improvements in a Russian-English translation system could not, for the most part, be leveraged to improve a French-English system.

After languishing for a number of years, the ﬁeld was reinvigorated in the late

1980s when researchers at IBM pioneered the development of statistical machine translation (SMT), which took a data-driven approach to solving the problem of machine translation, attempting both to improve the quality of translation and to reduce the cost of developing systems [29]. The core idea of SMT is to equip the computer to learn

how to translate, using example translations which are produced for other purposes, and

modeling the process as a statistical process with some parameters θ relating strings

in a source language (typically denoted as f) to strings in a target language (typically

denoted as e):

e* = arg max_e Pr(e | f; θ)

With the statistical approach, translation systems can be developed cheaply and

quickly for any language pair, as long as there is suﬃcient training data available.

Furthermore, improvements in learning algorithms and statistical modeling can yield

beneﬁts in many translation pairs at once, rather than being speciﬁc to individual

language pairs. Thus, SMT, like many other topics we are considering in this book,

is an attempt to leverage the vast quantities of textual data that are available to solve

problems that would otherwise require considerable manual eﬀort to encode specialized

knowledge. Since the advent of statistical approaches to translation, the ﬁeld has grown

tremendously and numerous statistical models of translation have been developed, with

many incorporating quite specialized knowledge about the behavior of natural language

as biases in their learning algorithms.

6.4.1 STATISTICAL PHRASE-BASED TRANSLATION

One approach to statistical translation that is simple yet powerful is called phrase-based translation [86]. We provide a rough outline of the process since it is representative of most state-of-the-art statistical translation systems, such as the one used inside Google Translate.¹¹ Phrase-based translation works by learning how strings of words, called phrases, translate between languages.¹² Example phrase pairs for Spanish-English translation might include ⟨los estudiantes, the students⟩, ⟨los estudiantes, some students⟩, and ⟨soy, i am⟩. From a few hundred thousand sentences of example translations, many millions of such phrase pairs may be automatically learned.

The starting point is typically a parallel corpus (also called bitext), which contains pairs of sentences in two languages that are translations of each other. Parallel corpora

¹¹ http://translate.google.com
¹² Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any linguistic theory.


are frequently generated as the byproduct of an organization’s eﬀort to disseminate in-

formation in multiple languages, for example, proceedings of the Canadian Parliament

in French and English, and text generated by the United Nations in many diﬀerent

languages. The parallel corpus is then annotated with word alignments, which indicate

which words in one language correspond to words in the other. By using these word

alignments as a skeleton, phrases that are likely to preserve the meaning relationships represented by the word alignment can be extracted from the sentence pair. While an explanation of the process is not necessary here, we mention it as a motivation for learning word alignments, which we show below how to compute with EM. After phrase extraction, each phrase pair is associated with a number of scores which, taken together,

are used to compute the phrase translation probability, a conditional probability that

reﬂects how likely the source phrase translates into the target phrase. We brieﬂy note

that although EM could be utilized to learn the phrase translation probabilities, this

is not typically done in practice since the maximum likelihood solution turns out to be

quite bad for this problem. The collection of phrase pairs and their scores is referred to

as the translation model. In addition to the translation model, phrase-based translation

depends on a language model, which gives the probability of a string in the target lan-

guage. The translation model attempts to preserve the meaning of the source language

during the translation process, while the language model ensures that the output is

ﬂuent and grammatical in the target language. The phrase-based translation process is

summarized in Figure 6.10.

A language model gives the probability that a string of words w = ⟨w_1, w_2, …, w_n⟩, written as w_1^n for short, is a string in the target language. By the chain rule of probability, we get:

Pr(w_1^n) = Pr(w_1) Pr(w_2 | w_1) Pr(w_3 | w_1^2) ⋯ Pr(w_n | w_1^{n−1}) = ∏_{k=1}^{n} Pr(w_k | w_1^{k−1})    (6.8)

Due to the extremely large number of parameters involved in estimating such a model directly, it is customary to make the Markov assumption: that sequence histories only depend on prior local context. That is, an n-gram language model is equivalent to an (n−1)th-order Markov model. Thus, we can approximate Pr(w_k | w_1^{k−1}) as follows:

bigrams:   Pr(w_k | w_1^{k−1}) ≈ Pr(w_k | w_{k−1})    (6.9)
trigrams:  Pr(w_k | w_1^{k−1}) ≈ Pr(w_k | w_{k−1} w_{k−2})    (6.10)
n-grams:   Pr(w_k | w_1^{k−1}) ≈ Pr(w_k | w_{k−n+1}^{k−1})    (6.11)
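The maximum likelihood estimates under these approximations are just relative frequencies of n-grams against their (n−1)-gram histories. A minimal sketch in Python (illustrative only, assuming the whole corpus fits in memory; a MapReduce job would shard the counting):

```python
from collections import Counter

def ngram_mle(tokens, n):
    """Relative-frequency estimates Pr(w_k | w_{k-n+1}^{k-1}) from a token list."""
    # Count n-grams and their (n-1)-gram histories over the same positions,
    # so that the conditional probabilities for each history sum to one.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return {g: c / histories[g[:-1]] for g, c in ngrams.items()}
```

On the toy corpus a b a b a c, the bigram estimates are Pr(b|a) = 2/3, Pr(c|a) = 1/3, and Pr(a|b) = 1.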

The probabilities used in computing Pr(w_1^n) based on an n-gram language model are generally estimated from a monolingual corpus of target language text. Since only target

[Figure 6.10 diagram: parallel sentences (e.g., vi la mesa pequeña / i saw the small table) pass through word alignment and phrase extraction to yield phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table), which form the translation model; monolingual target-language text (e.g., he sat at the table; the service was good) is used to estimate the language model; both serve as input to the decoder, which maps a foreign input sentence (maria no daba una bofetada a la bruja verde) to an English output sentence (mary did not slap the green witch).]

Figure 6.10: The standard phrase-based machine translation architecture. The translation

model is constructed with phrases extracted from a word-aligned parallel corpus. The language

model is estimated from a monolingual corpus. Both serve as input to the decoder, which

performs the actual translation.

[Figure 6.11 grid: candidate phrase translations covering substrings of the source sentence Maria no dio una bofetada a la bruja verde, including Mary, not, did not, no, give a slap, slap, a slap, to the, the, the witch, green witch; the best translation path covers each source word exactly once.]

Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde

by a phrase-based model. The best possible translation path is indicated with a dashed line.


language text is necessary (without any additional annotation), language modeling has

been well served by large-data approaches that take advantage of the vast quantities of

text available on the web.

To translate an input sentence f, the phrase-based decoder creates a matrix of all

translation possibilities of all substrings in the input string, as illustrated in

Figure 6.11. A sequence of phrase pairs is selected such that each word in f is translated

exactly once.¹³

The decoder seeks to ﬁnd the translation that maximizes the product of

the translation probabilities of the phrases used and the language model probability of

the resulting string in the target language. Because the phrase translation probabilities

are independent of each other, and because of the Markov assumption made in the language model,

this may be done eﬃciently using dynamic programming. For a detailed introduction

to phrase-based decoding, we refer the reader to a recent textbook by Koehn [85].

6.4.2 BRIEF DIGRESSION: LANGUAGE MODELING WITH MAPREDUCE

Statistical machine translation provides the context for a brief digression on distributed

parameter estimation for language models using MapReduce, and provides another

example illustrating the effectiveness of data-driven approaches in general. We briefly

touched upon this work in Chapter 1. Even after making the Markov assumption, training n-gram language models still requires estimating an enormous number of parameters: potentially V^n, where V is the number of words in the vocabulary. For higher-order

models (e.g., 5-grams) used in real-world applications, the number of parameters can

easily exceed the number of words from which to estimate those parameters. In fact,

most n-grams will never be observed in a corpus, no matter how large. To cope with this

sparseness, researchers have developed a number of smoothing techniques [102], which

all share the basic idea of moving probability mass from observed to unseen events in

a principled manner. For many applications, a state-of-the-art approach is known as

Kneser-Ney smoothing [35].

In 2007, Brants et al. [25] reported experimental results that answered an inter-

esting question: given the availability of large corpora (i.e., the web), could a simpler

smoothing strategy, applied to more text, beat Kneser-Ney in a machine translation

task? It should come as no surprise that the answer is yes. Brants et al. introduced a

technique known as "stupid backoff" that was exceedingly simple and so naïve that the resulting model didn't even define a valid probability distribution (it assigned arbitrary scores as opposed to probabilities). The simplicity, however, afforded an extremely scalable implementation in MapReduce. With smaller corpora, stupid backoff didn't work

as well as Kneser-Ney in generating accurate and ﬂuent translations. However, as the

amount of data increased, the gap between stupid backoﬀ and Kneser-Ney narrowed,

¹³ The phrases may not necessarily be selected in a strict left-to-right order. Being able to vary the order of the phrases used is necessary since languages may express the same ideas using different word orders.

and eventually disappeared with suﬃcient data. Furthermore, with stupid backoﬀ it

was possible to train a language model on more data than was feasible with Kneser-

Ney smoothing. Applying this language model to a machine translation task yielded

better results than a (smaller) language model trained with Kneser-Ney smoothing.
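As described by Brants et al., stupid backoff scores an n-gram by its relative frequency if it was observed, and otherwise backs off to the (n−1)-gram score scaled by a constant (0.4 in their paper), bottoming out at unigram relative frequency. A minimal single-machine sketch (the recursive formulation and count-table layout are our illustration):

```python
ALPHA = 0.4  # backoff factor used by Brants et al.; any constant works

def stupid_backoff(words, counts, total):
    """Score S(w_k | history) for the n-gram `words` (a tuple of tokens).

    `counts` maps n-gram tuples to raw corpus counts; `total` is the corpus
    size in tokens. The scores are not normalized, so this is not a true
    probability distribution.
    """
    if len(words) == 1:
        return counts.get(words, 0) / total          # unigram base case
    if counts.get(words, 0) > 0:
        return counts[words] / counts[words[:-1]]    # observed: relative frequency
    return ALPHA * stupid_backoff(words[1:], counts, total)  # back off
```

With counts {(the): 5, (cat): 2, (the, cat): 2} over a 10-token corpus, the observed bigram (the, cat) scores 2/5, while the unseen (a, cat) backs off to 0.4 × 2/10.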

The role of the language model in statistical machine translation is to select

ﬂuent, grammatical translations from a large hypothesis space: the more training data a

language model has access to, the better its description of relevant language phenomena

and hence its ability to select good translations. Once again, large data triumphs! For

more information about estimating language models using MapReduce, we refer the

reader to a forthcoming book from Morgan & Claypool [26].

6.4.3 WORD ALIGNMENT

Word alignments, which are necessary for building phrase-based translation models (as

well as many other more sophisticated translation models), can be learned automatically

using EM. In this section, we introduce a popular alignment model based on HMMs.

In the statistical model of word alignment considered here, the observable variables

are the words in the source and target sentences (conventionally written using the

variables f and e, respectively), and their alignment is a latent variable. To make this

model tractable, we assume that words are translated independently of one another,

which means that the model’s parameters include the probability of any word in the

source language translating to any word in the target language. While this independence

assumption is problematic in many ways, it results in a simple model structure that

admits eﬃcient inference yet produces reasonable alignments. Alignment models that

make this assumption generate a string e in the target language by selecting words in

the source language according to a lexical translation distribution. The indices of the

words in f used to generate each word in e are stored in an alignment variable, a.¹⁴ This means that the variable a_i indicates the source word position of the i-th target word generated, and |a| = |e|. Using these assumptions, the probability of an alignment and translation can be written as follows:

Pr(e, a | f) = Pr(a | f, e) · ∏_{i=1}^{|e|} Pr(e_i | f_{a_i})

where the first factor is the alignment probability and the product over i is the lexical probability.

Since we have parallel corpora consisting of only ⟨f, e⟩ pairs, we can learn the parameters for this model using EM and treating a as a latent variable. However, to combat

¹⁴ In the original presentation of statistical lexical translation models, a special null word is added to the source sentences, which permits words to be inserted 'out of nowhere'. Since this does not change any of the important details of training, we omit it from our presentation for simplicity.


data sparsity in the alignment probability, we must make some further simplifying as-

sumptions. By letting the probability of an alignment depend only on the position of

the previous aligned word we capture a valuable insight (namely, words that are nearby

in the source language will tend to be nearby in the target language), and our model

acquires the structure of an HMM [150]:

Pr(e, a | f) = ∏_{i=1}^{|e|} Pr(a_i | a_{i−1}) · ∏_{i=1}^{|e|} Pr(e_i | f_{a_i})

where the first product collects the transition probabilities and the second the emission probabilities.

This model can be trained using the forward-backward algorithm described in the pre-

vious section, summing over all settings of a, and the best alignment for a sentence pair

can be found using the Viterbi algorithm.

To properly initialize this HMM, it is conventional to further simplify the align-

ment probability model, and use this simpler model to learn initial lexical translation

(emission) parameters for the HMM. The favored simpliﬁcation is to assert that all

alignments are uniformly probable:

Pr(e, a | f) = (1 / |f|^{|e|}) ∏_{i=1}^{|e|} Pr(e_i | f_{a_i})

This model is known as IBM Model 1. It is attractive for initialization because it is

convex everywhere, and therefore EM will learn the same solution regardless of initialization. Finally, while the forward-backward algorithm could be used to compute the expected counts necessary for training this model by setting A_q(r) to be a constant value for all q and r, the uniformity assumption means that the expected emission counts can be estimated in time O(|e| · |f|), rather than the time O(|e| · |f|²) required by the forward-backward algorithm.
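Because each target word's generator is chosen independently and all alignments are equally likely under Model 1, the posterior over source positions for each target word is just the renormalized lexical probabilities, which is what yields the O(|e| · |f|) E-step. A sketch of the per-sentence expected-count computation (function and table names are hypothetical, not from the book):

```python
def model1_expected_counts(f_words, e_words, t):
    """Accumulate expected lexical-translation counts for one sentence pair.

    `t` maps (e, f) word pairs to the current translation probabilities
    Pr(e | f). Returns a dict of expected counts for the M-step; a mapper
    would emit these, and reducers would sum and normalize them.
    """
    counts = {}
    for e in e_words:
        # Posterior over source positions is proportional to t[(e, f)].
        z = sum(t[(e, f)] for f in f_words)
        for f in f_words:
            counts[(e, f)] = counts.get((e, f), 0.0) + t[(e, f)] / z
    return counts
```

For instance, with f = [x, y], e = [a], and current probabilities t(a|x) = 0.3, t(a|y) = 0.1, the expected counts are 0.75 for (a, x) and 0.25 for (a, y).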

6.4.4 EXPERIMENTS

How well does a MapReduce word aligner for statistical machine translation perform?

We describe previously-published results [54] that compared a Java-based Hadoop im-

plementation against a highly optimized word aligner called Giza++ [112], which was

written in C++ and designed to run eﬃciently on a single core. We compared the train-

ing time of Giza++ and our aligner on a Hadoop cluster with 19 slave nodes, each with

two single-core processors and two disks (38 cores total).

Figure 6.12 shows the performance of Giza++ in terms of the running time of a

single EM iteration for both Model 1 and the HMM alignment model as a function of

the number of training pairs. Both axes in the ﬁgure are on a log scale, but the ticks


on the y-axis are aligned with ‘meaningful’ time intervals rather than exact orders of

magnitude. There are three things to note. First, the running time scales linearly with

the size of the training data. Second, the HMM is a constant factor slower than Model 1.

Third, the alignment process is quite slow as the size of the training data grows—at one

million sentences, a single iteration takes over three hours to complete! Five iterations

are generally necessary to train the models, which means that full training takes the

better part of a day.

In Figure 6.13 we plot the running time of our MapReduce implementation run-

ning on the 38-core cluster described above. For reference, we plot points indicating

what 1/38 of the running time of the Giza++ iterations would be at each data size,

which gives a rough indication of what an ‘ideal’ parallelization could achieve, assum-

ing that there was no overhead associated with distributing computation across these

machines. Three things may be observed in the results. First, as the amount of data

increases, the relative cost of the overhead associated with distributing data, marshal-

ing and aggregating counts, decreases. At one million sentence pairs of training data,

the HMM alignment iterations begin to approach optimal runtime eﬃciency. Second,

Model 1, which we observe is light on computation, does not approach the theoretical

performance of an ideal parallelization, and in fact, has almost the same running time

as the HMM alignment algorithm. We conclude that the overhead associated with dis-

tributing and aggregating data is signiﬁcant compared to the Model 1 computations,

although a comparison with Figure 6.12 indicates that the MapReduce implementation

is still substantially faster than the single core implementation, at least once a certain

training data size is reached. Finally, we note that, in comparison to the running times

of the single-core implementation, at large data sizes, there is a signiﬁcant advantage

to using the distributed implementation, even of Model 1.

Although these results do confound several variables (Java vs. C++ performance,

memory usage patterns), it is reasonable to expect that the confounds would tend to

make the single-core system’s performance appear relatively better than the MapReduce

system (which is, of course, the opposite pattern from what we actually observe). Fur-

thermore, these results show that when computation is distributed over a cluster of

many machines, even an unsophisticated implementation of the HMM aligner could

compete favorably with a highly optimized single-core system whose performance is

well-known to many people in the MT research community.

Why are these results important? Perhaps the most signiﬁcant reason is that

the quantity of parallel data that is available to train statistical machine translation

models is ever increasing, and as is the case with so many problems we have encountered,

more data leads to improvements in translation quality [54]. Recently a corpus of one

billion words of French-English data was mined automatically from the web and released

[Figure 6.12 plot: average iteration latency, on a log scale from 3 s to 3 hrs, against corpus size in sentences (10000 to 1e+06), with curves for Model 1 and HMM.]

Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM

training iterations at various corpus sizes.

[Figure 6.13 plot: time in seconds, on a log scale from 3 s to 3 hrs, against corpus size in sentences (10000 to 1e+06), with curves for Optimal Model 1 (Giza/38), Optimal HMM (Giza/38), MapReduce Model 1 (38 M/R), and MapReduce HMM (38 M/R).]

Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training

iterations at various corpus sizes. For reference, 1/38 running times of the Giza++ models are

shown.


publicly [33].¹⁵

Single-core solutions to model construction simply cannot keep pace with

the amount of translated data that is constantly being produced. Fortunately, several

independent researchers have shown that existing modeling algorithms can be expressed

naturally and eﬀectively using MapReduce, which means that we can take advantage

of this data. Furthermore, the results presented here show that even at data sizes

that may be tractable on single machines, signiﬁcant performance improvements are

attainable using MapReduce implementations. This improvement reduces experimental

turnaround times, which allows researchers to more quickly explore the solution space—

which will, we hope, lead to rapid new developments in statistical machine translation.

For the reader interested in statistical machine translation, there is an open source

Hadoop-based MapReduce implementation of a training pipeline for phrase-based trans-

lation that includes word alignment, phrase extraction, and phrase scoring [56].

6.5 EM-LIKE ALGORITHMS

This chapter has focused on expectation maximization algorithms and their implemen-

tation in the MapReduce programming framework. These important algorithms are

indispensable for learning models with latent structure from unannotated data, and

they can be implemented quite naturally in MapReduce. We now explore some related

learning algorithms that are similar to EM but can be used to solve more general

problems, and discuss their implementation.

In this section we focus on gradient-based optimization, which refers to a class of

techniques used to optimize any objective function, provided it is diﬀerentiable with

respect to the parameters being optimized. Gradient-based optimization is particularly

useful in the learning of maximum entropy (maxent) models [110] and conditional ran-

dom ﬁelds (CRF) [87] that have an exponential form and are trained to maximize

conditional likelihood. In addition to being widely used supervised classification models in text processing (meaning that during training, both the data and their annotations must be observable), these models have gradients that take the form of expectations. As a result, some of

the previously-introduced techniques are also applicable for optimizing these models.

6.5.1 GRADIENT-BASED OPTIMIZATION AND LOG-LINEAR MODELS

Gradient-based optimization refers to a class of iterative optimization algorithms that

use the derivatives of a function to ﬁnd the parameters that yield a minimal or maximal

value of that function. Obviously, these algorithms are only applicable in cases where a

useful objective exists, is diﬀerentiable, and its derivatives can be eﬃciently evaluated.

Fortunately, this is the case for many important problems of interest in text process-

ing. For the purposes of this discussion, we will give examples in terms of minimizing

functions.

¹⁵ http://www.statmt.org/wmt10/translation-task.html

Assume that we have some real-valued function F(θ) where θ is a k-dimensional vector and that F is differentiable with respect to θ. Its gradient is defined as:

∇F(θ) = ⟨ ∂F/∂θ_1(θ), ∂F/∂θ_2(θ), …, ∂F/∂θ_k(θ) ⟩

The gradient has two crucial properties that are exploited in gradient-based optimization. First, the gradient ∇F is a vector field that points in the direction of the greatest increase of F and whose magnitude indicates the rate of increase. Second, if θ* is a (local) minimum of F, then the following is true:

∇F(θ*) = 0

An extremely simple gradient-based minimization algorithm produces a series of parameter estimates θ^(1), θ^(2), … by starting with some initial parameter settings θ^(1) and updating parameters through successive iterations according to the following rule:

θ^(i+1) = θ^(i) − η^(i) ∇F(θ^(i))    (6.12)

The parameter η^(i) > 0 is a learning rate which indicates how quickly the algorithm moves along the gradient during iteration i. Provided this value is small enough that F decreases, this strategy will find a local minimum of F. However, while simple, this update strategy may converge slowly, and proper selection of η is non-trivial. More sophisticated algorithms perform updates that are informed by approximations of the second derivative, which are estimated by successive evaluations of ∇F(θ), and can converge much more rapidly [96].
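The update rule of Equation 6.12 takes only a few lines of code; the sketch below applies it to a toy convex function with a fixed learning rate (illustrative only; real optimizers adapt η and test for convergence):

```python
def gradient_descent(grad, theta, eta=0.1, iterations=100):
    """Minimize F by iterating theta ← theta − eta · ∇F(theta) (Equation 6.12)."""
    for _ in range(iterations):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Example: F(θ) = (θ_0 − 3)² + (θ_1 + 1)², with gradient (2(θ_0 − 3), 2(θ_1 + 1));
# starting from (0, 0), the iterates converge to the unique minimum at (3, −1).
```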

Gradient-based optimization in MapReduce. Gradient-based optimization algorithms can often be implemented effectively in MapReduce. Like EM, where the structure of the model determines the specifics of the realization, the details of the function being optimized determine how it should best be implemented, and not every function optimization problem will be a good fit for MapReduce. Nevertheless, MapReduce implementations of gradient-based optimization tend to have the following characteristics:

• Each optimization iteration is one MapReduce job.

• The objective should decompose linearly across training instances. This implies that the gradient also decomposes linearly, and therefore mappers can process input data in parallel. The values they emit are pairs ⟨F(θ), ∇F(θ)⟩, which are linear components of the objective and gradient.

• Evaluation of the function and its gradient is often computationally expensive because they require processing lots of data. This makes parallelization with MapReduce worthwhile.

• Whether more than one reducer can run in parallel depends on the specific optimization algorithm being used. Some, like the trivial algorithm of Equation 6.12, treat the dimensions of θ independently, whereas many are sensitive to global properties of ∇F(θ). In the latter case, parallelization across multiple reducers is non-trivial.

• Reducer(s) sum the component objective/gradient pairs, compute the total objective and gradient, run the optimization algorithm, and emit θ^(i+1).

• Many optimization algorithms are stateful and must persist their state between optimization iterations. This state may either be emitted together with θ^(i+1) or written to the distributed file system as a side effect of the reducer. Such external side effects must be handled carefully; refer to Section 2.2 for a discussion.
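These characteristics can be simulated on one machine: each mapper emits its instance's contribution to ⟨F, ∇F⟩ under a single key, and one reducer sums the contributions and applies Equation 6.12. The function names and single 'gradient' key below are our own illustration, not a fixed Hadoop API:

```python
def map_instance(instance, theta, loss_and_grad):
    """Emit this instance's linear contribution to the objective and gradient."""
    f, g = loss_and_grad(instance, theta)
    return ('gradient', (f, g))

def reduce_gradients(values, theta, eta):
    """Sum the per-instance contributions, then take one step of Equation 6.12."""
    total_f = sum(f for f, _ in values)
    total_g = [sum(gs) for gs in zip(*(g for _, g in values))]
    return total_f, [t - eta * gi for t, gi in zip(theta, total_g)]
```

In a real Hadoop job the mapper outputs would be shuffled to the reducer by key, and the reducer would write θ^(i+1) to HDFS for the next iteration's mappers to read, mirroring the HMM training setup above.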

Parameter learning for log-linear models. Gradient-based optimization tech-

niques can be quite eﬀectively used to learn the parameters of probabilistic models with

a log-linear parameterization [100]. While a comprehensive introduction to these models

is beyond the scope of this book, such models are used extensively in text processing

applications, and their training using gradient-based optimization, which may otherwise

be computationally expensive, can be implemented eﬀectively using MapReduce. We

therefore include a brief summary.

Log-linear models are particularly useful for supervised learning (unlike the unsupervised models learned with EM), where an annotation y ∈ Y is available for every x ∈ X in the training data. In this case, it is possible to directly model the conditional distribution of label given input:

\Pr(y \mid x; \theta) = \frac{\exp \sum_i \theta_i H_i(x, y)}{\sum_{y'} \exp \sum_i \theta_i H_i(x, y')}

In this expression, H_i are real-valued functions sensitive to features of the input and labeling. The parameters of the model are selected so as to minimize the negative conditional log likelihood of a set of training instances {⟨x, y⟩_1, ⟨x, y⟩_2, . . .}, which we assume to be i.i.d.:

F(\theta) = \sum_{\langle x, y \rangle} -\log \Pr(y \mid x; \theta) \qquad (6.13)

\theta^* = \arg\min_\theta F(\theta) \qquad (6.14)
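For concreteness, a minimal Python sketch of this parameterization (the label set, feature functions, and training instance below are invented for illustration):

```python
import math

# Toy log-linear model: Pr(y | x; theta) is proportional to
# exp(sum_i theta_i * H_i(x, y)). Labels and features are invented.
LABELS = ["pos", "neg"]

def features(x, y):
    """H_i(x, y): two hypothetical real-valued feature functions over a
    bag-of-words input x and a label y."""
    return [1.0 if ("good" in x and y == "pos") else 0.0,
            1.0 if ("bad" in x and y == "neg") else 0.0]

def prob(y, x, theta):
    """Conditional probability under the log-linear parameterization:
    exponentiated score of (x, y), normalized over all labels."""
    def score(label):
        return math.exp(sum(t * h for t, h in zip(theta, features(x, label))))
    return score(y) / sum(score(yp) for yp in LABELS)

def neg_log_likelihood(data, theta):
    """F(theta) of Equation 6.13: the sum of -log Pr(y | x; theta)."""
    return sum(-math.log(prob(y, x, theta)) for x, y in data)
```

Training selects θ to minimize `neg_log_likelihood`, exactly as in Equation 6.14.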


As Equation 6.13 makes clear, the objective decomposes linearly across training instances, meaning it can be optimized quite well in MapReduce. The derivative of F with respect to θ_i can be shown to have the following form [141]:^16

\frac{\partial F}{\partial \theta_i}(\theta) = \sum_{\langle x, y \rangle} \left( E_{\Pr(y' \mid x; \theta)}\left[ H_i(x, y') \right] - H_i(x, y) \right)
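A brute-force sketch of this gradient for a toy model, assuming two invented feature functions, a binary label set, and invented data (when the label space is structured, the explicit sum over labels below is replaced by dynamic programming, as discussed next):

```python
import math

# Gradient of the negative conditional log likelihood for a toy log-linear
# model: dF/dtheta_i sums, over instances <x, y>, the expected feature value
# E[H_i(x, y')] minus the observed feature value H_i(x, y).
LABELS = [0, 1]

def feats(x, y):
    """H_i(x, y): two toy feature functions (invented for illustration)."""
    return [x if y == 1 else 0.0, 1.0 if y == 1 else 0.0]

def probs(x, theta):
    """Pr(y | x; theta) for every label, by direct normalization."""
    scores = [math.exp(sum(t * h for t, h in zip(theta, feats(x, y))))
              for y in LABELS]
    z = sum(scores)
    return [s / z for s in scores]

def gradient(data, theta):
    """Accumulate expected-minus-observed feature values per instance."""
    grad = [0.0] * len(theta)
    for x, y in data:
        p = probs(x, theta)
        for i in range(len(theta)):
            expected = sum(p[yp] * feats(x, yp)[i] for yp in LABELS)
            grad[i] += expected - feats(x, y)[i]
    return grad
```

A finite-difference check against F(θ) confirms the expected-minus-observed form.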

The expectation in the second part of the gradient's expression can be computed using a variety of techniques. However, as we saw with EM, when very large event spaces are being modeled, as is the case with sequence labeling, enumerating all possible values y can become computationally intractable. And, as was the case with HMMs, independence assumptions can be used to enable efficient computation using dynamic programming. In fact, the forward-backward algorithm introduced in Section 6.2.4 can, with only minimal modification, be used to compute the expectation E_{Pr(y′|x;θ)}[H_i(x, y′)] needed in CRF sequence models, as long as the feature functions respect the same Markov assumption that is made in HMMs. For more information about inference in CRFs using the forward-backward algorithm, we refer the reader to Sha et al. [140].

As we saw in the previous section, MapReduce oﬀers signiﬁcant speedups when

training iterations require running the forward-backward algorithm. The same pattern

of results holds when training linear CRFs.
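To make the intractability concrete, the following sketch computes such an expectation by explicit enumeration for a tiny sequence model (the single transition-counting feature and scoring function are invented for illustration). Its cost grows as |L|^n in the sequence length n, which is exactly the blow-up that the forward-backward algorithm avoids:

```python
import math
from itertools import product

LABELS = ["A", "B"]

def count_equal_pairs(labeling):
    """The toy feature H(x, y): number of adjacent equal labels."""
    return sum(1.0 for a, b in zip(labeling, labeling[1:]) if a == b)

def expected_feature(x, theta):
    """E_{Pr(y'|x;theta)}[H(x, y')] by explicit enumeration of all |L|^n
    labelings -- tractable only for tiny n."""
    n = len(x)
    weights = {labeling: math.exp(theta * count_equal_pairs(labeling))
               for labeling in product(LABELS, repeat=n)}
    z = sum(weights.values())
    return sum(w * count_equal_pairs(l) for l, w in weights.items()) / z
```

With θ = 0 the distribution over labelings is uniform, and with large positive θ the probability mass concentrates on the all-equal labelings, pushing the expectation toward its maximum.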

6.6 SUMMARY AND ADDITIONAL READINGS

This chapter focused on learning the parameters of statistical models from data, using

expectation maximization algorithms or gradient-based optimization techniques. We

focused especially on EM algorithms for three reasons. First, these algorithms can be

expressed naturally in the MapReduce programming model, making them a good exam-

ple of how to express a commonly-used algorithm in this new framework. Second, many

models, such as the widely-used hidden Markov model (HMM) trained using EM, make

independence assumptions that permit a high degree of parallelism in both the E- and

M-steps. Thus, they are particularly well-positioned to take advantage of large clusters.

Finally, EM algorithms are unsupervised learning algorithms, which means that they

have access to far more training data than comparable supervised approaches. This is

quite important. In Chapter 1, when we hailed large data as the “rising tide that lifts

all boats” to yield more eﬀective algorithms, we were mostly referring to unsupervised

approaches, given that the manual eﬀort required to generate annotated data remains

a bottleneck in many supervised approaches. Data acquisition for unsupervised algo-

rithms is often as simple as crawling speciﬁc web sources, given the enormous quantities

of data available “for free”. This, combined with the ability of MapReduce to process

large datasets in parallel, provides researchers with an effective strategy for developing increasingly effective applications.

^16 This assumes that when ⟨x, y⟩ is present the model is fully observed (i.e., there are no additional latent variables).

Since EM algorithms are relatively computationally expensive even for small amounts of data, we were led to consider how related supervised learning models (which typically have much less training data available) can also be implemented in MapReduce. The discussion demonstrates that not only does MapReduce provide a means

for coping with ever-increasing amounts of data, but it is also useful for parallelizing

expensive computations. Although MapReduce has been designed with mostly data-

intensive applications in mind, the ability to leverage clusters of commodity hardware

to parallelize computationally-expensive algorithms is an important use case.

Additional Readings. Because of its ability to leverage large amounts of training

data, machine learning is an attractive problem for MapReduce and an area of active

research. Chu et al. [37] presented general formulations of a variety of machine learning

problems, focusing on a normal form for expressing a variety of machine learning algo-

rithms in MapReduce. The Apache Mahout project is an open-source implementation

of these and other learning algorithms,

17

and it is also the subject of a forthcoming

book [116]. Issues associated with a MapReduce implementation of latent Dirichlet

allocation (LDA), which is another important unsupervised learning technique, with

certain similarities to EM, have been explored by Wang et al. [151].

17

http://lucene.apache.org/mahout/


CHAPTER 7

Closing Remarks

The need to process enormous quantities of data has never been greater. Not only

are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there

is consensus that great value lies buried in them, waiting to be unlocked by the right

computational tools. In the commercial sphere, business intelligence—driven by the

ability to gather data from a dizzying array of sources—promises to help organizations

better understand their customers and the marketplace, hopefully leading to better

business decisions and competitive advantages. For engineers building information pro-

cessing tools and applications, larger datasets lead to more eﬀective algorithms for a

wide range of tasks, from machine translation to spam detection. In the natural and

physical sciences, the ability to analyze massive amounts of data may provide the key

to unlocking the secrets of the cosmos or the mysteries of life.

In the preceding chapters, we have shown how MapReduce can be exploited to

solve a variety of problems related to text processing at scales that would have been

unthinkable a few years ago. However, no tool—no matter how powerful or ﬂexible—

can be perfectly adapted to every task, so it is only fair to discuss the limitations of

the MapReduce programming model and survey alternatives. Section 7.1 covers online

learning algorithms and Monte Carlo simulations, which are examples of algorithms

that require maintaining global state. As we have seen, this is diﬃcult to accomplish

in MapReduce. Section 7.2 discusses alternative programming models, and the book

concludes in Section 7.3.

7.1 LIMITATIONS OF MAPREDUCE

As we have seen throughout this book, solutions to many interesting problems in text

processing do not require global synchronization. As a result, they can be expressed

naturally in MapReduce, since map and reduce tasks run independently and in iso-

lation. However, there are many examples of algorithms that depend crucially on the

existence of shared global state during processing, making them diﬃcult to implement

in MapReduce (since the single opportunity for global synchronization in MapReduce

is the barrier between the map and reduce phases of processing).

The ﬁrst example is online learning. Recall from Chapter 6 the concept of learning

as the setting of parameters in a statistical model. Both EM and the gradient-based

learning algorithms we described are instances of what are known as batch learning

algorithms. This simply means that the full “batch” of training data is processed before

any updates to the model parameters are made. On one hand, this is quite reasonable:


updates are not made until the full evidence of the training data has been weighed

against the model. An earlier update would seem, in some sense, to be hasty. However,

it is generally the case that more frequent updates can lead to more rapid convergence

of the model (in terms of number of training instances processed), even if those updates

are made by considering less data [24]. Thinking in terms of gradient optimization (see

Section 6.5), online learning algorithms can be understood as computing an approx-

imation of the true gradient, using only a few training instances. Although only an

approximation, the gradient computed from a small subset of training instances is of-

ten quite reasonable, and the aggregate behavior of multiple updates tends to even out

errors that are made. In the limit, updates can be made after every training instance.
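The batch/online contrast can be sketched as follows (the one-parameter squared-error model and step size are illustrative assumptions): a batch learner applies one update per full pass over the data, while an online learner updates after every instance.

```python
def online_pass(data, theta, eta=0.1):
    """Online (stochastic) learning: update after *every* training instance.
    Toy objective: squared error of a one-parameter linear model y ~ theta*x."""
    for x, y in data:
        grad = (theta * x - y) * x      # gradient from a single instance
        theta -= eta * grad             # immediate update to shared state
    return theta

def batch_pass(data, theta, eta=0.1):
    """Batch learning: accumulate the full gradient, then update once."""
    grad = sum((theta * x - y) * x for x, y in data)
    return theta - eta * grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy data: y = 2x exactly
theta_online = online_pass(data, 0.0)
theta_batch = batch_pass(data, 0.0)
```

On this toy data a single online pass lands closer to the optimum (θ = 2) than a single batch update, illustrating the faster per-pass convergence described above.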

Unfortunately, implementing online learning algorithms in MapReduce is problem-

atic. The model parameters in a learning algorithm can be viewed as shared global state,

which must be updated as the model is evaluated against training data. All processes

performing the evaluation (presumably the mappers) must have access to this state.

In a batch learner, where updates occur in one or more reducers (or, alternatively, in

the driver code), synchronization of this resource is enforced by the MapReduce frame-

work. However, with online learning, these updates must occur after processing smaller

numbers of instances. This means that the framework must be altered to support faster

processing of smaller datasets, which goes against the design choices of most existing

MapReduce implementations. Since MapReduce was speciﬁcally optimized for batch

operations over large amounts of data, such a style of computation would likely result

in ineﬃcient use of resources. In Hadoop, for example, map and reduce tasks have con-

siderable startup costs. This is acceptable because in most circumstances, this cost is

amortized over the processing of many key-value pairs. However, for small datasets,

these high startup costs become intolerable. An alternative is to abandon shared global

state and run independent instances of the training algorithm in parallel (on diﬀerent

portions of the data). A ﬁnal solution is then arrived at by merging individual results.

Experiments, however, show that the merged solution is inferior to the output of running

the training algorithm on the entire dataset [52].
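The split-and-merge strategy can be sketched as parameter averaging (uniform averaging is one simple merging rule, assumed here for illustration; the experiments cited above study more principled combinations):

```python
def train_shard(shard, theta=0.0, eta=0.1, passes=20):
    """Run a (toy) online learner independently on one data shard:
    one-parameter linear model y ~ theta*x with squared error."""
    for _ in range(passes):
        for x, y in shard:
            theta -= eta * (theta * x - y) * x
    return theta

def merge(thetas):
    """Merge independent results by uniform parameter averaging."""
    return sum(thetas) / len(thetas)

shards = [[(1.0, 2.0)], [(2.0, 4.0)]]          # disjoint portions of the data
merged = merge([train_shard(s) for s in shards])
```

Each shard is consistent with θ = 2 here, so the merged result lands near the optimum; with heterogeneous shards, the averaged solution is generally inferior to training on the full dataset, as the cited experiments show.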

A related diﬃculty occurs when running what are called Monte Carlo simula-

tions, which are used to perform inference in probabilistic models where evaluating or

representing the model exactly is impossible. The basic idea is quite simple: samples

are drawn from the random variables in the model to simulate its behavior, and then

simple frequency statistics are computed over the samples. This sort of inference is par-

ticularly useful when dealing with so-called nonparametric models, which are models

whose structure is not speciﬁed in advance, but is rather inferred from training data.

For an illustration, imagine learning a hidden Markov model, but inferring the num-

ber of states, rather than having them speciﬁed. Being able to parallelize Monte Carlo

simulations would be tremendously valuable, particularly for unsupervised learning applications where they have been found to be far more effective than EM-based learning

(which requires specifying the model). Although recent work [10] has shown that the

delays in synchronizing sample statistics due to parallel implementations do not neces-

sarily damage the inference, MapReduce oﬀers no natural mechanism for managing the

global shared state that would be required for such an implementation.
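The basic pattern, in minimal form (the two-coin process below is a stand-in for a real probabilistic model): draw samples from the model's random variables, then compute frequency statistics over the samples.

```python
import random

def monte_carlo_estimate(n_samples, seed=0):
    """Estimate Pr(at least one head in two fair coin flips) by sampling:
    simulate the process n_samples times and count how often it happens."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.random() < 0.5 or rng.random() < 0.5)
    return hits / n_samples
```

The exact answer is 0.75; the frequency estimate converges to it as the number of samples grows. In a parallel implementation, the samplers would need to share statistics, which is precisely the global state MapReduce lacks.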

The problem of global state is suﬃciently pervasive that there has been substan-

tial work on solutions. One approach is to build a distributed datastore capable of

maintaining the global state. However, such a system would need to be highly scal-

able to be used in conjunction with MapReduce. Google’s BigTable [34], which is a

sparse, distributed, persistent multidimensional sorted map built on top of GFS, ﬁts

the bill, and has been used in exactly this manner. Amazon’s Dynamo [48], which is a

distributed key-value store (with a very diﬀerent architecture), might also be useful in

this respect, although it wasn’t originally designed with such an application in mind.

Unfortunately, it is unclear if the open-source implementations of these two systems

(HBase and Cassandra, respectively) are suﬃciently mature to handle the low-latency

and high-throughput demands of maintaining global state in the context of massively

distributed processing (but recent benchmarks are encouraging [40]).

7.2 ALTERNATIVE COMPUTING PARADIGMS

Streaming algorithms [3] represent an alternative programming model for dealing with

large volumes of data with limited computational and storage resources. This model

assumes that data are presented to the algorithm as one or more streams of inputs that

are processed in order, and only once. The model is agnostic with respect to the source

of these streams, which could be ﬁles in a distributed ﬁle system, but more interestingly,

data from an “external” source or some other data gathering device. Stream processing

is very attractive for working with time-series data (news feeds, tweets, sensor readings,

etc.), which is diﬃcult in MapReduce (once again, given its batch-oriented design).

Furthermore, since streaming algorithms are comparatively simple (because there is

only so much that can be done with a particular training instance), they can often take

advantage of modern GPUs, which have a large number of (relatively simple) functional

units [104]. In the context of text processing, streaming algorithms have been applied

to language modeling [90], translation modeling [89], and detecting the first mention of a news event in a stream [121].
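Reservoir sampling is a classic instance of this model: it processes a stream exactly once, in order, and uses memory independent of the stream's length. A generic sketch (the stream contents here are arbitrary):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown
    length, examining each item exactly once (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The algorithm never revisits an item and never stores more than k of them, which is what makes it suitable for unbounded inputs such as feeds or sensor readings.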

The idea of stream processing has been generalized in the Dryad framework as

arbitrary dataﬂow graphs [75, 159]. A Dryad job is a directed acyclic graph where each

vertex represents developer-speciﬁed computations and edges represent data channels

that capture dependencies. The dataﬂow graph is a logical computation graph that is

automatically mapped onto physical resources by the framework. At runtime, channels


are used to transport partial results between vertices, and can be realized using ﬁles,

TCP pipes, or shared memory.

Another system worth mentioning is Pregel [98], which implements a program-

ming model inspired by Valiant’s Bulk Synchronous Parallel (BSP) model [148]. Pregel

was speciﬁcally designed for large-scale graph algorithms, but unfortunately there are

few published details at present. However, a longer description is anticipated in a forth-

coming paper [99].

What is the signiﬁcance of these developments? The power of MapReduce derives

from providing an abstraction that allows developers to harness the power of large

clusters. As anyone who has taken an introductory computer science course would know,

abstractions manage complexity by hiding details and presenting well-deﬁned behaviors

to users of those abstractions. This process makes certain tasks easier, but others more

diﬃcult, if not impossible. MapReduce is certainly no exception to this generalization,

and one of the goals of this book has been to give the reader a better understanding of

what’s easy to do in MapReduce and what its limitations are. But of course, this raises

the obvious question: What other abstractions are available in the massively-distributed

datacenter environment? Are there more appropriate computational models that would

allow us to tackle classes of problems that are diﬃcult for MapReduce?

Dryad and Pregel are alternative answers to these questions. They share in pro-

viding an abstraction for large-scale distributed computations, separating the what from

the how of computation and isolating the developer from the details of concurrent pro-

gramming. They diﬀer, however, in how distributed computations are conceptualized:

functional-style programming, arbitrary dataﬂows, or BSP. These conceptions represent

diﬀerent tradeoﬀs between simplicity and expressivity: for example, Dryad is more ﬂex-

ible than MapReduce, and in fact, MapReduce can be trivially implemented in Dryad.

However, it remains unclear, at least at present, which approach is more appropriate

for diﬀerent classes of applications. Looking forward, we can certainly expect the de-

velopment of new models and a better understanding of existing ones. MapReduce is

not the end, and perhaps not even the best. It is merely the ﬁrst of many approaches

to harness large-scale distributed computing resources.

Even within the Hadoop/MapReduce ecosystem, we have already observed the

development of alternative approaches for expressing distributed computations. For

example, there is a proposal to add a third merge phase after map and reduce to

better support relational operations [36]. Pig [114], which was inspired by Google’s

Sawzall [122], can be described as a data analytics platform that provides a lightweight

scripting language for manipulating large datasets. Although Pig scripts (in a language

called Pig Latin) are ultimately converted into Hadoop jobs by Pig’s execution engine,

constructs in the language allow developers to specify data transformations (ﬁltering,

joining, grouping, etc.) at a much higher level. Similarly, Hive [68], another open-source


project, provides an abstraction on top of Hadoop that allows users to issue SQL queries

against large relational datasets stored in HDFS. Hive queries (in HiveQL) “compile

down” to Hadoop jobs by the Hive query engine. Therefore, the system provides a data

analysis tool for users who are already comfortable with relational databases, while

simultaneously taking advantage of Hadoop’s data processing capabilities.

7.3 MAPREDUCE AND BEYOND

The capabilities necessary to tackle large-data problems are already within reach by

many and will continue to become more accessible over time. By scaling “out” with

commodity servers, we have been able to economically bring large clusters of machines

to bear on problems of interest. But this has only been possible with corresponding

innovations in software and how computations are organized on a massive scale. Important ideas include: moving processing to the data, as opposed to the other way around; and emphasizing throughput over latency for batch tasks by scanning data sequentially and avoiding random seeks. Most important of all, however, is the development of

new abstractions that hide system-level details from the application developer. These

abstractions are at the level of entire datacenters, and provide a model with which programmers can reason about computations at a massive scale without being distracted

by ﬁne-grained concurrency management, fault tolerance, error recovery, and a host of

other issues in distributed computing. This, in turn, paves the way for innovations in

scalable algorithms that can run on petabyte-scale datasets.

None of these points are new or particularly earth-shattering—computer scientists

have known about these principles for decades. However, MapReduce is unique in that,

for the ﬁrst time, all these ideas came together and were demonstrated on practical

problems at scales unseen before, both in terms of computational resources and the

impact on the daily lives of millions. The engineers at Google deserve a tremendous

amount of credit for that, and also for sharing their insights with the rest of the world.

Furthermore, the engineers and executives at Yahoo deserve a lot of credit for starting

the open-source Hadoop project, which has made MapReduce accessible to everyone

and created the vibrant software ecosystem that flourishes today. Add to that the advent of utility computing, which eliminates the capital investments associated with cluster infrastructure, and large-data processing capabilities are now available “to the masses” with a relatively low barrier to entry.

The golden age of massively distributed computing is ﬁnally upon us.


Bibliography

[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and

Alexander Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS

technologies for analytical workloads. In Proceedings of the 35th International

Conference on Very Large Data Bases (VLDB 2009), pages 922–933, Lyon, France,

2009.

[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002.

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approxi-

mating the frequency moments. In Proceedings of the 28th Annual ACM Sympo-

sium on Theory of Computing (STOC ’96), pages 20–29, Philadelphia, Pennsyl-

vania, 1996.

[4] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Heller-

stein, and Russell C. Sears. BOOM: Data-centric programming in the datacenter.

Technical Report UCB/EECS-2009-98, Electrical Engineering and Computer Sci-

ences, University of California at Berkeley, 2009.

[5] Gene Amdahl. Validity of the single processor approach to achieving large-scale

computing capabilities. In Proceedings of the AFIPS Spring Joint Computer Con-

ference, pages 483–485, 1967.

[6] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu

Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do

we really need to reinvent the storage stack? In Proceedings of the 2009 Workshop

on Hot Topics in Cloud Computing (HotCloud 09), San Diego, California, 2009.

[7] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli,

and Randolph Wang. Serverless network ﬁle systems. In Proceedings of the 15th

ACM Symposium on Operating Systems Principles (SOSP 1995), pages 109–126,

Copper Mountain Resort, Colorado, 1995.

[8] Vo Ngoc Anh and Alistair Moﬀat. Inverted index compression using word-aligned

binary codes. Information Retrieval, 8(1):151–166, 2005.

[9] Michael Armbrust, Armando Fox, Rean Griﬃth, Anthony D. Joseph, Randy H.

Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion


Stoica, and Matei Zaharia. Above the clouds: A Berkeley view of cloud comput-

ing. Technical Report UCB/EECS-2009-28, Electrical Engineering and Computer

Sciences, University of California at Berkeley, 2009.

[10] Arthur Asuncion, Padhraic Smyth, and Max Welling. Asynchronous distributed

learning of topic models. In Advances in Neural Information Processing Systems

21 (NIPS 2008), pages 81–88, Vancouver, British Columbia, Canada, 2008.

[11] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and

Fabrizio Silvestri. Challenges on distributed web retrieval. In Proceedings of the

IEEE 23rd International Conference on Data Engineering (ICDE 2007), pages

6–20, Istanbul, Turkey, 2007.

[12] Ricardo Baeza-Yates, Carlos Castillo, and Vicente López. PageRank increase un-

der diﬀerent collusion topologies. In Proceedings of the First International Work-

shop on Adversarial Information Retrieval on the Web (AIRWeb 2005), pages

17–24, Chiba, Japan, 2005.

[13] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vas-

silis Plachouras, and Fabrizio Silvestri. The impact of caching on search engines.

In Proceedings of the 30th Annual International ACM SIGIR Conference on Re-

search and Development in Information Retrieval (SIGIR 2007), pages 183–190,

Amsterdam, The Netherlands, 2007.

[14] Michele Banko and Eric Brill. Scaling to very very large corpora for natural

language disambiguation. In Proceedings of the 39th Annual Meeting of the Asso-

ciation for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France,

2001.

[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho,

Rolf Neugebauer, Ian Pratt, and Andrew Warﬁeld. Xen and the art of virtualiza-

tion. In Proceedings of the 19th ACM Symposium on Operating Systems Principles

(SOSP 2003), pages 164–177, Bolton Landing, New York, 2003.

[16] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28, 2003.

[17] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, 2007.

[18] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009.


[19] Jacek Becla, Andrew Hanushevsky, Sergei Nikolaev, Ghaleb Abdulla, Alex Szalay,

Maria Nieto-Santisteban, Ani Thakar, and Jim Gray. Designing a multi-petabyte

database for LSST. SLAC Publications SLAC-PUB-12292, Stanford Linear Ac-

celerator Center, May 2006.

[20] Jacek Becla and Daniel L. Wang. Lessons learned from managing a petabyte.

In Proceedings of the Second Biennial Conference on Innovative Data Systems

Research (CIDR 2005), Asilomar, California, 2005.

[21] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science,

323(5919):1297–1298, 2009.

[22] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM

Transactions on Internet Technology, 5(1):92–128, 2005.

[23] Jorge Luis Borges. Collected Fictions (translated by Andrew Hurley). Penguin,

1999.

[24] Léon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg,

editors, Advanced Lectures on Machine Learning, Lecture Notes in Artiﬁcial In-

telligence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.

[25] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeﬀrey Dean.

Large language models in machine translation. In Proceedings of the 2007 Joint

Conference on Empirical Methods in Natural Language Processing and Computa-

tional Natural Language Learning, pages 858–867, Prague, Czech Republic, 2007.

[26] Thorsten Brants and Peng Xu. Distributed Language Models. Morgan & Claypool

Publishers, 2010.

[27] Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. Data-

intensive question answering. In Proceedings of the Tenth Text REtrieval Confer-

ence (TREC 2001), pages 393–400, Gaithersburg, Maryland, 2001.

[28] Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering,

Anniversary Edition. Addison-Wesley, Reading, Massachusetts, 1995.

[29] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L.

Mercer. The mathematics of statistical machine translation: Parameter estima-

tion. Computational Linguistics, 19(2):263–311, 1993.

[30] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information

Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge,

Massachusetts, 2010.


[31] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona

Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality

for delivering computing as the 5th utility. Future Generation Computer Systems,

25(6):599–616, 2009.

[32] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping

to provide high I/O data rates. Computer Systems, 4(4):405–436, 1991.

[33] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. Find-

ings of the 2009 workshop on statistical machine translation. In Proceedings of the

Fourth Workshop on Statistical Machine Translation (StatMT ’09), pages 1–28,

Athens, Greece, 2009.

[34] Fay Chang, Jeﬀrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wal-

lach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber.

Bigtable: A distributed storage system for structured data. In Proceedings of the

7th Symposium on Operating System Design and Implementation (OSDI 2006),

pages 205–218, Seattle, Washington, 2006.

[35] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing tech-

niques for language modeling. In Proceedings of the 34th Annual Meeting of

the Association for Computational Linguistics (ACL 1996), pages 310–318, Santa

Cruz, California, 1996.

[36] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Map-

Reduce-Merge: Simpliﬁed relational data processing on large clusters. In Pro-

ceedings of the 2007 ACM SIGMOD International Conference on Management of

Data, pages 1029–1040, Beijing, China, 2007.

[37] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, An-

drew Ng, and Kunle Olukotun. Map-Reduce for machine learning on multicore.

In Advances in Neural Information Processing Systems 19 (NIPS 2006), pages

281–288, Vancouver, British Columbia, Canada, 2006.

[38] Kenneth W. Church and Patrick Hanks. Word association norms, mutual infor-

mation, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

[39] Jonathan Cohen. Graph twiddling in a MapReduce world. Computing in Science

and Engineering, 11(4):29–41, 2009.

[40] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Rus-

sell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the

First ACM Symposium on Cloud Computing (ACM SOCC 2010), Indianapolis,

Indiana, 2010.

7.3. MAPREDUCE AND BEYOND 161

[41] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to

Algorithms. MIT Press, Cambridge, Massachusetts, 1990.

[42] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Infor-

mation Retrieval in Practice. Addison-Wesley, Reading, Massachusetts, 2009.

[43] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser,

Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards

a realistic model of parallel computation. ACM SIGPLAN Notices, 28(7):1–12,

1993.

[44] Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical

part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural

Language Processing, pages 133–140, Trento, Italy, 1992.

[45] Jeﬀrey Dean and Sanjay Ghemawat. MapReduce: Simpliﬁed data processing on

large clusters. In Proceedings of the 6th Symposium on Operating System Design

and Implementation (OSDI 2004), pages 137–150, San Francisco, California, 2004.

[46] Jeﬀrey Dean and Sanjay Ghemawat. MapReduce: Simpliﬁed data processing on

large clusters. Communications of the ACM, 51(1):107–113, 2008.

[47] Jeﬀrey Dean and Sanjay Ghemawat. MapReduce: A ﬂexible data processing tool.

Communications of the ACM, 53(1):72–77, 2010.

[48] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,

Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and

Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Proceed-

ings of the 21st ACM Symposium on Operating Systems Principles (SOSP 2007),

pages 205–220, Stevenson, Washington, 2007.

[49] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood

from incomplete data via the EM algorithm. Journal of the Royal Statistical

Society. Series B (Methodological), 39(1):1–38, 1977.

[50] David J. DeWitt and Jim Gray. Parallel database systems: The future of high

performance database systems. Communications of the ACM, 35(6):85–98, 1992.

[51] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R.

Stonebraker, and David Wood. Implementation techniques for main memory

database systems. ACM SIGMOD Record, 14(2):1–8, 1984.

[52] Mark Dredze, Alex Kulesza, and Koby Crammer. Multi-domain learning by

conﬁdence-weighted parameter combination. Machine Learning, 79:123–149, 2010.


[53] Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. Web

question answering: Is more always better? In Proceedings of the 25th Annual

International ACM SIGIR Conference on Research and Development in Informa-

tion Retrieval (SIGIR 2002), pages 291–298, Tampere, Finland, 2002.

[54] Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, easy, and cheap:

Construction of statistical machine translation models with MapReduce. In Pro-

ceedings of the Third Workshop on Statistical Machine Translation at ACL 2008,

pages 199–207, Columbus, Ohio, 2008.

[55] John R. Firth. A synopsis of linguistic theory 1930–55. In Studies in Linguis-

tic Analysis, Special Volume of the Philological Society, pages 1–32. Blackwell,

Oxford, 1957.

[56] Qin Gao and Stephan Vogel. Training phrase-based machine translation models on

the cloud: Open source machine translation toolkit Chaski. The Prague Bulletin

of Mathematical Linguistics, 93:37–46, 2010.

[57] Sanjay Ghemawat, Howard Gobioﬀ, and Shun-Tak Leung. The Google File Sys-

tem. In Proceedings of the 19th ACM Symposium on Operating Systems Principles

(SOSP 2003), pages 29–43, Bolton Landing, New York, 2003.

[58] Seth Gilbert and Nancy Lynch. Brewer’s Conjecture and the feasibility of consis-

tent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59,

2002.

[59] Michelle Girvan and Mark E. J. Newman. Community structure in social and

biological networks. Proceedings of the National Academy of Science, 99(12):7821–

7826, 2002.

[60] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction

to Parallel Computing. Addison-Wesley, Reading, Massachusetts, 2003.

[61] Mark S. Granovetter. The strength of weak ties. The American Journal of Soci-

ology, 78(6):1360–1380, 1973.

[62] Mark S. Granovetter. The strength of weak ties: A network theory revisited.

Sociological Theory, 1:201–233, 1983.

[63] Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. In Proceedings

of the First International Workshop on Adversarial Information Retrieval on the

Web (AIRWeb 2005), pages 39–47, Chiba, Japan, 2005.

7.3. MAPREDUCE AND BEYOND 163

[64] Per Hage and Frank Harary. Island Networks: Communication, Kinship, and

Classiﬁcation Structures in Oceania. Cambridge University Press, Cambridge,

England, 1996.

[65] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable eﬀectiveness

of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

[66] James Hamilton. On designing and deploying Internet-scale services. In Proceed-

ings of the 21st Large Installation System Administration Conference (LISA ’07),

pages 233–244, Dallas, Texas, 2007.

[67] James Hamilton. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost,

low power servers for Internet-scale services. In Proceedings of the Fourth Bien-

nial Conference on Innovative Data Systems Research (CIDR 2009), Asilomar,

California, 2009.

[68] Jeﬀ Hammerbacher. Information platforms and the rise of the data scientist.

In Toby Segaran and Jeﬀ Hammerbacher, editors, Beautiful Data, pages 73–84.

O’Reilly, Sebastopol, California, 2009.

[69] Zelig S. Harris. Mathematical Structures of Language. Wiley, New York, 1968.

[70] Md. Raﬁul Hassan and Baikunth Nath. Stock market forecasting using hidden

Markov models: A new approach. In Proceedings of the 5th International Confer-

ence on Intelligent Systems Design and Applications (ISDA ’05), pages 192–196,

Wroclaw, Poland, 2005.

[71] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong

Wang. Mars: A MapReduce framework on graphics processors. In Proceedings

of the 17th International Conference on Parallel Architectures and Compilation

Techniques (PACT 2008), pages 260–269, Toronto, Ontario, Canada, 2008.

[72] Tony Hey, Stewart Tansley, and Kristin Tolle. The Fourth Paradigm: Data-

Intensive Scientiﬁc Discovery. Microsoft Research, Redmond, Washington, 2009.

[73] Tony Hey, Stewart Tansley, and Kristin Tolle. Jim Gray on eScience: A trans-

formed scientiﬁc method. In Tony Hey, Stewart Tansley, and Kristin Tolle, editors,

The Fourth Paradigm: Data-Intensive Scientiﬁc Discovery. Microsoft Research,

Redmond, Washington, 2009.

[74] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satya-

narayanan, Robert Sidebotham, and Michael West. Scale and performance in

a distributed ﬁle system. ACM Transactions on Computer Systems, 6(1):51–81,

1988.

[75] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad:

Distributed data-parallel programs from sequential building blocks. In Proceedings

of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

(EuroSys 2007), pages 59–72, Lisbon, Portugal, 2007.

[76] Adam Jacobs. The pathologies of big data. ACM Queue, 7(6), 2009.

[77] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading,

Massachusetts, 1992.

[78] Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, Massachusetts, 1997.

[79] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson,

Upper Saddle River, New Jersey, 2009.

[80] U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, and

Jure Leskovec. HADI: Fast diameter estimation and mining in massive graphs

with Hadoop. Technical Report CMU-ML-08-117, School of Computer Science,

Carnegie Mellon University, 2008.

[81] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. PEGASUS: A

peta-scale graph mining system—implementation and observations. In Proceed-

ings of the 2009 Ninth IEEE International Conference on Data Mining (ICDM

2009), pages 229–238, Miami, Florida, 2009.

[82] Howard Karloﬀ, Siddharth Suri, and Sergei Vassilvitskii. A model of computation

for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on

Discrete Algorithms (SODA 2010), Austin, Texas, 2010.

[83] Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. Cluster com-

puting for Web-scale data processing. In Proceedings of the 39th ACM Techni-

cal Symposium on Computer Science Education (SIGCSE 2008), pages 116–120,

Portland, Oregon, 2008.

[84] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal

of the ACM, 46(5):604–632, 1999.

[85] Philipp Koehn. Statistical Machine Translation. Cambridge University Press,

Cambridge, England, 2010.

[86] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based trans-

lation. In Proceedings of the 2003 Human Language Technology Conference of

the North American Chapter of the Association for Computational Linguistics

(HLT/NAACL 2003), pages 48–54, Edmonton, Alberta, Canada, 2003.

[87] John D. Laﬀerty, Andrew McCallum, and Fernando Pereira. Conditional random

ﬁelds: Probabilistic models for segmenting and labeling sequence data. In Pro-

ceedings of the Eighteenth International Conference on Machine Learning (ICML

’01), pages 282–289, San Francisco, California, 2001.

[88] Ronny Lempel and Shlomo Moran. SALSA: The Stochastic Approach for Link-

Structure Analysis. ACM Transactions on Information Systems, 19(2):131–160,

2001.

[89] Abby Levenberg, Chris Callison-Burch, and Miles Osborne. Stream-based transla-

tion models for statistical machine translation. In Proceedings of the 11th Annual

Conference of the North American Chapter of the Association for Computational

Linguistics (NAACL HLT 2010), Los Angeles, California, 2010.

[90] Abby Levenberg and Miles Osborne. Stream-based randomised language models

for SMT. In Proceedings of the 2009 Conference on Empirical Methods in Natural

Language Processing, pages 756–764, Singapore, 2009.

[91] Adam Leventhal. Triple-parity RAID and beyond. ACM Queue, 7(11), 2009.

[92] Jimmy Lin. An exploration of the principles underlying redundancy-based factoid

question answering. ACM Transactions on Information Systems, 27(2):1–55, 2007.

[93] Jimmy Lin. Exploring large-data issues in the curriculum: A case study with

MapReduce. In Proceedings of the Third Workshop on Issues in Teaching Com-

putational Linguistics (TeachCL-08) at ACL 2008, pages 54–61, Columbus, Ohio,

2008.

[94] Jimmy Lin. Scalable language processing algorithms for the masses: A case study

in computing word co-occurrence matrices with MapReduce. In Proceedings of the

2008 Conference on Empirical Methods in Natural Language Processing (EMNLP

2008), pages 419–428, Honolulu, Hawaii, 2008.

[95] Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-

latency, high-throughput access to static global resources within the Hadoop

framework. Technical Report HCIL-2009-01, University of Maryland, College

Park, Maryland, January 2009.

[96] Dong C. Liu and Jorge Nocedal. On the limited

memory BFGS method for large scale optimization. Mathematical Programming

B, 45(3):503–528, 1989.

[97] Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–

49, 2008.

[98] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan

Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale

graph processing. In Proceedings of the 28th ACM Symposium on Principles of

Distributed Computing (PODC 2009), page 6, Calgary, Alberta, Canada, 2009.

[99] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert,

Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-

scale graph processing. In Proceedings of the 2010 ACM SIGMOD International

Conference on Management of Data, Indianapolis, Indiana, 2010.

[100] Robert Malouf. A comparison of algorithms for maximum entropy parameter

estimation. In Proceedings of the Sixth Conference on Natural Language Learning

(CoNLL-2002), pages 49–55, Taipei, Taiwan, 2002.

[101] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, England,

2008.

[102] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural

Language Processing. MIT Press, Cambridge, Massachusetts, 1999.

[103] Elaine R. Mardis. The impact of next-generation sequencing technology on ge-

netics. Trends in Genetics, 24(3):133–141, 2008.

[104] Michael D. McCool. Scalable programming models for massively multicore pro-

cessors. Proceedings of the IEEE, 96(5):816–831, 2008.

[105] Marshall K. McKusick and Sean Quinlan. GFS: Evolution on fast-forward. ACM

Queue, 7(7), 2009.

[106] John Mellor-Crummey, David Whalley, and Ken Kennedy. Improving memory

hierarchy performance for irregular applications using data and computation re-

orderings. International Journal of Parallel Programming, 29(3):217–247, 2001.

[107] Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy. Building enriched

document representations using aggregated anchor text. In Proceedings of the

32nd Annual International ACM SIGIR Conference on Research and Development

in Information Retrieval (SIGIR 2009), pages 219–226, 2009.

[108] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden Markov model

information retrieval system. In Proceedings of the 22nd Annual International

ACM SIGIR Conference on Research and Development in Information Retrieval

(SIGIR 1999), pages 214–221, Berkeley, California, 1999.

[109] Alistair Moﬀat, William Webber, and Justin Zobel. Load balancing for term-

distributed parallel retrieval. In Proceedings of the 29th Annual International

ACM SIGIR Conference on Research and Development in Information Retrieval

(SIGIR 2006), pages 348–355, Seattle, Washington, 2006.

[110] Kamal Nigam, John Laﬀerty, and Andrew McCallum. Using maximum entropy

for text classiﬁcation. In Proceedings of the IJCAI-99 Workshop on Machine

Learning for Information Filtering, pages 61–67, Stockholm, Sweden, 1999.

[111] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman,

Lamia Youseﬀ, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-

computing system. In Proceedings of the 9th IEEE/ACM International Sym-

posium on Cluster Computing and the Grid, pages 124–131, Washington, D.C.,

2009.

[112] Franz J. Och and Hermann Ney. A systematic comparison of various statistical

alignment models. Computational Linguistics, 29(1):19–51, 2003.

[113] Christopher Olston and Marc Najork. Web crawling. Foundations and Trends in

Information Retrieval, 4(3):175–246, 2010.

[114] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and An-

drew Tomkins. Pig Latin: A not-so-foreign language for data processing. In Pro-

ceedings of the 2008 ACM SIGMOD International Conference on Management of

Data, pages 1099–1110, Vancouver, British Columbia, Canada, 2008.

[115] Kunle Olukotun and Lance Hammond. The future of microprocessors. ACM

Queue, 3(7):27–34, 2005.

[116] Sean Owen and Robin Anil. Mahout in Action. Manning Publications Co., Green-

wich, Connecticut, 2010.

[117] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The Page-

Rank citation ranking: Bringing order to the Web. Stanford Digital Library Work-

ing Paper SIDL-WP-1999-0120, Stanford University, 1999.

[118] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations

and Trends in Information Retrieval, 2(1–2):1–135, 2008.

[119] David A. Patterson. The data center is the computer. Communications of the

ACM, 51(1):105, 2008.

[120] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. De-

Witt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to

large-scale data analysis. In Proceedings of the 35th ACM SIGMOD International

Conference on Management of Data, pages 165–178, Providence, Rhode Island,

2009.

[121] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming ﬁrst story detec-

tion with application to Twitter. In Proceedings of the 11th Annual Conference

of the North American Chapter of the Association for Computational Linguistics

(NAACL HLT 2010), Los Angeles, California, 2010.

[122] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the

data: Parallel analysis with Sawzall. Scientiﬁc Programming Journal, 13(4):277–

298, 2005.

[123] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends

in a large disk drive population. In Proceedings of the 5th USENIX Conference

on File and Storage Technologies (FAST 2007), San Jose, California, 2007.

[124] Xiaoguang Qi and Brian D. Davison. Web page classiﬁcation: Features and algo-

rithms. ACM Computing Surveys, 41(2), 2009.

[125] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected appli-

cations in speech recognition. In Readings in Speech Recognition, pages 267–296.

Morgan Kaufmann Publishers, San Francisco, California, 1990.

[126] M. Mustafa Raﬁque, Benjamin Rose, Ali R. Butt, and Dimitrios S. Nikolopou-

los. Supporting MapReduce on large-scale asymmetric multi-core clusters. ACM

Operating Systems Review, 43(2):25–34, 2009.

[127] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Chris-

tos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems.

In Proceedings of the 13th International Symposium on High-Performance Com-

puter Architecture (HPCA 2007), pages 205–218, Phoenix, Arizona, 2007.

[128] Delip Rao and David Yarowsky. Ranking and semi-supervised classiﬁcation

on large scale graphs using Map-Reduce. In Proceedings of the ACL/IJCNLP

2009 Workshop on Graph-Based Methods for Natural Language Processing

(TextGraphs-4), Singapore, 2009.

[129] Michael A. Rappa. The utility business model and the future of computing ser-

vices. IBM Systems Journal, 43(1):32–42, 2004.

[130] Sheldon M. Ross. Stochastic Processes. Wiley, New York, 1996.

[131] Thomas Sandholm and Kevin Lai. MapReduce optimization using regulated dy-

namic prioritization. In Proceedings of the Eleventh International Joint Confer-

ence on Measurement and Modeling of Computer Systems (SIGMETRICS ’09),

pages 299–310, Seattle, Washington, 2009.

[132] Michael Schatz. High Performance Computing for DNA Sequence Alignment and

Assembly. PhD thesis, University of Maryland, College Park, 2010.

[133] Frank Schmuck and Roger Haskin. GPFS: A shared-disk ﬁle system for large

computing clusters. In Proceedings of the First USENIX Conference on File and

Storage Technologies, pages 231–244, Monterey, California, 2002.

[134] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four

parallel join algorithms in a shared-nothing multiprocessor environment. In Pro-

ceedings of the 1989 ACM SIGMOD International Conference on Management of

Data, pages 110–121, Portland, Oregon, 1989.

[135] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors

in the wild: A large-scale ﬁeld study. In Proceedings of the Eleventh Interna-

tional Joint Conference on Measurement and Modeling of Computer Systems

(SIGMETRICS ’09), pages 193–204, Seattle, Washington, 2009.

[136] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123, 1998.

[137] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two

applications to information retrieval. Information Processing and Management,

33(3):307–318, 1998.

[138] Satoshi Sekine and Elisabete Ranchhod. Named Entities: Recognition, Classiﬁca-

tion and Use. John Benjamins, Amsterdam, The Netherlands, 2009.

[139] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden

Markov model structure for information extraction. In Proceedings of the AAAI-99

Workshop on Machine Learning for Information Extraction, pages 37–42, Or-

lando, Florida, 1999.

[140] Fei Sha and Fernando Pereira. Shallow parsing with conditional random

ﬁelds. In Proceedings of the 2003 Human Language Technology Conference of

the North American Chapter of the Association for Computational Linguistics

(HLT/NAACL 2003), pages 134–141, Edmonton, Alberta, Canada, 2003.

[141] Noah Smith. Log-linear models. http://www.cs.cmu.edu/~nasmith/papers/smith.tut04.pdf, 2004.

[142] Christopher Southan and Graham Cameron. Beyond the tsunami: Developing the

infrastructure to deal with life sciences data. In Tony Hey, Stewart Tansley, and

Kristin Tolle, editors, The Fourth Paradigm: Data-Intensive Scientiﬁc Discovery.

Microsoft Research, Redmond, Washington, 2009.

[143] Mario Stanke and Stephan Waack. Gene prediction with a hidden Markov model

and a new intron submodel. Bioinformatics, 19 Suppl 2:ii215–225, October 2003.

[144] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson,

Andrew Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: Friends

or foes? Communications of the ACM, 53(1):64–71, 2010.

[145] Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, and

Robert J. Brunner. Designing and mining multi-terabyte astronomy archives:

The Sloan Digital Sky Survey. SIGMOD Record, 29(2):451–462, 2000.

[146] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. Data-intensive ﬁle sys-

tems for Internet services: A rose by any other name. . . . Technical Report CMU-

PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, 2008.

[147] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A

scalable distributed ﬁle system. In Proceedings of the 16th ACM Symposium on

Operating Systems Principles (SOSP 1997), pages 224–237, Saint-Malo, France,

1997.

[148] Leslie G. Valiant. A bridging model for parallel computation. Communications

of the ACM, 33(8):103–111, 1990.

[149] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A break

in the clouds: Towards a cloud deﬁnition. ACM SIGCOMM Computer Commu-

nication Review, 39(1):50–55, 2009.

[150] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word align-

ment in statistical translation. In Proceedings of the 16th International Confer-

ence on Computational Linguistics (COLING 1996), pages 836–841, Copenhagen,

Denmark, 1996.

[151] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang.

PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Proceed-

ings of the Fifth International Conference on Algorithmic Aspects in Information

and Management (AAIM 2009), pages 301–314, San Francisco, California, 2009.

[152] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’

networks. Nature, 393:440–442, 1998.

[153] Xingzhi Wen and Uzi Vishkin. FPGA-based prototype of a PRAM-On-Chip

processor. In Proceedings of the 5th Conference on Computing Frontiers, pages

55–66, Ischia, Italy, 2008.

[154] Tom White. Hadoop: The Deﬁnitive Guide. O’Reilly, Sebastopol, California, 2009.

[155] Eugene Wigner. The unreasonable eﬀectiveness of mathematics in the natural

sciences. Communications on Pure and Applied Mathematics, 13(1):1–14, 1960.

[156] Ian H. Witten, Alistair Moﬀat, and Timothy C. Bell. Managing Gigabytes: Com-

pressing and Indexing Documents and Images. Morgan Kaufmann Publishing,

San Francisco, California, 1999.

[157] Jinxi Xu and W. Bruce Croft. Corpus-based stemming using cooccurrence of word

variants. ACM Transactions on Information Systems, 16(1):61–81, 1998.

[158] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transac-

tions on Neural Networks, 16(3):645–678, 2005.

[159] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson,

Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-

purpose distributed data-parallel computing using a high-level language. In Pro-

ceedings of the 8th Symposium on Operating System Design and Implementation

(OSDI 2008), pages 1–14, San Diego, California, 2008.

[160] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott

Shenker, and Ion Stoica. Job scheduling for multi-user MapReduce clusters. Tech-

nical Report UCB/EECS-2009-55, Electrical Engineering and Computer Sciences,

University of California at Berkeley, 2009.

[161] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Sto-

ica. Improving MapReduce performance in heterogeneous environments. In Pro-

ceedings of the 8th Symposium on Operating System Design and Implementation

(OSDI 2008), pages 29–42, San Diego, California, 2008.

[162] Justin Zobel and Alistair Moﬀat. Inverted ﬁles for text search engines. ACM

Computing Surveys, 38(6):1–56, 2006.

CHAPTER 1

Introduction

MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project). Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, "big data" is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us.

Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: "web-scale" processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but also to countless startups and niche players as well. Just think: how many companies do you know that start their pitch with "we're going to harvest information on the web and. . . "?

Another strong area of growth is the analysis of user behavior data. Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. In fact, logging user behavior generates so much data that many organizations simply can't cope with the volume, and either turn the functionality off or throw away data after some time. This represents lost opportunities, as there is a broadly-held belief that great value lies in insights derived from mining such data. Knowing what users look at, what they click on, how much time they spend on a web page, etc., leads to better business decisions and competitive advantages. Broadly, this is known as business intelligence, which encompasses a wide range of technologies including data warehousing, data mining, and analytics.

How much data are we talking about? A few examples: Google grew from processing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day with MapReduce in 2008 [46]. In April 2009, a blog post1 was written about eBay's two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed2 similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store. More distressing, increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude). On the other hand, latency and bandwidth have improved relatively little: in the case of latency, perhaps 2× improvement during the last quarter century, and in the case of bandwidth, perhaps 50×. Given the tendency for individuals and organizations to continuously fill up whatever capacity is available, large-data problems are growing increasingly severe.

Moving beyond the commercial sphere, many have recognized the importance of data management in many scientific disciplines, where petabyte-scale datasets are also becoming increasingly common [21]. For example:

• The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [20]. Today, the Large Hadron Collider (LHC) near Geneva is the world's largest particle accelerator, designed to probe the mysteries of the universe, including the fundamental nature of matter, by recreating conditions shortly following the Big Bang. When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.3

• Astronomers have long recognized the importance of a "digital observatory" that would support the data needs of researchers across the globe—the Sloan Digital Sky Survey [145] is perhaps the most well known of these projects. Looking into the future, the Large Synoptic Survey Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every few days. When the telescope comes online around 2015 in Chile, its 3.2 gigapixel primary camera will produce approximately half a petabyte of archive images every month [19].

• The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored, organized, and delivered to scientists for further study.

1 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
2 http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
3 http://public.web.cern.ch/public/en/LHC/Computing-en.html
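The widening gap between capacity and bandwidth can be made concrete with a rough calculation. The figures below are our own illustrative assumptions in the spirit of the numbers above, not measurements from the cited studies:

```python
# Time to read an entire disk end-to-end, mid-1980s vs. today.
# All figures are rough assumptions for illustration.
then_capacity_mb = 30            # tens of megabytes
then_bandwidth_mb_s = 1.0        # ~1 MB/s sustained transfer
now_capacity_mb = 2_000_000      # ~2 TB
now_bandwidth_mb_s = 100.0       # ~100 MB/s sustained transfer

then_seconds = then_capacity_mb / then_bandwidth_mb_s    # 30 seconds
now_hours = now_capacity_mb / now_bandwidth_mb_s / 3600  # ~5.6 hours

# Capacity grew by a factor of tens of thousands while bandwidth grew
# by roughly a hundred: simply reading back everything we store takes
# vastly longer than it used to.
capacity_growth = now_capacity_mb / then_capacity_mb
bandwidth_growth = now_bandwidth_mb_s / then_bandwidth_mb_s
```

Under these assumptions a full scan goes from half a minute to most of a working day, which is the sense in which "our ability to even read back what we store is deteriorating."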

Given the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact of this technology is nothing less than transformative [103]. The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Scientists are predicting that, in the not-so-distant future, sequencing an individual's genome will be no more complex than getting a blood test today—ushering in a new era of personalized medicine, where interventions can be specifically targeted for an individual.

Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate, explore, and mine massive datasets [72]—this has been hailed as the emerging "fourth paradigm" of science [73] (complementing theory, experiments, and simulations). In other areas of academia, particularly computer science, systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as "toy systems" with limited utility. Large data is a fact of today's world and data-intensive processing is fast becoming a necessity, not merely a luxury or curiosity.

Although large data comes in a variety of forms, this book is primarily concerned with processing large amounts of text, but touches on other types of data as well (e.g., relational and graph data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing (NLP) and information retrieval (IR). Recent work in these fields is dominated by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application. There are three components to this approach: data, representations of the data, and some method for capturing regularities in the data. Data are called corpora (singular, corpus) by NLP researchers and collections by those from the IR community. Aspects of the representations of the data are called features, which may be "superficial" and easy to extract, such as the words and sequences of words themselves, or "deep" and more difficult to extract, such as the grammatical relationship between words. Finally, algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application. One common application, classification, is to sort text into categories. Examples include: Is this email spam or not spam? Is this word part of an address or a location? The first task is easy to understand, while the second task is an instance of what NLP researchers call named-entity detection [138], which is useful for local search and pinpointing locations on maps. Another common application is to rank texts according to some criteria—search is a good example, which involves ranking documents by relevance to the user's query. Another example is to automatically situate texts along a scale of "happiness", a task known as sentiment analysis or opinion mining [118], which has been applied to everything from understanding political discourse in the blogosphere to predicting the movement of stock prices.
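The superficial/deep distinction can be made concrete with a small sketch (our own toy example, not from the text): superficial features here are just the words themselves, cheap to extract, whereas deep features such as grammatical relations would require a parser and far more computation.

```python
# Extracting "superficial" bag-of-words features for a classification
# task such as spam detection. Lowercased unigram counts serve as the
# feature representation.
from collections import Counter

def superficial_features(text):
    # Each distinct token becomes a feature; its count is the value.
    return Counter(text.lower().split())

features = superficial_features("Win money now win BIG")
```

A classifier trained over such features sees only word identity, yet—as the discussion below argues—this often suffices when enough data is available.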

There is a growing body of evidence, at least in text processing, that of the three components discussed above (data, features, algorithms), data probably matters the most. Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data. But why can't we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive, they don't scale well.

Consider a simple task such as determining the correct usage of easily confusable words such as "than" and "then" in English. One can view this as a supervised machine learning problem: we can train a classifier to disambiguate between the options, and then apply the classifier to new instances of the problem (say, as part of a grammar checker). Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that most writers make correct choices (the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and Brill [14] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy, using this task as the specific example. They explored several classification algorithms (the exact ones aren't important, as we shall see), and not surprisingly, found that more data led to better accuracy. Across many different algorithms, the increase in accuracy was approximately linear in the log of the size of the training data. Furthermore, with increasing amounts of training data, the accuracy of different algorithms converged, such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don't matter, all that matters is the amount of data you have. This resulted in an even more controversial recommendation, delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data (while waiting for computers to become faster so we can process the data).

As another example, consider the problem of answering short, fact-based questions such as "Who shot Abraham Lincoln?" Instead of returning a list of documents that the user would then have to sort through, a question answering (QA) system would directly return the answer: John Wilkes Booth. This problem gained interest in the late 1990s, when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such questions based on pattern matching [27, 53, 92]. Suppose you wanted the answer to the above question. As it turns out, you can simply search for the phrase "shot Abraham Lincoln" on the web and look for what appears to its left. Or better yet, look through multiple instances of this phrase and tally up the words that appear to the left.
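The tallying idea can be sketched in a few lines. The snippets below are invented stand-ins for web search results; a real system would issue the query against a search engine and process far more matches:

```python
# A sketch of pattern matching for question answering: scan a collection
# for a phrase and tally the word appearing immediately to its left.
from collections import Counter

def tally_left_of(phrase, snippets):
    pattern = phrase.split()
    n = len(pattern)
    counts = Counter()
    for snippet in snippets:
        words = snippet.split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == pattern and i > 0:
                counts[words[i - 1]] += 1  # word immediately to the left
    return counts

snippets = [
    "John Wilkes Booth shot Abraham Lincoln in 1865",
    "actor Booth shot Abraham Lincoln at Ford's Theatre",
    "the man who shot Abraham Lincoln was Booth",
]
answer, votes = tally_left_of("shot Abraham Lincoln", snippets).most_common(1)[0]
```

Individual matches may be wrong (here, "who"), but with enough redundant statements of the fact the correct answer dominates the tally.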

This simple strategy works surprisingly well, and has become known as the redundancy-based approach to question answering. It capitalizes on the insight that in a very large text collection (i.e., the web), answers to commonly-asked questions will be stated in obvious ways, such that pattern-matching techniques suffice to extract answers accurately.

Yet another example concerns smoothing in web-scale language models [25]. A language model is a probability distribution that characterizes the likelihood of observing a particular sequence of words, estimated from a large corpus of texts. They are useful in a variety of applications, such as speech recognition (to determine what the speaker is more likely to have said) and machine translation (to determine which of possible translations is the most fluent, as we will discuss in Section 6.4). Since there are infinitely many possible strings, and probabilities must be assigned to all of them, language modeling is a more challenging task than simply keeping track of which strings were seen how many times: some number of likely strings will never be encountered, even with lots and lots of training data! Most modern language models make the Markov assumption: in a n-gram language model, the conditional probability of a word is given by the n − 1 previous words. Thus, by the chain rule, the probability of a sequence of words can be decomposed into the product of n-gram probabilities. Nevertheless, an enormous number of parameters must still be estimated from a training corpus: potentially V^n parameters, where V is the number of words in the vocabulary. Even if we treat every word on the web as the training corpus from which to estimate the n-gram probabilities, most n-grams—in any language, even English—will never have been seen. To cope with this sparseness, researchers have developed a number of smoothing techniques [35, 102, 79], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. Smoothing approaches vary in effectiveness, both in terms of intrinsic and application-specific metrics.

In 2007, Brants et al. [25] described language models trained on up to two trillion words.4 Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique the authors affectionately referred to as "stupid backoff".5 Not surprisingly, stupid backoff didn't work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data.

4 As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill's paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.
5 As in, so stupid it couldn't possibly work.
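The core of stupid backoff fits in a few lines. The sketch below follows the recursive scheme described by Brants et al. (use the relative frequency of the full n-gram if it was observed, otherwise back off to the shorter context with a fixed penalty of 0.4); note that it produces relative scores rather than normalized probabilities, and the toy counts are invented for illustration:

```python
# A sketch of "stupid backoff" n-gram scoring. Scores are not normalized
# probabilities: observed n-grams get their relative frequency, unseen
# ones back off to a shorter context, multiplied by a fixed alpha.
ALPHA = 0.4  # the backoff penalty used by Brants et al.

def stupid_backoff(words, counts, total_tokens):
    # words: a tuple of tokens ending in the word being scored.
    if len(words) == 1:
        # Base case: unigram relative frequency.
        return counts.get(words, 0) / total_tokens
    context = words[:-1]
    if counts.get(words, 0) > 0 and counts.get(context, 0) > 0:
        return counts[words] / counts[context]
    # Back off to the shorter context with a constant penalty.
    return ALPHA * stupid_backoff(words[1:], counts, total_tokens)

# Invented toy counts standing in for a corpus of 100 tokens.
counts = {("large", "data"): 8, ("large",): 20, ("data",): 10}
total_tokens = 100

seen = stupid_backoff(("large", "data"), counts, total_tokens)    # 8/20
unseen = stupid_backoff(("massive", "data"), counts, total_tokens)  # 0.4 * 10/100
```

The appeal is exactly the point of this section: no held-out discounting or normalization is needed, so the method scales trivially to corpora where principled smoothing becomes impractical.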

Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [65].6 Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic particles, human use of language is not constrained by succinct, universal "laws of grammar". There are of course rules that govern the formation of words and sentences—for example, that verbs appear before objects in English, and that subjects and verbs must agree in number in many languages—but real-world language is affected by a multitude of other factors as well: people invent new words and phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about a fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing.7 The world, he would say, is the best description of itself. In the same way, the more observations we gather about language use, the more accurate a description we have of language itself. This, in turn, translates into more effective algorithms and systems.

So, in summary, why large data? In some ways, the first answer is similar to the reason people climb mountains: because they're there. But the second answer is even more compelling. Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems. Now that we've addressed the why, let's tackle the how. Let's start with the obvious observation: data-intensive processing is beyond the capability of any individual machine and requires clusters—which means that large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce does, and the rest of this book is about the how.

1.1 COMPUTING IN THE CLOUDS

For better or for worse, it is often difficult to untangle MapReduce and large-data processing from the broader discourse on cloud computing. True, there is substantial promise in this new paradigm of computing, but unwarranted hype by the media and popular sources threatens its credibility in the long run. In some ways, cloud computing is simply brilliant marketing.

6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena, which is the exact opposite of the data-driven approach.
7 On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).

Before clouds, there were grids,8 and before grids, there were vector supercomputers, each having claimed to be the best thing since sliced bread.

So what exactly is cloud computing? This is one of those questions where ten experts will give eleven different answers. In fact, countless papers have been written simply to attempt to define the term (e.g., [9, 31, 149], just to name a few examples). Here we offer up our own thoughts and attempt to explain how cloud computing relates to MapReduce and data-intensive processing.

At the most superficial level, everything that used to be called web applications has been rebranded to become "cloud applications", which includes what we have previously called "Web 2.0" sites: anything running inside a browser that gathers and stores user-generated content now qualifies as an example of cloud computing. This includes social-networking services such as Facebook, web-based email services such as Gmail, video-sharing sites such as YouTube, and applications such as Google Docs. In this context, the cloud simply refers to the servers that power these sites, and user data is said to reside "in the cloud". The accumulation of vast quantities of user data creates large-data problems, many of which are suitable for MapReduce. To give two concrete examples: a social-networking site analyzes connections in the enormous globe-spanning graph of friendships to recommend new connections. An online email service analyzes messages and user behavior to optimize ad selection and placement. These are all large-data problems that have been tackled with MapReduce.9

Another important facet of cloud computing is what's more precisely known as utility computing [129, 31]. As the name implies, the idea behind utility computing is to treat computing resource as a metered service, like electricity or natural gas. The idea harkens back to the days of time-sharing machines, and in truth isn't very different from this antiquated form of computing. Under this model, a "cloud user" can dynamically provision any amount of computing resources from a "cloud provider" on demand and only pay for what is consumed. In practical terms, the user is paying for access to virtual machine instances that run a standard operating system such as Linux. Virtualization technology (e.g., [15]) is used by the cloud provider to allocate available physical resources and enforce isolation between multiple users that may be sharing the same hardware.

8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.
9 The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.

Once one or more virtual machine instances have been provisioned, the user has full control over the resources and can use them for arbitrary computation. Virtual machines that are no longer needed are destroyed, thereby freeing up physical resources that can be redirected to other users. Resource consumption is measured in some equivalent of machine-hours and users are charged in increments thereof.

Both users and providers benefit in the utility computing model. Users are freed from upfront capital investments necessary to build datacenters and substantial reoccurring costs in maintaining them. They also gain the important property of elasticity—as demand for computing resources grow, for example, from an unpredicted spike in customers, more resources can be seamlessly allocated from the cloud without an interruption in service. As demand falls, provisioned resources can be released. Prior to the advent of utility computing, coping with unexpected spikes in demand was fraught with challenges: under-provision and run the risk of service interruptions, or over-provision and tie up precious capital in idle machines that are depreciating.

From the utility provider point of view, this business also makes sense because large datacenters benefit from economies of scale and can be run more efficiently than smaller datacenters. In the same way that insurance works by aggregating risk and redistributing it, utility providers aggregate the computing demands for a large number of users. Although demand may fluctuate significantly for each user, overall trends in aggregate demand should be smooth and predictable, which allows the cloud provider to adjust capacity over time with less risk of either offering too much (resulting in inefficient use of capital) or too little (resulting in unsatisfied customers). In the world of utility computing, Amazon Web Services currently leads the way and remains the dominant player, but a number of other cloud providers populate a market that is becoming increasingly crowded. Most systems are based on proprietary infrastructure, but there is at least one, Eucalyptus [111], that is available open source. Increased competition will benefit cloud users, but what direct relevance does this have for MapReduce? The connection is quite simple: processing large amounts of data with MapReduce requires access to clusters with sufficient capacity. However, not everyone with large-data problems can afford to purchase and maintain clusters. This is where utility computing comes in: clusters of sufficient size can be provisioned only when the need arises, and users pay only as much as is required to solve their problems. This lowers the barrier to entry for data-intensive processing and makes MapReduce much more accessible.

A generalization of the utility computing concept is "everything as a service", which is itself a new take on the age-old idea of outsourcing. A cloud provider offering customers access to virtual machine instances is said to be offering infrastructure as a service, or IaaS for short. However, this may be too low level for many users. Enter platform as a service (PaaS), which is a rebranding of what used to be called hosted services in the "pre-cloud" era. Platform is used generically to refer to any set of well-defined services on top of which users can build applications, deploy content, etc.
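The metering model is simple enough to state in code. The hourly rate and cluster size below are invented for the example; they are not figures from the text or from any real provider's price list:

```python
# Illustrative machine-hour billing under the utility computing model:
# users are charged in increments of machine-hours actually consumed.
RATE_PER_MACHINE_HOUR = 0.10  # dollars; an assumed price

def cost(machines, hours, rate=RATE_PER_MACHINE_HOUR):
    # Total charge for a provisioned cluster, released when the job ends.
    return machines * hours * rate

# A 100-node cluster provisioned for a 5-hour MapReduce job:
job_cost = cost(100, 5)  # no idle capacity to pay for afterwards
```

The same job run on owned hardware would require paying for the cluster around the clock, which is precisely the capital-versus-elasticity trade-off described above.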

This class of services is best exemplified by Google App Engine, which provides the backend datastore and API for anyone to build highly-scalable web applications. Google maintains the infrastructure, freeing the user from having to backup, upgrade, patch, or otherwise maintain basic services such as the storage layer or the programming environment. At an even higher level, cloud providers can offer software as a service (SaaS), as exemplified by Salesforce, a leader in customer relationship management (CRM) software. Other examples include outsourcing an entire organization's email to a third party, which is commonplace today.

What does this proliferation of services have to do with MapReduce? No doubt that "everything as a service" is driven by desires for greater business efficiencies, but scale and elasticity play important roles as well. The cloud allows seamless expansion of operations without the need for careful planning and supports scales that may otherwise be difficult or cost-prohibitive for an organization to achieve. Cloud services, just like MapReduce, represent the search for an appropriate level of abstraction and beneficial divisions of labor. IaaS is an abstraction over raw physical hardware—an organization might lack the capital, expertise, or interest in running datacenters, and therefore pays a cloud provider to do so on its behalf. The argument applies similarly to PaaS and SaaS. In the same vein, the MapReduce programming model is a powerful abstraction that separates the what from the how of data-intensive processing.

1.2 BIG IDEAS

Tackling large-data problems requires a distinct approach that sometimes runs counter to traditional models of computing. In this section, we discuss a number of "big ideas" behind MapReduce. To be fair, all of these ideas have been discussed in the computer science literature for some time (some for decades), and MapReduce is certainly not the first to adopt these ideas. Nevertheless, the engineers at Google deserve tremendous credit for pulling these various threads together and demonstrating the power of these ideas on a scale previously unheard of.

Scale "out", not "up". For data-intensive workloads, a large number of commodity low-end servers (i.e., the scaling "out" approach) is preferred over a small number of high-end servers (i.e., the scaling "up" approach). The latter approach of purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (dozens, even hundreds) and a large amount of shared memory (hundreds or even thousands of gigabytes) is not cost effective, since the costs of such machines do not scale linearly (i.e., a machine with twice as many processors is often significantly more than twice as expensive). On the other hand, the low-end server market overlaps with the high-volume desktop computing market, which has the effect of keeping prices low due to competition, and therefore most existing implementations of the MapReduce programming model are designed around clusters of low-end commodity servers.

Barroso and Hölzle's recent treatise of what they dubbed "warehouse-scale computers" [18] contains a thoughtful analysis of the two approaches. The Transaction Processing Council (TPC) is a neutral, non-profit organization whose mission is to establish objective database benchmarks. Benchmark data submitted to that organization are probably the closest one can get to a fair "apples-to-apples" comparison of cost and performance for specific, well-defined relational processing applications. Based on TPC-C benchmark results from late 2007, a low-end server platform is about four times more cost efficient than a high-end shared memory platform from the same vendor. Excluding storage costs, the price/performance advantage of the low-end server increases to about a factor of twelve.

What if we take into account the fact that communication between nodes in a high-end SMP machine is orders of magnitude faster than communication between nodes in a commodity network-based cluster? Since workloads today are beyond the capability of any single machine (no matter how powerful), the comparison is more accurately between a smaller cluster of high-end machines and a larger cluster of low-end machines (network communication is unavoidable in both cases). Barroso and Hölzle model these two approaches under workloads that demand more or less communication, and conclude that a cluster of low-end servers approaches the performance of the equivalent cluster of high-end servers—the small performance gap is insufficient to justify the price premium of the high-end servers. For data-intensive applications, the conclusion appears to be clear: scaling "out" is superior to scaling "up".

Capital costs in acquiring servers is, of course, only one component of the total cost of delivering computing capacity. Operational costs are dominated by the cost of electricity to power the servers as well as other aspects of datacenter operations that are functionally related to power: power distribution, cooling, etc. As a result, energy efficiency has become a key issue in building warehouse-scale computers for large-data processing. Therefore, it is important to factor in operational costs when deploying a scale-out solution based on large numbers of commodity servers.

Datacenter efficiency is typically factored into three separate components that can be independently measured and optimized [18]. The first component measures how much of a building's incoming power is actually delivered to computing equipment, and correspondingly, how much is lost to the building's mechanical systems (e.g., cooling, air handling) and electrical infrastructure (e.g., power distribution inefficiencies). The second component measures how much of a server's incoming power is lost to the power supply, cooling fans, etc. The third component captures how much of the power delivered to computing components (processor, RAM, disk, etc.) is actually used to perform useful computations.
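Because the three components are fractions that compose multiplicatively, the overall efficiency of a datacenter is their product. The figures below are invented for illustration, not measurements from [18]:

```python
# The three datacenter-efficiency components compose multiplicatively:
# only the product of the three fractions reaches useful computation.
# All figures are assumed values for illustration.
building_to_it = 0.70        # fraction of building power reaching IT gear
server_to_components = 0.80  # fraction surviving power supply, fans, etc.
components_useful = 0.60     # fraction spent on useful computation

useful_fraction = building_to_it * server_to_components * components_useful

# Per kilowatt entering the building, how many watts do useful work?
useful_watts_per_kw = useful_fraction * 1000
```

Under these assumptions only about a third of the power entering the building performs useful computation, which is why each component is measured and optimized independently.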

Of the three components of datacenter efficiency, the first two are relatively straightforward to objectively quantify. Adoption of industry best-practices can help datacenter operators achieve state-of-the-art efficiency. The third component, however, is much more difficult to measure. One important issue that has been identified is the non-linearity between load and power draw. That is, a server at 10% utilization may draw slightly more than half as much power as a server at 100% utilization (which means that a lightly-loaded server is much less efficient than a heavily-loaded server). A survey of five thousand Google servers over a six-month period shows that servers operate most of the time at between 10% and 50% utilization [17], which is an energy-inefficient operating region. As a result, Barroso and Hölzle have advocated for research and development in energy-proportional machines, where energy consumption would be proportional to load, such that an idle processor would (ideally) consume no power, but yet retain the ability to power up (nearly) instantaneously in response to demand.

Although we have provided a brief overview here, datacenter efficiency is a topic that is beyond the scope of this book. For more details, consult Barroso and Hölzle [18] and Hamilton [67], who provide detailed cost models for typical modern datacenters. However, even factoring in operational costs, evidence suggests that scaling out remains more attractive than scaling up.

Assume failures are common. At warehouse scale, failures are not only inevitable, but commonplace. A simple calculation suffices to demonstrate: let us suppose that a cluster is built from reliable machines with a mean-time between failures (MTBF) of 1000 days (about three years). Even with these reliable servers, a 10,000-server cluster would still experience roughly 10 failures a day. For the sake of argument, let us suppose that a MTBF of 10,000 days (about thirty years) were achievable at realistic costs (which is unlikely). Even then, a 10,000-server cluster would still experience one failure daily. This means that any large-scale service that is distributed across a large cluster (either a user-facing application or a computing platform like MapReduce) must cope with hardware failures as an intrinsic aspect of its operation [66]. That is, a server may fail at any time, without notice. For example, in large clusters disk failures are common [123] and RAM experiences more errors than one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).

A well-designed, fault-tolerant service must cope with failures up to a point without impacting the quality of service—failures should not result in inconsistencies or indeterminism from the user perspective. As servers go down, other cluster nodes should seamlessly step in to handle the load, and overall performance should gracefully degrade as server failures pile up. Just as important, a broken server that has been repaired should be able to seamlessly rejoin the service without manual reconfiguration by the administrator.
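The failure arithmetic above follows directly from treating each server's failures as independent: a cluster of n machines sees roughly n divided by the per-machine MTBF failures per day.

```python
# The "simple calculation" from the text: expected cluster-wide failures
# per day, assuming independent servers with a given MTBF in days.
def expected_failures_per_day(n_servers, mtbf_days):
    return n_servers / mtbf_days

reliable = expected_failures_per_day(10_000, 1_000)        # ~10 per day
very_reliable = expected_failures_per_day(10_000, 10_000)  # ~1 per day
```

Even the optimistic thirty-year MTBF leaves a daily failure, which is why fault tolerance must be designed in rather than bolted on.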

A repaired server should be able to seamlessly rejoin the service without manual reconfiguration by the administrator. Mature implementations of the MapReduce programming model are able to robustly cope with failures through a number of mechanisms such as automatic task restarts on different cluster nodes.

Move processing to the data. In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have "processing nodes" and "storage nodes" linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

Process data sequentially and avoid random access. Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data is processed sequentially. A simple scenario10 poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access.11

The development of solid-state drives is unlikely to change this balance, for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large-data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.

MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster.

10 Adapted from a post by Ted Dunning on the Hadoop mailing list.
11 For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.
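The back-of-the-envelope numbers in the scenario above can be reproduced with a short sketch. The seek time and transfer rate below are my own assumed round figures for a mechanical disk, not values given in the text:

```python
# Scenario: 1 TB database of 10^10 100-byte records; update 1% of them.
SEEK_TIME_S = 0.010       # assumed: 10 ms per random seek
THROUGHPUT_BPS = 100e6    # assumed: 100 MB/s sequential transfer

records = 10**10
record_size = 100                      # bytes
db_bytes = records * record_size       # 1 terabyte

# Random access: seek to each of the 1% of records (read, then write back).
updates = records // 100
random_days = updates * 2 * SEEK_TIME_S / 86_400

# Sequential access: read the whole database and rewrite every record.
sequential_hours = 2 * db_bytes / THROUGHPUT_BPS / 3_600

print(f"random updates: ~{random_days:.0f} days")       # on the order of a month
print(f"full rewrite:   ~{sequential_hours:.1f} hours")  # under a work day
```

Under these assumptions the random-access plan takes roughly three weeks while the full sequential rewrite takes about five and a half hours, consistent with the "about a month" versus "under a work day" contrast in the text.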

Many aspects of MapReduce's design explicitly trade latency for throughput.

Hide system-level details from the application developer. According to many guides on the practice of software engineering written by experienced industry professionals, one of the key reasons why writing code is difficult is because the programmer must simultaneously keep track of many details in short term memory—ranging from the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algorithm that requires special treatment). This imposes a high cognitive load and requires intense concentration, which leads to a number of recommendations about a programmer's environment (e.g., comfortable furniture, quiet office, large monitors, etc.). The challenges in writing distributed software are greatly compounded—the programmer must manage details across several threads, processes, or machines. Of course, the biggest headache in distributed programming is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns. This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are taught to use low-level devices such as mutexes and to apply high-level "design patterns" such as producer–consumer queues to tackle these challenges, but the truth remains: concurrent programs are notoriously difficult to reason about and even harder to debug.

MapReduce addresses the challenges of distributed programming by providing an abstraction that isolates the developer from system-level details (e.g., locking of data structures, data starvation issues in the processing pipeline, etc.). The programming model specifies simple and well-defined interfaces between a small number of components, and therefore is easy for the programmer to reason about. MapReduce maintains a separation of what computations are to be performed and how those computations are actually carried out on a cluster of machines. The first is under the control of the programmer, while the second is exclusively the responsibility of the execution framework or "runtime". The advantage is that the execution framework only needs to be designed once and verified for correctness—thereafter, as long as the developer expresses computations in the programming model, code is guaranteed to behave as expected. The upshot is that the developer is freed from having to worry about system-level details (e.g., no more debugging race conditions and addressing lock contention) and can instead focus on algorithm or application design.

Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms are highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can define scalability along at least two dimensions.12 First, in terms of data: given twice the amount of data, the same algorithm should take at most twice as long to run, all else being equal. Second, in terms of resources: given a cluster twice

12 See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.

the size, the same algorithm should take no more than half as long to run. Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across a wide range of settings: on data ranging from gigabytes to petabytes, on clusters consisting of a few to a few thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors without requiring any modifications whatsoever, not even tuning of parameters.

Other than for embarrassingly parallel problems, algorithms with the characteristics sketched above are, of course, unobtainable. One of the fundamental assertions in Fred Brook's classic The Mythical Man-Month [28] is that adding programmers to a project behind schedule will only make it fall further behind. This is because complex tasks cannot be chopped into smaller pieces and allocated in a linear fashion, and is often illustrated with a cute quote: "nine women cannot have a baby in one month". Although Brook's observations are primarily about software engineers and the software development process, the same is also true of algorithms: increasing the degree of parallelization also increases communication costs. The algorithm designer is faced with diminishing returns, and beyond a certain point, greater efficiencies gained by parallelization are entirely offset by increased communication requirements.

Nevertheless, these fundamental limitations shouldn't prevent us from at least striving for the unobtainable. The truth is that most current algorithms are far from the ideal. In the domain of text processing, for example, most algorithms today assume that data fits in memory on a single machine. For the most part, this is a fair assumption. But what happens when the amount of data doubles in the near future, and then doubles again shortly thereafter? Simply buying more memory is not a viable solution, as the amount of data is growing faster than the price of memory is falling. Furthermore, the price of a machine does not scale linearly with the amount of available memory beyond a certain point (once again, the scaling "up" vs. scaling "out" argument). Quite simply, algorithms that require holding intermediate data in memory on a single machine will simply break on sufficiently-large datasets—moving from a single machine to a cluster architecture requires fundamentally different algorithms (and reimplementations).

Perhaps the most exciting aspect of MapReduce is that it represents a small step toward algorithms that behave in the ideal manner discussed above. Recall that the programming model maintains a clear separation between what computations need to occur with how those computations are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains fixed, and it is the responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce programming model is simple enough that it is actually possible, in many circumstances, to approach the ideal scaling characteristics discussed above. We introduce the idea of the "tradeable machine hour", as a play on Brook's classic title. If running an algorithm on a particular dataset takes 100 machine hours, then we should be able to finish in an hour on a cluster

of 100 machines, or use a cluster of 10 machines to complete the same task in ten hours.13 With MapReduce, at least for some applications, this isn't so far from the truth.

1.3 WHY IS THIS DIFFERENT?

"Due to the rapidly decreasing cost of processing, memory, and communication, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally intensive domains. This, however, has not happened." — Leslie Valiant [148]14

For several decades, computer scientists have predicted that the dawn of the age of parallel computing was "right around the corner" and that sequential processing would soon fade into obsolescence (consider, for example, the above quote). Yet, until very recently, they have been wrong. The relentless progress of Moore's Law for several decades has ensured that most of the world's problems could be solved by single-processor machines, save the needs of a few (scientists simulating molecular interactions or nuclear reactions, for example). Couple that with the inherent challenges of concurrency, and the result has been that parallel processing and distributed systems have largely been confined to a small segment of the market and esoteric upper-level electives in the computer science curriculum.

However, all of that changed around the middle of the first decade of this century. The manner in which the semiconductor industry had been exploiting Moore's Law simply ran out of opportunities for improvement: faster clocks, deeper pipelines, superscalar architectures, and other tricks of the trade reached a point of diminishing returns that did not justify continued investment. This marked the beginning of an entirely new strategy and the dawn of the multi-core era [115]. Unfortunately, this radical shift in hardware architecture was not matched at that time by corresponding advances in how software could be easily designed for these new processors (but not for lack of trying [104]). Nevertheless, parallel processing became an important issue at the forefront of everyone's mind—it represented the only way forward.

At around the same time, we witnessed the growth of large-data problems. In the late 1990s and even during the beginning of the first decade of this century, relatively few organizations had data-intensive processing needs that required large clusters: a handful of internet companies and perhaps a few dozen large corporations. But then, everything changed. Through a combination of many different factors (falling prices of disks, rise of user-generated web content, etc.), large-data problems began popping up everywhere. Data-intensive processing needs became widespread, which drove innovations in distributed computing such as MapReduce—first by Google, and then by Yahoo

13 Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.
14 Guess when this was written? You may be surprised.

and the open source community. This in turn created more demand: when organizations learned about the availability of effective data analysis tools for large datasets, they began instrumenting various business processes to gather even more data—driven by the belief that more data leads to deeper insights and greater competitive advantages. Today, not only are large-data problems ubiquitous, but technological solutions for addressing them are widely accessible. Anyone can download the open source Hadoop implementation of MapReduce, pay a modest fee to rent a cluster from a utility cloud provider, and be happily processing terabytes upon terabytes of data within the week. Finally, the computer scientists are right—the age of parallel computing has begun, both in terms of multiple cores in a chip and multiple machines in a cluster (each of which often has multiple cores).

Why is MapReduce important? In practical terms, it provides a very effective tool for tackling large-data problems. But beyond that, MapReduce is important in how it has changed the way we organize computations at a massive scale. MapReduce represents the first widely-adopted step away from the von Neumann model that has served as the foundation of computer science over the last half plus century. Valiant called this a bridging model [148], a conceptual bridge between the physical implementation of a machine and the software that is to be executed on that machine. Until recently, the von Neumann model has served us well: hardware designers focused on efficient implementations of the von Neumann model and didn't have to think much about the actual software that would run on the machines. Similarly, the software industry developed software targeted at the model without worrying about the hardware details. The result was extraordinary growth: chip designers churned out successive generations of increasingly powerful processors, and software engineers were able to develop applications in high-level languages that exploited those processors.

Today, however, the von Neumann model isn't sufficient anymore: we can't treat a multi-core processor or a large cluster as an agglomeration of many von Neumann machine instances communicating over some interconnect. Such a view places too much burden on the software developer to effectively take advantage of available computational resources—it simply is the wrong level of abstraction. MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. As Barroso puts it, the datacenter is the computer [18, 119].

MapReduce is certainly not the first model of parallel computation that has been proposed. The most prevalent model in theoretical computer science, which dates back several decades, is the PRAM [77, 60].15 In the model, an arbitrary number of processors, sharing an unboundedly large memory, operate synchronously on a shared input to produce some output. Other models include LogP [43] and BSP [148].

15 More than a theoretical model, the PRAM has been recently prototyped in hardware [153].

For reasons that are beyond the scope of this book, none of these previous models have enjoyed the success that MapReduce has in terms of adoption and in terms of impact on the daily lives of millions of users.16

1.4 WHAT THIS BOOK IS NOT

Actually, not quite yet. A final word before we get started. This book is about MapReduce algorithm design, particularly for text processing (and related) applications. Although our presentation most closely follows the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don't, for example, discuss APIs, command-line invocations for running jobs, etc. For those aspects, we refer the reader to Tom White's excellent book, "Hadoop: The Definitive Guide", published by O'Reilly [154].

MapReduce is the most successful abstraction over large-scale computational resources we have seen to date. However, as anyone who has taken an introductory computer science course knows, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. They, inevitably, are imperfect—making certain tasks easier but others more difficult, and sometimes, impossible (in the case where the detail suppressed by the abstraction is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data problems easier, but suffers from limitations as well. This means that MapReduce is not the final word, but rather the first in a new class of programming models that will allow us to more effectively organize computations at a massive scale. So if MapReduce is only the beginning, what's next beyond MapReduce? We're getting ahead of ourselves, as we can't meaningfully answer this question before thoroughly understanding what MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our exploration.

16 Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to bear accumulated knowledge about parallel algorithms; Karloff et al. [82], for example, demonstrated that a large class of PRAM algorithms can be efficiently simulated via MapReduce.

CHAPTER 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller sub-problems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers—threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1 The general principles behind divide-and-conquer algorithms are broadly applicable to a wide range of problems in many different application domains. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be addressed:

• How do we break up a large problem into smaller tasks? More specifically, how do we decompose the problem so that the smaller tasks can be executed in parallel?
• How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)?
• How do we ensure that the workers get the data they need?
• How do we coordinate synchronization among the different workers?
• How do we share partial results from one worker that is needed by another?
• How do we accomplish all of the above in the face of software errors and hardware faults?

In traditional parallel or distributed programming environments, the developer needs to explicitly address many (and sometimes, all) of the above issues. In shared memory programming, the developer needs to explicitly coordinate access to shared data structures through synchronization primitives such as mutexes, to explicitly handle process synchronization through devices such as barriers, and to remain ever vigilant for common problems such as deadlocks and race conditions. Language extensions, like

1 We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real world problems.

OpenMP for shared memory parallelism,2 or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism,3 provide logical abstractions that hide details of operating system synchronization and communications primitives. However, even with these extensions, developers are still burdened to keep track of how resources are made available to workers. Additionally, these frameworks are mostly designed to tackle processor-intensive problems and have only rudimentary support for dealing with very large amounts of input data. When using existing parallel computing approaches for large-data computation, the programmer must devote a significant amount of attention to low-level system details, which detracts from higher-level problem solving. One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the programmer. Therefore, a developer can focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity).

However, organizing and coordinating large amounts of computation is only part of the challenge. Large-data processing by definition requires bringing data and code together for computation to occur—no small feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data. This is operationally realized by spreading data across the local disks of nodes in a cluster and running processes on nodes that hold the data. The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce. Section 2.5 covers this in detail. MapReduce would not be practical without a tightly-integrated distributed file system that manages the data being processed.

This chapter introduces the MapReduce programming model and the underlying distributed file system. We start in Section 2.1 with an overview of functional programming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow. Tying everything together, a complete cluster architecture is described in Section 2.6 before the chapter ends with a summary.

2 http://www.openmp.org/
3 http://www.mcs.anl.gov/mpi/
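The shared-memory burden mentioned above (explicit coordination through mutexes) can be made concrete with a small illustrative sketch of my own, not taken from the book. Without the lock, the read-modify-write on the shared counter is a race condition:

```python
import threading

counter = 0
lock = threading.Lock()  # the mutex protecting the shared counter

def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        # Without the lock, this read-modify-write can interleave across
        # threads and lose updates; the lock serializes the critical section.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; unpredictable without it
```

This is exactly the kind of low-level bookkeeping that MapReduce's abstraction keeps away from the application developer.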


Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional programming: map takes a function f and applies it to every element in a list, while fold iteratively applies a function g to aggregate results.

2.1 FUNCTIONAL PROGRAMMING ROOTS

MapReduce has its roots in functional programming, which is exemplified in languages such as Lisp and ML.4 A key feature of functional languages is the concept of higher-order functions, or functions that can accept other functions as arguments. Two common built-in higher-order functions are map and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in a list (the top part of the diagram). Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied to the initial value and the first item in the list, the result of which is stored in an intermediate variable. This intermediate variable and the next item in the list serve as the arguments to a second application of g, the results of which are stored in the intermediate variable. This process repeats until all items in the list have been consumed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. For example, to compute the sum of squares of a list of integers, one could map a function that squares its argument (i.e., λx.x²) over the input list, and then fold the resulting list with the addition function (more precisely, λxλy.x + y) using an initial value of zero. We can view map as a concise way to represent the transformation of a dataset (as defined by the function f). In the same vein, we can view fold as an aggregation operation, as defined by the function g. One immediate observation is that the application of f to each item in a list (or more generally, to elements in a large dataset)

4 However, there are important characteristics of MapReduce that make it non-functional in nature—this will become apparent later.


can be parallelized in a straightforward manner, since each functional application happens in isolation. In a cluster, these operations can be distributed across many diﬀerent machines. The fold operation, on the other hand, has more restrictions on data locality—elements in the list must be “brought together” before the function g can be applied. However, many real-world applications do not require g to be applied to all elements of the list. To the extent that elements in the list can be divided into groups, the fold aggregations can also proceed in parallel. Furthermore, for operations that are commutative and associative, signiﬁcant eﬃciencies can be gained in the fold operation through local aggregation and appropriate reordering. In a nutshell, we have described MapReduce. The map phase in MapReduce roughly corresponds to the map operation in functional programming, whereas the reduce phase in MapReduce roughly corresponds to the fold operation in functional programming. As we will discuss in detail shortly, the MapReduce execution framework coordinates the map and reduce phases of processing over large amounts of data on large clusters of commodity machines. Viewed from a slightly diﬀerent angle, MapReduce codiﬁes a generic “recipe” for processing large datasets that consists of two stages. In the ﬁrst stage, a user-speciﬁed computation is applied over all input records in a dataset. These operations occur in parallel and yield intermediate output that is then aggregated by another user-speciﬁed computation. The programmer deﬁnes these two types of computations, and the execution framework coordinates the actual processing (very loosely, MapReduce provides a functional abstraction). Although such a two-stage processing structure may appear to be very restrictive, many interesting algorithms can be expressed quite concisely— especially if one decomposes complex algorithms into a sequence of MapReduce jobs. 
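Both observations above can be checked in a few lines of Python (my own sketch, not the book's): map with f = λx.x² transforms the list, fold (functools.reduce) with g = λxλy.x + y aggregates it, and because addition is commutative and associative, folding per-group partial results gives the same answer as one sequential fold:

```python
from functools import reduce

xs = [3, 1, 4, 1, 5, 9, 2, 6]

# map: apply f = λx.x² to every element of the list.
squares = list(map(lambda x: x * x, xs))

# fold: aggregate with g = λxλy.x + y, starting from the initial value 0.
total = reduce(lambda x, y: x + y, squares, 0)

# Because + is commutative and associative, the fold can proceed in
# parallel: fold each group locally, then fold the partial results.
groups = [squares[:4], squares[4:]]
partials = [reduce(lambda x, y: x + y, g, 0) for g in groups]
parallel_total = reduce(lambda x, y: x + y, partials, 0)

print(total, parallel_total)  # both 173
```

The per-group folds are exactly the "local aggregation" the text alludes to; in a cluster each group would live on a different machine.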
Subsequent chapters in this book focus on how a number of algorithms can be implemented in MapReduce. To be precise, MapReduce can refer to three distinct but related concepts. First, MapReduce is a programming model, which is the sense discussed above. Second, MapReduce can refer to the execution framework (i.e., the “runtime”) that coordinates the execution of programs written in this particular style. Finally, MapReduce can refer to the software implementation of the programming model and the execution framework: for example, Google’s proprietary implementation vs. the open-source Hadoop implementation in Java. And in fact, there are many implementations of MapReduce, e.g., targeted speciﬁcally for multi-core processors [127], for GPGPUs [71], for the CELL architecture [126], etc. There are some diﬀerences between the MapReduce programming model implemented in Hadoop and Google’s proprietary implementation, which we will explicitly discuss throughout the book. However, we take a rather Hadoop-centric view of MapReduce, since Hadoop remains the most mature and accessible implementation to date, and therefore the one most developers are likely to use.


2.2 MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. Keys and values may be primitives such as integers, ﬂoating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers typically need to deﬁne their own custom data types, although a number of libraries such as Protocol Buﬀers,5 Thrift,6 and Avro7 simplify the task. Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content. For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly meaningful and are simply ignored during processing, while in other cases input keys are used to uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex keys and values in the design of various algorithms. In MapReduce, the programmer deﬁnes a mapper and a reducer with the following signatures: map: (k1 , v1 ) → [(k2 , v2 )] reduce: (k2 , [v2 ]) → [(k3 , v3 )]
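The signatures above can be mimicked on a single machine in a few lines of Python. This is an illustrative sketch of my own that simulates only the semantics (map, then a sort and group-by on intermediate keys, then reduce), not a distributed runtime:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(mapper, reducer, inputs):
    """Simulate map -> sort/group by key -> reduce on one machine."""
    # Map phase: each input pair yields a list of intermediate pairs.
    intermediate = [kv for k, v in inputs for kv in mapper(k, v)]
    # "Shuffle and sort": bring together all values with the same key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend(reducer(key, values))
    return output

# Toy job (hypothetical data): total character count per file extension.
docs = [("a.txt", "hello"), ("b.txt", "hi"), ("c.md", "hey")]
mapper = lambda name, text: [(name.rsplit(".", 1)[1], len(text))]
reducer = lambda ext, lengths: [(ext, sum(lengths))]

print(run_mapreduce(mapper, reducer, docs))  # [('md', 3), ('txt', 7)]
```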

The convention [. . .] is used throughout this book to denote a list. The input to a MapReduce job starts as data stored on the underlying distributed ﬁle system (see Section 2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of ﬁles) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.8 Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across diﬀerent reducers. Output key-value pairs from each reducer are written persistently back onto the distributed ﬁle system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r ﬁles on the distributed ﬁle system, where r is the number of reducers. For the most part, there is no need to consolidate reducer output, since the r ﬁles often serve as input to yet another MapReduce job. Figure 2.2 illustrates this two-stage processing structure. A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the ﬁrst step in, for example, building a unigram language model (i.e., probability

5 http://code.google.com/p/protobuf/
6 http://incubator.apache.org/thrift/
7 http://hadoop.apache.org/avro/
8 This characterization, while conceptually accurate, is a slight simplification. See Section 2.6 for more details.

Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)

1: class Reducer
2:   method Reduce(term t, counts [c1, c2, . . .])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       sum ← sum + c
6:     Emit(term t, count sum)

Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.
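As a sanity check of Figure 2.3's logic, here is a direct single-machine transcription in Python; the in-memory dictionary plays the role of the shuffle-and-sort barrier, which a real execution framework would distribute:

```python
from collections import defaultdict

def map_word_count(docid, doc):
    # Emit (term, 1) for every term in the document.
    for term in doc.split():
        yield term, 1

def reduce_word_count(term, counts):
    # Sum all partial counts for the term.
    yield term, sum(counts)

# Hypothetical toy collection, keyed by document id.
docs = {"d1": "a rose is a rose", "d2": "is a rose"}

# Shuffle and sort: group intermediate values by key.
grouped = defaultdict(list)
for docid, doc in docs.items():
    for term, count in map_word_count(docid, doc):
        grouped[term].append(count)

result = dict(kv for term, counts in sorted(grouped.items())
                 for kv in reduce_word_count(term, counts))
print(result)  # {'a': 3, 'is': 2, 'rose': 3}
```

Sorting the grouped keys mirrors the fact that each reducer receives its keys in sorted order.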

This simple algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., a probability distribution over words in a collection). Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document, and the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we've seen the word once). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key, and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted by alphabetical order, and each file will contain roughly the same number of words. The partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The output can be examined by the programmer or used as input to another MapReduce program.

To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split) and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework (see next section) makes the final determination based on the physical layout of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the number of reduce tasks. We will return to discuss the details of Hadoop job execution in Section 2.6, which is dependent on an understanding of the distributed file system (covered in Section 2.5).

There are some differences between the Hadoop implementation of MapReduce and Google's implementation.9 In Hadoop, the reducer is presented with a key and an iterator over all values associated with the particular key. The values are arbitrarily ordered. Google's implementation allows the programmer to specify a secondary sort key for ordering the values (if desired), in which case values associated with each key would be presented to the developer's reduce code in sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary sorting. Another difference: in Google's implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys).

9 Personal communication, Jeff Dean.
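To make the data flow concrete, the word count algorithm of Figure 2.3 can be simulated in a few lines of ordinary Python. This is our own illustrative sketch, not Hadoop code: the function names are ours, and the shuffle-and-sort barrier is modeled as an in-memory sort and group-by.

```python
from itertools import groupby
from operator import itemgetter

def map_word_count(docid, doc):
    # Emit an intermediate (term, 1) pair for every token in the document.
    for term in doc.split():
        yield (term, 1)

def reduce_word_count(term, counts):
    # Sum all partial counts associated with a term.
    yield (term, sum(counts))

def run_job(documents):
    # Map phase: apply the mapper to every (docid, doc) input pair.
    intermediate = [pair for docid, doc in documents.items()
                    for pair in map_word_count(docid, doc)]
    # "Shuffle and sort": group intermediate pairs by key, so the reducer
    # sees each key together with all of its values.
    intermediate.sort(key=itemgetter(0))
    output = {}
    for term, group in groupby(intermediate, key=itemgetter(0)):
        for key, total in reduce_word_count(term, (c for _, c in group)):
            output[key] = total
    return output

counts = run_job({"d1": "a b b", "d2": "b c"})  # {'a': 1, 'b': 3, 'c': 1}
```

As in the real framework, the reducer never runs until grouping is complete; the sort stands in for the barrier between the two phases.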

What are the restrictions on mappers and reducers? Mappers and reducers can express arbitrary computations over their inputs. However, one must generally be careful about use of external resources, since multiple mappers or reducers may be contending for those resources. For example, it may be unwise for a mapper to query an external SQL database, since that would introduce a scalability bottleneck on the number of map tasks that could be run in parallel (they might all be simultaneously querying the database).10 In general, mappers can emit an arbitrary number of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs. Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type from the intermediate key-value pairs.

Although not permitted in functional programming, mappers and reducers can have side effects. This is a powerful and useful feature: for example, preserving state across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such algorithms can be understood as having side effects that only change state that is internal to the mapper or reducer. While the correctness of such algorithms may be more difficult to guarantee (since the function's behavior depends not only on the current input but also on previous inputs), most potential synchronization problems are avoided because internal state is private to individual mappers and reducers. In other cases (see Section 4.4 and Section 6.5), it may be useful for mappers or reducers to have external side effects, such as writing files to the distributed file system. Since many mappers and reducers run in parallel, and the distributed file system is a shared global resource, special care must be taken to ensure that such operations avoid synchronization conflicts. One strategy is to write a temporary file that is renamed upon successful completion of the mapper or reducer [45].

In addition to the "canonical" MapReduce processing flow, other variations are also possible. MapReduce programs can contain no reducers, in which case mapper output is written directly to disk (one file per mapper). For embarrassingly parallel problems, e.g., parsing a large text collection or independently analyzing a large number of images, this is a common pattern. The converse, a MapReduce program with no mappers, is not possible, although in some cases it is useful for the mapper to implement the identity function and simply pass input key-value pairs to the reducers. This has the effect of sorting and regrouping the input for reduce-side processing. Similarly, in some cases it is useful for the reducer to implement the identity function, in which case the program simply sorts and groups mapper output. Finally, running identity mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful).

To reiterate: although the presentation of algorithms in this book closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm design and conceptual understanding, not actual Hadoop programming. For that, we would recommend Tom White's book [154].

10 Unless, of course, the database itself is highly scalable.
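The write-then-rename strategy for safe external side effects mentioned above can be sketched as follows. This is a minimal illustration using a local directory to stand in for the distributed file system; the function name is ours.

```python
import os
import tempfile

def write_output_atomically(final_path, records):
    # Write to a temporary file in the same directory, then rename it into
    # place only after every record has been written successfully. A task
    # that fails midway leaves no partial output at final_path, so a
    # re-executed task attempt cannot collide with a half-written file.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(record + "\n")
        os.replace(tmp_path, final_path)  # atomic rename on POSIX systems
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because the rename is atomic, other processes observe either no output file or the complete one, never an intermediate state.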

2.3 THE EXECUTION FRAMEWORK

One of the most important ideas behind MapReduce is separating the what of distributed processing from the how. A MapReduce program, referred to as a job, consists of code for mappers and reducers (as well as combiners and partitioners, to be discussed in the next section) packaged together with configuration parameters (such as where the input lies and where the output should be stored). The developer submits the job to the submission node of a cluster (in Hadoop, this is called the jobtracker), and the execution framework (sometimes called the "runtime") takes care of everything else: it transparently handles all other aspects of distributed code execution, on clusters ranging from a single node to a few thousand nodes. Specific responsibilities include:

Scheduling. Each MapReduce job is divided into smaller units called tasks (see Section 2.6 for more details). For example, a map task may be responsible for processing a certain block of input key-value pairs (called an input split in Hadoop); similarly, a reduce task may handle a portion of the intermediate key space. It is not uncommon for MapReduce jobs to have thousands of individual tasks that need to be assigned to nodes in the cluster. In large jobs, the total number of tasks may exceed the number that can be run on the cluster concurrently, making it necessary for the scheduler to maintain some sort of task queue and to track the progress of running tasks so that waiting tasks can be assigned to nodes as they become available. Another aspect of scheduling involves coordination among tasks belonging to different jobs (e.g., from different users). How can a large, shared resource support several users simultaneously in a predictable, transparent, policy-driven fashion? There has been some recent work along these lines in the context of Hadoop [131, 160].

Although in the most common case input to a MapReduce job comes from data stored on the distributed file system and output is written back to the distributed file system, any other system that satisfies the proper abstractions can serve as a data source or sink. With Google's MapReduce implementation, BigTable [34], a sparse, distributed, persistent multidimensional sorted map, is frequently used as a source of input and as a store of MapReduce output. HBase is an open-source BigTable clone and has similar capabilities. Also, Hadoop has been integrated with existing MPP (massively parallel processing) relational databases, which allows a programmer to write MapReduce jobs over database rows and dump output into a new database table. Finally, in some cases MapReduce jobs may not consume any input at all (e.g., computing π) or may only consume a small amount of data (e.g., input parameters to many instances of processor-intensive simulations running in parallel).

however. the scheduler starts tasks on the node that holds a particular block of data (i. this issue is inexplicably intertwined with scheduling and relies heavily on the design of the underlying distributed ﬁle system. and Google has reported that speculative execution can improve job running times by 44% [45]. One cause of stragglers is ﬂaky hardware: for example. the map phase of a job is only as fast as the slowest map task. since each copy of the reduce task needs to pull data over the network. Better local aggregation. the common wisdom is that the technique is more helpful for map tasks than reduce tasks. to share intermediate results or otherwise exchange state information. that speculative execution cannot adequately address another common cause of stragglers: skew in the distribution of values associated with intermediate keys (leading to reduce stragglers). the more general point remains—in order for computation to occur.2. In MapReduce. discussed in the next chapter. synchronization is accomplished by a barrier between the map and reduce phases of processing. which means that the task or tasks responsible for processing the most frequent few elements will run much longer than the typical task. In general. Synchronization.11 To achieve data locality.. since inter-rack bandwidth is signiﬁcantly less than intra-rack bandwidth. The phrase data distribution is misleading. Similarly. we need to somehow feed data to the code. Data/code co-location. Note. and the necessary data will be streamed over the network. new tasks will be started elsewhere. Due to the barrier between the map and reduce tasks. Zaharia et al. Recall that MapReduce may receive its input from other sources. and the framework simply uses the result of the ﬁrst task attempt to ﬁnish. . which is accomplished by a large distributed 11 In the canonical case. Although in Hadoop both map and reduce tasks can be speculatively executed. or tasks that take an usually long time to complete. 
If this is not possible (e. synchronization refers to the mechanisms by which multiple concurrently running processes “join up”. THE EXECUTION FRAMEWORK 27 Speculative execution is an optimization that is implemented by both Hadoop and Google’s MapReduce implementation (called “backup tasks” [45]).g.3. Intermediate key-value pairs must be grouped by key. for example. not the data. since one of the key ideas behind MapReduce is to move the code.e. An important optimization here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant data block. [161] presented diﬀerent execution strategies in a recent paper. the completion time of a job is bounded by the running time of the slowest reduce task. With speculative execution. However. This has the eﬀect of moving code to the data. a machine that is suﬀering from recoverable errors may become signiﬁcantly slower. on its local drive) needed by the task. that is. a node is already running too many tasks). In text processing we often observe Zipﬁan distributions. an identical copy of the same task is executed on a diﬀerent machine.. In MapReduce. is one possible solution to this problem. the speed of a MapReduce job is sensitive to what are known as stragglers. As a result.
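The straggler effect, and how speculative execution mitigates it, can be illustrated with a toy model. The task times below are invented for illustration; the point is only that the phase finishes with its slowest task, and that a backup attempt lets the framework take whichever copy finishes first.

```python
def phase_completion_time(task_times):
    # A phase is only as fast as its slowest task.
    return max(task_times)

def with_speculation(task_times, backup_times):
    # For each task, the framework uses whichever of the two attempts
    # finishes first, so a single flaky machine no longer sets the
    # completion time of the whole phase.
    return max(min(a, b) for a, b in zip(task_times, backup_times))

primary = [10, 11, 12, 95]      # one straggler on flaky hardware
backups = [100, 100, 100, 14]   # backup attempts launched on healthy nodes
```

Here the map phase takes 95 time units without speculation but only 14 with it. Note that if the straggler were caused by key skew rather than hardware, the backup copy would process the same skewed data and finish no sooner, which is exactly why speculation does not help with Zipfian reduce stragglers.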

This necessarily involves copying intermediate data over the network, and therefore the process is commonly known as "shuffle and sort". A MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations, since each mapper may have intermediate output going to every reducer. Within each reducer, keys are processed in sorted order (which is how the "group by" is implemented). Note that the reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the execution framework cannot otherwise guarantee that all values associated with the same key have been gathered. This is an important departure from functional programming: in a fold operation, the aggregation function g is a function of the intermediate value and the next item in the list, which means that values can be lazily generated and aggregation can begin as soon as values are available. In contrast, the reducer in MapReduce receives all values associated with the same key at once. However, it is possible to start copying intermediate key-value pairs over the network to the nodes running the reducers as soon as each mapper finishes; this is a common optimization and implemented in Hadoop.

Error and fault handling. The MapReduce execution framework must accomplish all the tasks above in an environment where errors and faults are the norm, not the exception. Since MapReduce was explicitly designed around low-end commodity servers, the runtime must be especially resilient. In large clusters, disk failures are common [123] and RAM experiences more errors than one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.). And that's just hardware. No software is bug free; exceptions must be appropriately trapped, logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will contain corrupted data or records that are mangled beyond a programmer's imagination, resulting in errors that one would never think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.

2.4 PARTITIONERS AND COMBINERS

We have thus far presented a simplified view of MapReduce. There are two additional elements that complete the programming model: partitioners and combiners.

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the task to which an intermediate key-value pair must be copied.
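The contrast drawn above between a functional fold and a MapReduce reducer can be made concrete. The sketch below is ours: the fold consumes values one at a time as they become available, while the reducer is handed an iterator over all values for a key only after grouping is complete.

```python
from functools import reduce

values = [1, 2, 3, 4]

# Fold: the aggregation function g sees the running intermediate value and
# the next item in the list, so aggregation can begin as soon as the first
# value is generated.
fold_sum = reduce(lambda running, x: running + x, values, 0)

# MapReduce reducer: invoked once per key with all of that key's values,
# which cannot happen until shuffle and sort has gathered every value
# associated with the key.
def reducer(key, all_values):
    return (key, sum(all_values))
```

Both compute the same sum; the difference is purely about when computation may begin, which is why the barrier matters for latency.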

The simplest partitioner involves computing the hash value of the key and then taking the mod of that value with the number of reducers. This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function). Note, however, that the partitioner only considers the key and ignores the value; therefore, a roughly-even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer (since different keys may have different numbers of associated values). This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.

Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. We can motivate the need for combiners by considering the word count algorithm in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore, all these key-value pairs need to be copied across the network, and so the amount of intermediate data will be larger than the input collection itself. This is clearly inefficient. One solution is to perform local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the documents processed by the mapper. With this modification (assuming the maximum amount of local aggregation possible), the number of intermediate key-value pairs will be at most the number of unique words in the collection times the number of mappers (and typically far smaller because each mapper may not encounter every word).

The combiner in MapReduce supports such an optimization. One can think of combiners as "mini-reducers" that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input).12 In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners. In general, however, reducers and combiners are not interchangeable. In many cases, proper use of combiners can spell the difference between an impractical algorithm and an efficient algorithm. This topic will be discussed in Section 3.1, which focuses on various techniques for local aggregation.

12 A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at its discretion. In reality, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments. Section 3.2 discusses this in more detail.
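The savings from local aggregation can be demonstrated with a small sketch of our own (not Hadoop code). The combiner below sums counts within one mapper's output; because addition is associative and commutative, applying it zero, one, or many times never changes the final totals.

```python
from collections import Counter

def mapper(doc):
    # Word count mapper: one (word, 1) pair per token.
    return [(w, 1) for w in doc.split()]

def combiner(pairs):
    # "Mini-reducer" run on a single mapper's output: sum counts locally,
    # so each word is emitted at most once per mapper.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

docs = ["a a a b", "a b b b"]  # two input splits, one mapper each
without = [p for d in docs for p in mapper(d)]               # 8 pairs shuffled
with_comb = [p for d in docs for p in combiner(mapper(d))]   # 4 pairs shuffled
```

Here the combiner halves the intermediate data (8 pairs down to 4) while leaving the reducer's answer unchanged. An operation like the arithmetic mean would not survive this treatment unmodified, which is one way to see that reducers and combiners are not interchangeable in general.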

It suffices to say for now that a combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms.

The complete MapReduce model is shown in Figure 2.4. Output of the mappers is processed by the combiners, which perform local aggregation to cut down on the number of intermediate key-value pairs. The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.13 Therefore, a complete MapReduce job consists of code for the mapper, combiner, reducer, and partitioner, along with job configuration parameters. The execution framework handles everything else.

[Figure 2.4 diagram: input pairs (A, α) through (F, ζ) flow into four mappers; each mapper's output passes through a combiner and a partitioner; a "Shuffle and Sort: aggregate values by keys" stage groups the pairs; three reducers then produce (X, 5), (Y, 7), (Z, 9).]

Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers. Combiners can be viewed as "mini-reducers" in the map phase. Partitioners determine which reducer is responsible for a particular key.

13 In Hadoop, partitioners are actually executed before combiners, so while Figure 2.4 is conceptually accurate, it doesn't precisely describe the Hadoop implementation.
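The hash-mod partitioner described in this section can be sketched as follows. This is an illustration, not the Hadoop implementation; we use a stable hash (MD5) rather than Python's built-in hash, since the latter is randomized across processes and every mapper must route a given key to the same reducer.

```python
import hashlib

def partition(key, num_reducers):
    # Hash the key and take the remainder modulo the number of reducers.
    # Only the key is considered; values are ignored, so reducers may
    # still receive very different volumes of data.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 2), ("cherry", 1)]
shards = {r: [] for r in range(3)}
for key, value in pairs:
    shards[partition(key, 3)].append((key, value))
```

Every occurrence of "apple" lands in the same shard, which is precisely the guarantee the reducer relies on; the shards themselves may hold unequal numbers of pairs.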

2.5 THE DISTRIBUTED FILE SYSTEM

So far, we have mostly focused on the processing aspect of data-intensive processing, but it is important to recognize that without data, there is nothing to compute on. In high-performance computing (HPC) and many traditional cluster architectures, storage is viewed as a distinct and separate component from computation. Implementations vary widely, but network-attached storage (NAS) and storage area networks (SAN) are common; supercomputers often have dedicated subsystems for handling storage (separate nodes, and often even separate networks). Regardless of the details, the processing cycle remains the same at a high level: the compute nodes fetch input from storage, load the data into memory, process the data, and then write back the results (with perhaps intermediate checkpointing for long-running processes).

As dataset sizes increase, more compute capacity is required for processing. But as compute capacity grows, the link between the compute nodes and the storage becomes a bottleneck. At that point, one could invest in higher performance but more expensive networks (e.g., 10 gigabit Ethernet) or special-purpose interconnects such as InfiniBand (even more expensive). In most cases, this is not a cost-effective solution, as the price of networking equipment increases non-linearly with performance (e.g., a switch with ten times the capacity is usually more than ten times more expensive). Alternatively, one could abandon the separation of computation and storage as distinct components in a cluster. The distributed file system (DFS) that underlies MapReduce adopts exactly this approach. The Google File System (GFS) [57] supports Google's proprietary implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop. Although MapReduce doesn't necessarily require the distributed file system, it is difficult to realize many of the advantages of the programming model without a storage substrate that behaves much like the DFS.14

Of course, distributed file systems are not new [74, 32, 7, 147, 133]. The MapReduce distributed file system builds on previous work but is specifically adapted to large-data processing workloads, and therefore departs from previous architectures in certain respects (see discussion by Ghemawat et al. [57] in the original GFS paper). The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster. Blocking data, of course, is not a new idea, but DFS blocks are significantly larger than block sizes in typical single-machine file systems (64 MB by default). The distributed file system adopts a master-slave architecture in which the master maintains the file namespace (metadata, directory structure, file to block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks.

14 However, there is evidence that existing POSIX-based distributed cluster file systems (e.g., GPFS or PVFS) can serve as a replacement for HDFS, when properly tuned or modified for MapReduce workloads [146, 6]. This, however, remains an experimental use case.
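The block-and-replicate idea reduces to simple arithmetic. The sketch below uses the 64 MB default block size and the default replication factor of three mentioned later in this section; the function names are ours.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default DFS block size

def num_blocks(file_size_bytes):
    # A file occupies ceil(size / block_size) blocks; the last block may
    # be only partially full.
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

def replica_count(file_size_bytes, replication_factor=3):
    # Each block is stored on replication_factor different datanodes.
    return num_blocks(file_size_bytes) * replication_factor
```

A 1 GB file thus spans 16 blocks, stored as 48 physical block replicas spread across the cluster's local disks.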

In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers. In Hadoop, the same roles are filled by the namenode and datanodes, respectively.15 This book adopts the Hadoop terminology, although for most basic file operations GFS and HDFS work much the same way. The architecture of HDFS is shown in Figure 2.5, redrawn from a similar diagram describing GFS [57]. In HDFS, an application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored. In response to the client request, the namenode returns the relevant block id and the location where the block is held (i.e., which datanode). The client then contacts the datanode to retrieve the data. Blocks are themselves stored on standard single-machine file systems, so HDFS lies on top of the standard OS stack (e.g., Linux). An important feature of the design is that data is never moved through the namenode. Instead, all data transfer occurs directly between clients and datanodes; communications with the namenode only involve transfer of metadata.

[Figure 2.5 diagram: an application's HDFS client sends (file name, block id) requests to the HDFS namenode, which holds the file namespace (e.g., /foo/bar mapping to block 3df2) and replies with (block id, block location); the client then requests (block id, byte range) directly from an HDFS datanode, which returns block data from its local Linux file system. The namenode also sends instructions to datanodes and receives datanode state.]

Figure 2.5: The architecture of HDFS. The namenode (master) is responsible for maintaining the file namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data.

By default, HDFS stores three separate copies of each data block to ensure reliability, availability, and performance. In large clusters, the three replicas are spread across different physical racks, so HDFS is resilient towards two common failure scenarios: individual datanode crashes and failures in networking equipment that bring an entire rack offline. Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality.

15 To be precise, namenode and datanode may refer to physical machines in a cluster, or they may refer to daemons running on those machines providing the relevant services.
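The read protocol of Figure 2.5 can be modeled with two toy classes of our own devising. The point of the sketch is the division of labor: the namenode answers only metadata queries, while file data flows directly from a datanode to the client.

```python
class Namenode:
    # Holds the file namespace: which block backs each file, and where
    # that block lives. No file data ever passes through this object.
    def __init__(self, namespace):
        self.namespace = namespace  # {filename: (block_id, datanode_name)}

    def lookup(self, filename):
        return self.namespace[filename]

class Datanode:
    # Serves raw block data from its local storage.
    def __init__(self, blocks):
        self.blocks = blocks  # {block_id: bytes}

    def read(self, block_id):
        return self.blocks[block_id]

def hdfs_read(namenode, datanodes, filename):
    # Step 1: ask the namenode where the data is stored (metadata only).
    block_id, datanode_name = namenode.lookup(filename)
    # Step 2: fetch the block directly from that datanode.
    return datanodes[datanode_name].read(block_id)

nn = Namenode({"/foo/bar": ("3df2", "dn1")})
dns = {"dn1": Datanode({"3df2": b"user data"})}
```

Keeping the master on the metadata path only is what allows a single namenode to serve a large cluster without becoming a data-transfer bottleneck.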

To create a new file and write data to HDFS, the application client first contacts the namenode, which updates the file namespace after checking permissions and making sure the file doesn't already exist. The namenode allocates a new block on a suitable datanode, and the application is directed to stream data directly to it. From the initial datanode, data is further propagated to additional replicas. In the most recent release of Hadoop as of this writing (release 0.20.2), files are immutable; they cannot be modified after creation. There are current plans to officially support file appends in the near future, which is a feature already present in GFS.

In summary, the HDFS namenode has the following responsibilities:

• Namespace management. The namenode is responsible for maintaining the file namespace, which includes metadata, directory structure, file to block mapping, location of blocks, and access permissions. These data are held in memory for fast access and all mutations are persistently logged.

• Coordinating file operations. The namenode directs application clients to datanodes for read operations, and allocates blocks on suitable datanodes for write operations. All data transfers occur directly between clients and datanodes. When a file is deleted, HDFS does not immediately reclaim the available physical storage; rather, blocks are lazily garbage collected.

• Maintaining overall health of the file system. The namenode is in periodic contact with the datanodes via heartbeat messages to ensure the integrity of the system and proper replication of all the blocks. If the namenode observes that a data block is under-replicated (fewer copies are stored on datanodes than the desired replication factor, e.g., due to disk or machine failures or to connectivity losses caused by networking equipment failures), it will direct the creation of new replicas;16 if there are too many replicas (e.g., a repaired node rejoins the cluster), extra copies are discarded. Finally, the namenode is also responsible for rebalancing the file system.17 During the course of normal operations, certain datanodes may end up holding more blocks than others; rebalancing involves moving blocks from datanodes with more blocks to datanodes with fewer blocks. This leads to better load balancing and more even disk utilization.

16 Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.
17 In Hadoop, this is a manually-invoked process.
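The replication-health check described in the last bullet point reduces to a simple comparison per block. The sketch below is our own illustration of the bookkeeping, not HDFS code.

```python
def replication_actions(block_replicas, desired=3):
    # block_replicas maps each block id to the number of live replicas the
    # namenode currently knows about (learned from datanode heartbeats).
    # Under-replicated blocks get new copies scheduled; over-replicated
    # blocks have extras marked for discard.
    actions = {}
    for block_id, count in block_replicas.items():
        if count < desired:
            actions[block_id] = ("create", desired - count)
        elif count > desired:
            actions[block_id] = ("discard", count - desired)
    return actions

state = {"b1": 3, "b2": 1, "b3": 4}  # b2 lost replicas; b3 has an extra
```

In the real system the namenode only issues these instructions; the block copies themselves move directly from datanode to datanode.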

Since GFS and HDFS were specifically designed to support Google's proprietary and the open-source implementation of MapReduce, respectively, they were designed with a number of assumptions about the operational environment, which in turn influenced the design of the systems. Understanding these choices is critical to designing effective MapReduce algorithms:

• The file system stores a relatively modest number of large files. The definition of "modest" varies by the size of the deployment, but in HDFS multi-gigabyte files are common (and even encouraged). There are several reasons why lots of small files are to be avoided. Since the namenode must hold all file metadata in memory, this presents an upper bound on both the number of files and blocks that can be supported.18 Large multi-block files represent a more efficient use of namenode memory than many single-block files (each of which consumes less space than a single block size). In addition, mappers in a MapReduce job use individual files as a basic unit for splitting input data. At present, there is no default mechanism in Hadoop that allows a mapper to process multiple files. As a result, mapping over many small files will yield as many map tasks as there are files. This results in two potential problems: first, the startup costs of mappers may become significant compared to the time spent actually processing input key-value pairs; second, this may result in an excessive amount of across-the-network copy operations during the "shuffle and sort" phase (recall that a MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations).

• Workloads are batch oriented, dominated by long streaming reads and large sequential writes. As a result, high sustained bandwidth is more important than low latency. This exactly describes the nature of MapReduce jobs, which are batch operations on large amounts of data. Due to the common-case workload, both HDFS and GFS do not implement any form of data caching.19

• Applications are aware of the characteristics of the distributed file system. Neither HDFS nor GFS present a general POSIX-compliant API, but rather support only a subset of possible file operations. This simplifies the design of the distributed file system, and in essence pushes part of the data management onto the end application. One rationale for this decision is that each application knows best how to handle data specific to that application.

18 According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes of memory on the namenode.
19 However, since the distributed file system is built on top of a standard operating system such as Linux, there is still OS-level caching.
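The small-files problem in the first bullet above can be quantified with the roughly 150 bytes of namenode memory per block cited in footnote 18. The sketch below is back-of-the-envelope arithmetic of our own, using the 64 MB default block size.

```python
BYTES_PER_BLOCK_ENTRY = 150      # approximate namenode memory per block
BLOCK_SIZE = 64 * 1024 * 1024    # 64 MB default HDFS block size

def namenode_block_memory(total_bytes, file_size):
    # Memory consumed by block entries when total_bytes of data is stored
    # as files of a given size: every file occupies at least one block
    # entry, so many small files cost far more than a few large ones.
    num_files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))
    return num_files * blocks_per_file * BYTES_PER_BLOCK_ENTRY

one_tb = 1024 ** 4
small = namenode_block_memory(one_tb, 1024 * 1024)  # 1 TB as 1 MB files
large = namenode_block_memory(one_tb, 1024 ** 3)    # 1 TB as 1 GB files
```

Under this model, a terabyte stored as 1 MB files consumes 64 times the namenode memory of the same terabyte stored as 1 GB files, and the 1 MB files also yield over a million map tasks instead of about sixteen thousand.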

• The file system is deployed in an environment of cooperative users. There is no discussion of security in the original GFS paper, but HDFS explicitly assumes a datacenter environment where only authorized users have access. File permissions in HDFS are only meant to prevent unintended operations and can be easily circumvented.20

• The system is built from unreliable but inexpensive commodity components. As a result, failures are the norm rather than the exception. HDFS is designed around a number of self-monitoring and self-healing mechanisms to robustly cope with common failure modes.

Finally, some discussion is necessary to understand the single-master design of HDFS and GFS. It has been demonstrated that in large-scale distributed systems, simultaneously providing consistency, availability, and partition tolerance is impossible; this is Brewer's so-called CAP Theorem [58]. Since partitioning is unavoidable in large-data systems, the real tradeoff is between consistency and availability. A single-master design trades availability for consistency and significantly simplifies implementation. If the master (HDFS namenode or GFS master) goes down, the entire file system becomes unavailable, which trivially guarantees that the file system will never be in an inconsistent state. An alternative design might involve multiple masters that jointly manage the file namespace; such an architecture would increase availability (if one goes down, another can step in) at the cost of consistency, not to mention requiring a more complex implementation (cf. [4, 105]).

The single-master design of GFS and HDFS is a well-known weakness, since if the master goes offline, the entire file system and all MapReduce jobs running on top of it will grind to a halt. This weakness is mitigated in part by the lightweight nature of file system operations. Recall that no data is ever moved through the namenode and that all communication between clients and the namenode involves only metadata. In practice, the namenode rarely is the bottleneck, and for the most part avoids load-induced crashes. Furthermore, this single point of failure is not as severe a limitation as it may appear: with diligent monitoring of the namenode, mean times between failure measured in months are not uncommon for production deployments. Finally, the Hadoop community is well aware of this problem and has developed several reasonable workarounds, for example, a warm standby namenode that can be quickly switched over when the primary namenode fails. The open source environment and the fact that many organizations already depend on Hadoop for production systems virtually guarantees that more effective solutions will be developed over time.

20 However, there are existing plans to integrate Kerberos into Hadoop/HDFS.

2.6 HADOOP CLUSTER ARCHITECTURE

Putting everything together, the architecture of a complete Hadoop cluster is shown in Figure 2.6. The HDFS namenode runs the namenode daemon. The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job. The jobtracker monitors the progress of running MapReduce jobs and is responsible for coordinating the execution of the mappers and reducers. Typically, these services run on two separate machines, although in smaller clusters they are often co-located. The bulk of a Hadoop cluster consists of slave nodes (only three of which are shown in the figure) that run both a tasktracker, which is responsible for actually running user code, and a datanode daemon, for serving HDFS data.

Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce tasks. Tasktrackers periodically send heartbeat messages to the jobtracker, and these heartbeats also double as a vehicle for task allocation. If a tasktracker is available to run tasks (in Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker heartbeat contains task allocation information. The number of reduce tasks is equal to the number of reducers specified by the programmer. The number of map tasks, on the other hand, depends on many factors: the number of mappers specified by the programmer serves as a hint to the execution framework, but the actual number of tasks depends on both the number of input files and the number of HDFS data blocks occupied by those files. Each map task is assigned a sequence of input key-value pairs, called an input split in Hadoop. Input splits are computed automatically and the execution framework strives to align them to HDFS block boundaries so that each map task is associated with a single data block. In scheduling map tasks, the jobtracker tries to take advantage of data locality—if possible, map tasks are scheduled on the slave node that holds the input split, so that the mapper will be processing local data. The alignment of input splits with HDFS block boundaries simplifies task scheduling. If it is not possible to run a map task on local data, it becomes necessary to stream input key-value pairs across the network. Since large clusters are organized into racks, with far greater intra-rack bandwidth than inter-rack bandwidth, the execution framework strives to at least place map tasks on a rack which has a copy of the data block.

Although conceptually in MapReduce one can think of the mapper being applied to all input key-value pairs and the reducer being applied to all values associated with the same key, actual job execution is a bit more complex. In Hadoop, mappers are Java objects with a Map method (among others). A mapper object is instantiated for every map task by the tasktracker. The life-cycle of this object begins with instantiation, where a hook is provided in the API to run programmer-specified code. This means that mappers can read in "side data" (dictionaries, static data sources, etc.), providing an opportunity to load state. After initialization, the Map method is called (by the execution framework) on all key-value pairs in the input split. Since these method calls occur in the context of the same Java object, it is possible to preserve state across multiple input key-value pairs within the same map task—this is an important property to exploit in the design of MapReduce algorithms, as we will see in the next chapter. After all key-value pairs in the input split have been processed, the mapper object provides an opportunity to run programmer-specified termination code. This, too, will be important in the design of MapReduce algorithms.

The actual execution of reducers is similar to that of the mappers. Each reducer object is instantiated for every reduce task. The Hadoop API provides hooks for programmer-specified initialization and termination code. After initialization, for each intermediate key in the partition (defined by the partitioner), the execution framework repeatedly calls the Reduce method with an intermediate key and an iterator over all values associated with that key. The programming model also guarantees that intermediate keys will be presented to the Reduce method in sorted order. Since this occurs in the context of a single object, it is possible to preserve state across multiple intermediate keys (and associated values) within a single reduce task. Once again, this property is critical in the design of MapReduce algorithms and will be discussed in the next chapter.
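The object life-cycle just described can be made concrete with a short sketch. This is plain Python rather than Hadoop, and the harness and names (run_map_task, pairs_seen) are invented for illustration, but the three hooks mirror the Initialize/Map/Close structure of the Hadoop API:

```python
# A minimal sketch of the mapper life-cycle: the framework instantiates one
# mapper object per map task, calls an initialization hook, applies Map to
# every key-value pair in the input split, and finally calls a termination
# hook. The "framework" here is just a loop standing in for Hadoop.

class Mapper:
    def initialize(self):
        # Hook for loading side data (dictionaries, static sources, etc.)
        self.pairs_seen = 0          # state preserved across Map calls

    def map(self, key, value, emit):
        self.pairs_seen += 1
        emit(value, 1)

    def close(self, emit):
        # Hook for programmer-specified termination code
        emit("pairs_seen", self.pairs_seen)

def run_map_task(mapper, input_split):
    """Stand-in for the tasktracker running one map task."""
    output = []
    emit = lambda k, v: output.append((k, v))
    mapper.initialize()
    for key, value in input_split:
        mapper.map(key, value, emit)
    mapper.close(emit)
    return output

pairs = run_map_task(Mapper(), [(0, "a"), (1, "b"), (2, "a")])
# pairs == [("a", 1), ("b", 1), ("a", 1), ("pairs_seen", 3)]
```

Because all three methods run on the same object, state set in one call (here, pairs_seen) is visible in later calls, which is exactly the property exploited in the next chapter.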

2.7 SUMMARY

This chapter provides a basic overview of the MapReduce programming model, starting with its roots in functional programming and continuing with a description of mappers, reducers, partitioners, and combiners. Significant attention is also given to the underlying distributed file system, which is a tightly-integrated component of the MapReduce environment. Given this basic understanding, we now turn our attention to the design of MapReduce algorithms.
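The model summarized above is small enough to capture in a few lines. The following is a toy Python simulation, not Hadoop: map_fn and reduce_fn are hypothetical stand-ins for user code, and an in-memory sort plus groupby stands in for the shuffle and sort stage:

```python
# Toy simulation of the MapReduce programming model: apply the mapper to
# every input pair, group and sort intermediate pairs by key (the "shuffle
# and sort" stage), then apply the reducer to each key with its values.

from itertools import groupby
from operator import itemgetter

def map_fn(docid, doc):
    for term in doc.split():
        yield (term, 1)

def reduce_fn(term, values):
    yield (term, sum(values))

def run_mapreduce(inputs, mapper, reducer):
    intermediate = [pair for k, v in inputs for pair in mapper(k, v)]
    intermediate.sort(key=itemgetter(0))          # shuffle and sort
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [v for _, v in group]))
    return output

result = run_mapreduce([(0, "a dog and a dog"), (1, "a cat")], map_fn, reduce_fn)
# result == [("a", 3), ("and", 1), ("cat", 1), ("dog", 2)]
```

Everything outside map_fn and reduce_fn is handled by the framework, which is precisely the division of labor the real execution framework enforces at cluster scale.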

CHAPTER 3

MapReduce Algorithm Design

A large part of the power of MapReduce comes from its simplicity: in addition to preparing the input data, the programmer needs only to implement the mapper, the reducer, and optionally, the combiner and the partitioner. All other aspects of execution are handled transparently by the execution framework—on clusters ranging from a single node to a few thousand nodes, over datasets ranging from gigabytes to petabytes. However, this also means that any conceivable algorithm that a programmer wishes to develop must be expressed in terms of a small number of rigidly-defined components that must fit together in very specific ways. It may not appear obvious how a multitude of algorithms can be recast into this programming model. Nevertheless, the purpose of this chapter is to provide, primarily through examples, a guide to MapReduce algorithm design. These examples illustrate what can be thought of as "design patterns" for MapReduce, which instantiate arrangements of components and specific techniques designed to handle frequently-encountered situations across a variety of problem domains. Two of these design patterns are used in the scalable inverted indexing algorithm we'll present later in Chapter 4; concepts presented here will show up again in Chapter 5 (graph processing) and Chapter 6 (expectation-maximization algorithms).

Synchronization is perhaps the most tricky aspect of designing MapReduce algorithms (or, for that matter, parallel and distributed algorithms in general). Other than embarrassingly-parallel problems, processes running on separate nodes in a cluster must, at some point in time, come together—for example, to distribute partial results from the nodes that produced them to the nodes that will consume them. Within a single MapReduce job, there is only one opportunity for cluster-wide synchronization—during the shuffle and sort stage where intermediate key-value pairs are copied from the mappers to the reducers and grouped by key. Beyond that, mappers and reducers run in isolation without any mechanisms for direct communication. Furthermore, the programmer has little control over many aspects of execution, for example:

• Where a mapper or reducer runs (i.e., on which node in the cluster).

• When a mapper or reducer begins or finishes.

• Which input key-value pairs are processed by a specific mapper.

• Which intermediate key-value pairs are processed by a specific reducer.

However, the programmer does have a number of techniques for controlling execution and managing the flow of data in MapReduce. In summary, they are:

1. The ability to construct complex data structures as keys and values to store and communicate partial results.

2. The ability to execute user-specified initialization code at the beginning of a map or reduce task, and the ability to execute user-specified termination code at the end of a map or reduce task.

3. The ability to preserve state in both mappers and reducers across multiple input or intermediate keys.

4. The ability to control the sort order of intermediate keys, and therefore the order in which a reducer will encounter particular keys.

5. The ability to control the partitioning of the key space, and therefore the set of keys that will be encountered by a particular reducer.

It is important to realize that many algorithms cannot be easily expressed as a single MapReduce job. One must often decompose complex algorithms into a sequence of jobs, which requires orchestrating data so that the output of one job becomes the input to the next. Many algorithms are iterative in nature, requiring repeated execution until some convergence criterion is met—graph algorithms in Chapter 5 and expectation-maximization algorithms in Chapter 6 behave in exactly this way. Often, the convergence check itself cannot be easily expressed in MapReduce. The standard solution is an external (non-MapReduce) program that serves as a "driver" to coordinate MapReduce iterations.

This chapter explains how various techniques to control code execution and data flow can be applied to design algorithms in MapReduce. The focus is both on scalability—ensuring that there are no inherent bottlenecks as algorithms are applied to increasingly larger datasets—and efficiency—ensuring that algorithms do not needlessly consume resources and thereby reducing the cost of parallelization. The gold standard, of course, is linear scalability: an algorithm running on twice the amount of data should take only twice as long; similarly, an algorithm running on twice the number of nodes should take only half as long.

The chapter is organized as follows:

• Section 3.1 introduces the important concept of local aggregation in MapReduce and strategies for designing efficient algorithms that minimize the amount of partial results that need to be copied across the network. The proper use of combiners is discussed in detail, as well as the "in-mapper combining" design pattern.

• Section 3.2 uses the example of building word co-occurrence matrices on large text corpora to illustrate two common design patterns, which we dub "pairs" and "stripes". These two approaches are useful in a large class of problems that require keeping track of joint events across a large number of observations.

• Section 3.3 shows how co-occurrence counts can be converted into relative frequencies using a pattern known as "order inversion". The sequencing of computations in the reducer can be recast as a sorting problem, where pieces of intermediate data are sorted into exactly the order that is required to carry out a series of computations. Often, a reducer needs to compute an aggregate statistic on a set of elements before individual elements can be processed. This may seem counter-intuitive: how can we compute an aggregate statistic on a set of elements before encountering elements of that set? As it turns out, clever sorting of special key-value pairs enables exactly this. Normally, this would require two passes over the data, but with the "order inversion" design pattern, the aggregate statistic can be computed in the reducer before the individual elements are encountered.

• Section 3.4 provides a general solution to secondary sorting, which is the problem of sorting values associated with a key in the reduce phase. We call this technique "value-to-key conversion".

• Section 3.5 covers the topic of performing joins on relational datasets and presents three different approaches: reduce-side, map-side, and memory-backed joins.

3.1 LOCAL AGGREGATION

In the context of data-intensive distributed processing, the single most important aspect of synchronization is the exchange of intermediate results, from the processes that produced them to the processes that will ultimately consume them. In a cluster environment, with the exception of embarrassingly-parallel problems, this necessarily involves transferring data over the network. Furthermore, in Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are relatively expensive compared to other operations, reductions in the amount of intermediate data translate into increases in algorithmic efficiency. In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms. Through use of the combiner and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.

3.1.1 COMBINERS AND IN-MAPPER COMBINING

We illustrate various techniques for local aggregation using the simple word count example presented in Section 2.2. For convenience, Figure 3.1 repeats the pseudo-code of the basic algorithm, which is quite simple: the mapper emits an intermediate key-value pair for each term observed, with the term itself as the key and a value of one; reducers sum up the partial counts to arrive at the final count.

class Mapper
    method Map(docid a, doc d)
        for all term t ∈ doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2, . . .])
        sum ← 0
        for all count c ∈ counts [c1, c2, . . .] do
            sum ← sum + c
        Emit(term t, count sum)

Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from Figure 2.3).

The first technique for local aggregation is the combiner, already discussed in Section 2.4. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers—recall that they can be understood as "mini-reducers" that process the output of mappers. In this example, the combiners aggregate term counts across the documents processed by each map task. This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network—from the order of the total number of terms in the collection to the order of the number of unique terms in the collection. (More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most m × V intermediate key-value pairs, where m is the number of mappers and V is the vocabulary size, i.e., the number of unique terms in the collection, since every term could have been observed in every mapper. However, there are two additional factors to consider. Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers; for example, terms that occur only once will by definition be observed by only one mapper. On the other hand, combiners in Hadoop are treated as optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial aggregation.)

An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is modified but the reducer remains the same as in Figure 3.1 and therefore is not repeated). An associative array (i.e., Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term in the document, this version emits a key-value pair for each unique term in the document. Given that some words appear frequently within a document (for example, a document about dogs is likely to have many occurrences of the word "dog"), this can yield substantial savings in the number of intermediate key-value pairs emitted, especially for long documents.
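The savings from aggregating on a per-document basis can be checked with a toy computation (plain Python, not Hadoop; the documents are made up): count the intermediate pairs emitted per term occurrence versus per unique term per document, and verify that the reduce-side totals are unchanged.

```python
# Compare intermediate data volume: one pair per term occurrence (basic
# algorithm) versus one pair per unique term per document (per-document
# aggregation). Both yield identical reduce-side totals.

from collections import Counter

docs = ["a dog and a dog", "a cat", "the dog barked"]

# Basic mapper: one (term, 1) pair per occurrence.
no_aggregation = [(t, 1) for d in docs for t in d.split()]

# Per-document aggregation: one (term, count) pair per unique term.
per_document = [(t, c) for d in docs for t, c in Counter(d.split()).items()]

def reduce_counts(pairs):
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

assert reduce_counts(no_aggregation) == reduce_counts(per_document)
assert len(per_document) < len(no_aggregation)   # 8 pairs vs. 10 pairs here
```

On real collections with long, repetitive documents the gap between the two pair counts is far larger than in this tiny example.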

class Mapper
    method Map(docid a, doc d)
        H ← new AssociativeArray
        for all term t ∈ doc d do
            H{t} ← H{t} + 1    ⊲ Tally counts for entire document
        for all term t ∈ H do
            Emit(term t, count H{t})

Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an associative array to aggregate term counts on a per-document basis. Reducer is the same as in Figure 3.1.

This basic idea can be taken one step further, as illustrated in the variant of the word count algorithm in Figure 3.3 (once again, only the mapper is modified). The workings of this algorithm critically depend on the details of how map and reduce tasks in Hadoop are executed, discussed in Section 2.6. Recall, a (Java) mapper object is created for each map task, which is responsible for processing a block of input key-value pairs. Prior to processing any input key-value pairs, the mapper's Initialize method is called, which is an API hook for user-specified code. In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudo-code. Recall that this API hook provides an opportunity to execute user-specified code after the Map method has been applied to all input key-value pairs of the input data split to which the map task was assigned.

With this technique, we are in essence incorporating combiner functionality directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation are already exploited. (Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also, when merging intermediate key-value pairs from different map tasks; in practice it makes almost no difference either way.) This is a sufficiently common design pattern in MapReduce that it's worth giving it a name, "in-mapper combining", so that we can refer to the pattern more conveniently throughout the book. We'll see later on how this pattern can be applied to a variety of problems. There are two main advantages to using this design pattern. First, it provides control over when local aggregation occurs and how it exactly takes place. In contrast, the semantics of the combiner is underspecified in MapReduce.

class Mapper
    method Initialize
        H ← new AssociativeArray
    method Map(docid a, doc d)
        for all term t ∈ doc d do
            H{t} ← H{t} + 1    ⊲ Tally counts across documents
    method Close
        for all term t ∈ H do
            Emit(term t, count H{t})

Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demonstrates the "in-mapper combining" design pattern. Reducer is the same as in Figure 3.1.

Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all. The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase). In some cases (although not in this particular example), such indeterminism is unacceptable, which is exactly why programmers often choose to perform their own local aggregation in the mappers.

Second, in-mapper combining will typically be more efficient than using actual combiners. One reason for this is the additional overhead associated with actually materializing the key-value pairs. Combiners reduce the amount of intermediate data that is shuffled across the network, but don't actually reduce the number of key-value pairs that are emitted by the mappers in the first place. With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document basis, only to be "compacted" by the combiners. This process involves unnecessary object creation and destruction (garbage collection takes time), and furthermore, object serialization and deserialization (when intermediate key-value pairs fill the in-memory buffer holding map outputs and need to be temporarily spilled to disk). In contrast, with in-mapper combining, the mappers will generate only those key-value pairs that need to be shuffled across the network to the reducers.

There are, however, drawbacks to the in-mapper combining pattern. First, it breaks the functional programming underpinnings of MapReduce, since state is being preserved across multiple input key-value pairs. Ultimately, this isn't a big deal, since pragmatic concerns for efficiency often trump theoretical "purity", but there are practical consequences as well. Preserving state across multiple input instances means that algorithmic behavior may depend on the order in which input key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets in the general case (although the correctness of in-mapper combining for word count is easy to demonstrate).
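As an aside, the pattern of Figure 3.3 renders directly into a few lines of executable Python (the class name and harness are invented; this is not Hadoop code):

```python
# In-mapper combining: the mapper accumulates term counts in an associative
# array across all documents in its input split and emits nothing until the
# Close hook, so only aggregated pairs are ever materialized.

class InMapperCombiningWordCount:
    def initialize(self):
        self.counts = {}                       # H <- new AssociativeArray

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] = self.counts.get(term, 0) + 1

    def close(self, emit):
        for term, count in self.counts.items():
            emit(term, count)

mapper = InMapperCombiningWordCount()
mapper.initialize()
for docid, doc in enumerate(["a dog and a dog", "a cat"]):
    mapper.map(docid, doc)

emitted = []
mapper.close(lambda t, c: emitted.append((t, c)))
# One pair per unique term in the whole split: ("a", 3), ("dog", 2), ...
```

Note that the sketch emits exactly one pair per unique term in the split, regardless of how many documents the split contains, which is the best any combiner could achieve.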

Second, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split. In the word count example, the memory footprint is bound by the vocabulary size, since it is theoretically possible that a mapper encounters every term in the collection. Heap's Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of the collection size—the somewhat surprising fact is that the vocabulary size never stops growing. (In more detail, Heap's Law relates the vocabulary size V to the collection size as follows: V = kT^b, where T is the number of tokens in the collection. Typical values of the parameters k and b are: 30 ≤ k ≤ 100 and b ∼ 0.5 ([101], p. 81).) Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory. (A few more details: what matters is that the partial term counts encountered within a particular input split fit into memory. However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map tasks, in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network.)

One common solution to limiting memory usage when using the in-mapper combining technique is to "block" input key-value pairs and "flush" in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This is straightforwardly implemented with a counter variable that keeps track of the number of input key-value pairs that have been processed. As an alternative, the mapper could keep track of its own memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold. In both approaches, either the block size or the memory usage threshold needs to be determined empirically: with too large a value, the mapper may run out of memory, but with too small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop physical memory is split between multiple tasks that may be running on a node concurrently; these tasks are all competing for finite resources, but since the tasks are not aware of each other, it is difficult to coordinate resource consumption effectively. In practice, one often encounters diminishing returns in performance gains with increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer size (personal communication, Jeff Dean).

In MapReduce algorithms, the extent to which efficiency can be increased through local aggregation depends on the size of the intermediate key space, the distribution of keys themselves, and the number of key-value pairs that are emitted by each individual map task. Opportunities for aggregation, after all, come from having multiple values associated with the same key (whether one uses combiners or employs the in-mapper combining pattern). In the word count example, local aggregation is effective because many words are encountered multiple times within a map task.
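A sketch of the block-and-flush variant described above, in plain Python (the flush interval n = 2 is an arbitrary illustrative value; a real mapper would use a much larger block size or a memory threshold):

```python
# Block-and-flush in-mapper combining: partial counts are emitted and the
# in-memory structure cleared after every n input pairs, bounding memory
# at the cost of less aggregation. Duplicate keys across flushes are fine:
# the reducer sums partial counts anyway.

class BlockAndFlushMapper:
    def __init__(self, flush_every=2):
        self.flush_every = flush_every
        self.counts = {}
        self.seen = 0                 # counter over input key-value pairs

    def map(self, docid, doc, emit):
        for term in doc.split():
            self.counts[term] = self.counts.get(term, 0) + 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush(emit)          # emit partial results, free memory

    def flush(self, emit):
        for term, count in self.counts.items():
            emit(term, count)
        self.counts = {}

    def close(self, emit):
        self.flush(emit)              # emit whatever remains

emitted = []
m = BlockAndFlushMapper(flush_every=2)
for docid, doc in enumerate(["a b a", "b c", "a"]):
    m.map(docid, doc, lambda t, c: emitted.append((t, c)))
m.close(lambda t, c: emitted.append((t, c)))
# emitted == [("a", 2), ("b", 2), ("c", 1), ("a", 1)]
```

Summing the partial counts per key recovers exactly the totals of the unflushed version, so the optimization is transparent to the reducer.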

Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that result from a highly-skewed (e.g., Zipfian) distribution of values associated with intermediate keys. In our word count example, we do not filter frequently-occurring words: therefore, without local aggregation, the reducer that's responsible for computing the count of 'the' will have a lot more work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently-occurring terms, which alleviates the reduce straggler problem.

3.1.2 ALGORITHMIC CORRECTNESS WITH LOCAL AGGREGATION

Although use of combiners can yield dramatic reductions in algorithm running time, care must be taken in applying them. Since combiners in Hadoop are viewed as optional optimizations, the correctness of the algorithm cannot depend on computations performed by the combiner or depend on them even being run at all. In any MapReduce program, the reducer input key-value type must match the mapper output key-value type: this implies that the combiner input and output key-value types must match the mapper output key-value type (which is the same as the reducer input key-value type). In cases where the reduce computation is both commutative and associative, the reducer can also be used (unmodified) as the combiner (as is the case with the word count example). In the general case, however, combiners and reducers are not interchangeable.

Consider a simple example: we have a large dataset where input keys are strings and input values are integers, and we wish to compute the mean of all integers associated with the same key (rounded to the nearest integer). A real-world example might be a large user log from a popular website, where keys represent user ids and values represent some measure of activity such as elapsed time for a particular session—the task would correspond to computing the mean session length on a per-user basis, which would be useful for understanding user demographics. Figure 3.4 shows the pseudo-code of a simple algorithm for accomplishing this task that does not involve combiners. We use an identity mapper, which simply passes all input key-value pairs to the reducers (appropriately grouped and sorted). The reducer keeps track of the running sum and the number of integers encountered. This information is used to compute the mean once all values are processed. The mean is then emitted as the output value in the reducer (with the input string as the key).

This algorithm will indeed work, but suffers from the same drawbacks as the basic word count algorithm in Figure 3.1: it requires shuffling all key-value pairs from mappers to reducers across the network, which is highly inefficient. Furthermore, the reducer cannot be used as a combiner in this case. Consider what would happen if we did: the combiner would compute the mean of an arbitrary subset of values associated with the same key, and the reducer would compute the mean of those values.

class Mapper
    method Map(string t, integer r)
        Emit(string t, integer r)

class Reducer
    method Reduce(string t, integers [r1, r2, . . .])
        sum ← 0
        cnt ← 0
        for all integer r ∈ integers [r1, r2, . . .] do
            sum ← sum + r
            cnt ← cnt + 1
        ravg ← sum/cnt
        Emit(string t, integer ravg)

Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values associated with the same key.

As a concrete example, we know that:

Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

In general, the mean of means of arbitrary subsets of a set of numbers is not the same as the mean of the set of numbers. Therefore, this approach would not produce the correct result. (There is, in fact, one special case in which using reducers as combiners would produce the correct result: if each combiner computed the mean of equal-size subsets of the values. However, since such fine-grained control over the combiners is impossible in MapReduce, such a scenario is highly unlikely.)

So how might we properly take advantage of combiners? An attempt is shown in Figure 3.5. The mapper remains the same, but we have added a combiner that partially aggregates results by computing the numeric components necessary to arrive at the mean. The combiner receives each string and the associated list of integer values, from which it computes the sum of those values and the number of integers encountered (i.e., the count). The sum and count are packaged into a pair, and emitted as the output of the combiner, with the same string as the key. In the reducer, pairs of partial sums and counts can be aggregated to arrive at the mean.

Up until now, all keys and values in our algorithms have been primitives (string, integers, etc.). However, there are no prohibitions in MapReduce against more complex types (in Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro), and, in fact, this represents a key technique in MapReduce algorithm design that we introduced at the beginning of this chapter. We will frequently encounter complex keys and values throughout the rest of this book.
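The mean-of-means inequality discussed above is easy to check numerically (a quick Python aside):

```python
# Mean of means over arbitrary subsets does not recover the true mean;
# it only would for equal-size subsets.

def mean(xs):
    return sum(xs) / len(xs)

true_mean = mean([1, 2, 3, 4, 5])                      # 3.0
mean_of_means = mean([mean([1, 2]), mean([3, 4, 5])])  # (1.5 + 4.0) / 2 = 2.75
assert mean_of_means != true_mean

# With equal-size subsets the identity does hold:
assert mean([mean([1, 2]), mean([3, 4])]) == mean([1, 2, 3, 4])
```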

class Mapper
    method Map(string t, integer r)
        Emit(string t, integer r)

class Combiner
    method Combine(string t, integers [r1, r2, . . .])
        sum ← 0
        cnt ← 0
        for all integer r ∈ integers [r1, r2, . . .] do
            sum ← sum + r
            cnt ← cnt + 1
        Emit(string t, pair (sum, cnt))    ⊲ Separate sum and count

class Reducer
    method Reduce(string t, pairs [(s1, c1), (s2, c2) . . .])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
            sum ← sum + s
            cnt ← cnt + c
        ravg ← sum/cnt
        Emit(string t, integer ravg)

Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute the mean of values associated with each key. The mismatch between combiner input and output key-value types violates the MapReduce programming model.

Unfortunately, this algorithm will not work. Recall that combiners must have the same input and output key-value type, which also must be the same as the mapper output type and the reducer input type. This is clearly not the case here. To understand why this restriction is necessary in the programming model, remember that combiners are optimizations that cannot change the correctness of the algorithm. So let us remove the combiner and see what happens: the output value type of the mapper is integer, so the reducer expects to receive a list of integers as values. But the reducer actually expects a list of pairs! The correctness of the algorithm is contingent on the combiner running on the output of the mappers, and more specifically, on the combiner being run exactly once. Recall from our previous discussion that Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times. This violates the MapReduce programming model.
3.1. LOCAL AGGREGATION  49

class Mapper
    method Map(string t, integer r)
        Emit(string t, pair (r, 1))

class Combiner
    method Combine(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        Emit(string t, pair (sum, cnt))

class Reducer
    method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        ravg ← sum/cnt
        Emit(string t, integer ravg)

Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key. This algorithm correctly takes advantage of combiners.

Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm is correct. In the mapper we emit as the value a pair consisting of the integer and one; this corresponds to a partial count over one instance. The combiner separately aggregates the partial sums and the partial counts (as before), and emits pairs with updated sums and counts. The reducer is similar to the combiner, except that the mean is computed at the end. In essence, this algorithm transforms a non-associative operation (mean of numbers) into an associative operation (element-wise sum of a pair of numbers, with an additional division at the very end).

Let us verify the correctness of this algorithm by repeating the previous exercise: What would happen if no combiners were run? With no combiners, the mappers would send pairs (as values) directly to the reducers. There would be as many intermediate pairs as there were input key-value pairs, and each of those would consist of an integer and one. The reducer would still arrive at the correct sum and count, and hence the mean would be correct. Now add in the combiners: the algorithm would remain correct, no matter how many times they run, since the combiners merely aggregate partial sums and counts to pass along to the reducers. Note that although the output key-value type of the combiner must be the same as the input key-value type of the reducer, the reducer can emit final key-value pairs of a different type.
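The "zero, one, or multiple combiner runs" argument can be simulated end to end. The sketch below mirrors Figure 3.6 in plain Python; the `run_job` harness is a toy stand-in for the execution framework (a grouping dictionary plays the role of the shuffle), not a real Hadoop API:

```python
from collections import defaultdict

def mapper(t, r):
    yield t, (r, 1)                       # value is a (partial sum, partial count) pair

def combine_or_reduce(t, pairs):
    s = sum(p[0] for p in pairs)          # element-wise sum of (sum, count) pairs
    c = sum(p[1] for p in pairs)
    return t, (s, c)

def reducer(t, pairs):
    s, c = combine_or_reduce(t, pairs)[1]
    return t, s / c                       # the mean is computed only at the very end

def run_job(records, combiner_passes=0):
    # Toy harness: group values by key, then apply the combiner an arbitrary
    # number of times, as the execution framework is free to do.
    groups = defaultdict(list)
    for t, r in records:
        for k, v in mapper(t, r):
            groups[k].append(v)
    for _ in range(combiner_passes):
        groups = {k: [combine_or_reduce(k, vs)[1]] for k, vs in groups.items()}
    return {k: reducer(k, vs)[1] for k, vs in groups.items()}

records = [("foo", 1), ("foo", 2), ("foo", 6), ("bar", 10)]
```

Running the job with zero combiner passes or with several yields identical means, which is exactly the property the programming model demands of a combiner.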

class Mapper
    method Initialize
        S ← new AssociativeArray
        C ← new AssociativeArray
    method Map(string t, integer r)
        S{t} ← S{t} + r
        C{t} ← C{t} + 1
    method Close
        for all term t ∈ S do
            Emit(term t, pair (S{t}, C{t}))

Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key, illustrating the in-mapper combining design pattern. Only the mapper is shown here; the reducer is the same as in Figure 3.6.

Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the in-mapper combining pattern. Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after the entire input split has been processed; similar to before, the value is a pair consisting of the sum and count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper is subject to all the tradeoffs and caveats discussed earlier in this section, but in this case the memory footprint of the data structures for holding intermediate data is likely to be modest, making this variant algorithm an attractive option.

3.2 PAIRS AND STRIPES

One common approach for synchronization in MapReduce is to construct complex keys and values in such a way that data necessary for a computation are naturally brought together by the execution framework. We first touched on this technique in the previous section, in the context of "packaging" partial sums and counts in a complex value (i.e., a pair) that is passed from mapper to combiner to reducer.
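The in-mapper combining mapper of Figure 3.7 can be sketched as an object that accumulates state across calls and emits only when closed. The class and method names below are illustrative stand-ins for Hadoop's Initialize/Map/Close lifecycle, not a real API:

```python
from collections import defaultdict

class MeanMapper:
    """In-mapper combining: hold per-key partial sums and counts in memory
    across all input key-value pairs in the split, emitting nothing per record."""

    def __init__(self):                    # corresponds to Initialize
        self.S = defaultdict(int)          # partial sums, keyed by term
        self.C = defaultdict(int)          # partial counts, keyed by term

    def map(self, t, r):                   # corresponds to Map; no emission here
        self.S[t] += r
        self.C[t] += 1

    def close(self):                       # corresponds to Close
        # One (sum, count) pair per distinct key seen in the entire split.
        return [(t, (self.S[t], self.C[t])) for t in self.S]

m = MeanMapper()
for t, r in [("foo", 1), ("foo", 2), ("bar", 5)]:
    m.map(t, r)
emitted = dict(m.close())
```

Note how the number of emitted pairs is bounded by the number of distinct keys in the split, not by the number of input records, which is the source of the pattern's efficiency.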

Building on previously published work [54, 94], this section introduces two common design patterns we have dubbed "pairs" and "stripes" that exemplify this strategy. As a running example, we focus on the problem of building word co-occurrence matrices from large corpora, a common task in corpus linguistics and statistical natural language processing. Formally, the co-occurrence matrix of a corpus is a square n × n matrix where n is the number of unique words in the corpus (i.e., the vocabulary size). A cell mij contains the number of times word wi co-occurs with word wj within a specific context: a natural unit such as a sentence, paragraph, or document, or a certain window of m words (where m is an application-dependent parameter). Note that the upper and lower triangles of the matrix are identical since co-occurrence is a symmetric relation, though in the general case relations between words need not be symmetric. For example, a co-occurrence matrix M where mij is the count of how many times word i was immediately succeeded by word j would usually not be symmetric.

This task is quite common in text processing and provides the starting point to many other algorithms, e.g., for computing statistics such as pointwise mutual information [38], for unsupervised sense clustering [136], and, more generally, a large body of work in lexical semantics based on distributional profiles of words, dating back to Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in information retrieval (e.g., automatic thesaurus construction [137] and stemming [157]) and other related fields such as text mining. More importantly, this problem represents a specific instance of the task of estimating distributions of discrete joint events from a large number of observations, a very common task in statistical natural language processing for which there are nice MapReduce solutions. Indeed, concepts presented here are also used in Chapter 6 when we discuss expectation-maximization algorithms.

Beyond text processing, problems in many application domains share similar characteristics. For example, a large retailer might analyze point-of-sale transaction records to identify correlated product purchases (e.g., customers who buy this tend to also buy that), which would assist in inventory management and product placement on store shelves. Similarly, an intelligence analyst might wish to identify associations between re-occurring financial transactions that are otherwise unrelated, which might provide a clue in thwarting terrorist activity. The algorithms discussed in this section could be adapted to tackle these related problems.

It is obvious that the space requirement for the word co-occurrence problem is O(n²), where n is the size of the vocabulary, which for real-world English corpora can be hundreds of thousands of words, or even billions of words in web-scale collections.[7] The computation of the word co-occurrence matrix is quite simple if the entire matrix fits into memory.

[7] The size of the vocabulary depends on the definition of a "word" and techniques (if any) for corpus preprocessing. One common strategy is to replace all rare words (below a certain frequency) with a "special" token such as <UNK> (which stands for "unknown") to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both map to the same token (#.##).
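The single-machine baseline alluded to above is easy to state: two nested loops over each document, with a window parameter m. A sketch (the window definition, up to m tokens on either side, is one of several reasonable choices; whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def cooccurrence(docs, m=2):
    """Count word co-occurrences within a window of m tokens on either side.
    Feasible only while the matrix (a Counter over word pairs) fits in memory."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - m), min(len(tokens), i + m + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

M = cooccurrence(["a b c", "a b"], m=1)
```

As the text notes, the resulting matrix is symmetric under this definition: M[(w, u)] always equals M[(u, w)].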

However, in the case where the matrix is too big to fit in memory, a naïve implementation on a single machine can be very slow as memory is paged to disk. Although compression techniques can increase the size of corpora for which word co-occurrence matrices can be constructed on a single machine, it is clear that there are inherent scalability limitations. We describe two MapReduce algorithms for this task that can scale to large corpora.

Pseudo-code for the first algorithm, dubbed the "pairs" approach, is shown in Figure 3.8. As usual, document ids and the corresponding contents make up the input key-value pairs. The mapper processes each input document and emits intermediate key-value pairs with each co-occurring word pair as the key and the integer one (i.e., the count) as the value. This is straightforwardly accomplished by two nested loops: the outer loop iterates over all words (the left element in the pair), and the inner loop iterates over all neighbors of the first word (the right element in the pair). The neighbors of a word can either be defined in terms of a sliding window or some other contextual unit such as a sentence. Each pair corresponds to a cell in the word co-occurrence matrix. The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer; thus, in this case the reducer simply sums up all the values associated with the same co-occurring word pair to arrive at the absolute count of the joint event in the corpus, which is then emitted as the final key-value pair. This algorithm illustrates the use of complex keys in order to coordinate distributed computations.

An alternative approach, dubbed the "stripes" approach, is presented in Figure 3.9. Like the pairs approach, co-occurring word pairs are generated by two nested loops. However, the major difference is that instead of emitting intermediate key-value pairs for each co-occurring word pair, co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word (i.e., its context). The MapReduce execution framework guarantees that all associative arrays with the same key will be brought together in the reduce phase of processing. The reducer performs an element-wise sum of all associative arrays with the same key, accumulating counts that correspond to the same cell in the co-occurrence matrix. The final associative array is emitted with the same word as the key. In contrast to the pairs approach, each final key-value pair encodes a row in the co-occurrence matrix.

It is immediately obvious that the pairs algorithm generates an immense number of key-value pairs compared to the stripes approach. The stripes representation is much more compact, since with pairs the left element is repeated for every co-occurring word pair. The stripes approach also generates fewer and shorter intermediate keys, and therefore the execution framework has less sorting to perform.

class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ Neighbors(w) do
                Emit(pair (w, u), count 1)               ▷ Emit count for each co-occurrence

class Reducer
    method Reduce(pair p, counts [c1, c2, ...])
        s ← 0
        for all count c ∈ counts [c1, c2, ...] do
            s ← s + c                                    ▷ Sum co-occurrence counts
        Emit(pair p, count s)

Figure 3.8: Pseudo-code for the "pairs" approach for computing word co-occurrence matrices from large corpora.

class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            H ← new AssociativeArray
            for all term u ∈ Neighbors(w) do
                H{u} ← H{u} + 1                          ▷ Tally words co-occurring with w
            Emit(term w, stripe H)

class Reducer
    method Reduce(term w, stripes [H1, H2, H3, ...])
        Hf ← new AssociativeArray
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            Sum(Hf, H)                                   ▷ Element-wise sum
        Emit(term w, stripe Hf)

Figure 3.9: Pseudo-code for the "stripes" approach for computing word co-occurrence matrices from large corpora.
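The two figures can be mirrored in plain Python to check that they compute the same matrix. The shuffle is simulated with dictionaries, and `neighbors` here is simply "all other tokens in the document", one of the contextual units mentioned above; none of this is a real MapReduce API:

```python
from collections import Counter, defaultdict

def neighbors(tokens, i):
    return [u for j, u in enumerate(tokens) if j != i]

def pairs_job(docs):
    # "Pairs": emit ((w, u), 1) per co-occurrence; the reducer's per-pair sum
    # is folded directly into the simulated shuffle.
    shuffled = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            for u in neighbors(tokens, i):
                shuffled[(w, u)] += 1
    return dict(shuffled)

def stripes_job(docs):
    # "Stripes": emit (w, {u: count}); the reducer element-wise sums stripes.
    stripes = defaultdict(Counter)
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            H = Counter(neighbors(tokens, i))
            stripes[w].update(H)           # element-wise sum of associative arrays
    # Flatten rows back into cells so the two results are directly comparable.
    return {(w, u): c for w, H in stripes.items() for u, c in H.items()}

docs = ["a b c", "a b"]
```

For the toy corpus the two jobs agree cell for cell, while the stripes job shuffles one associative array per word instead of one pair per co-occurrence.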

Both algorithms can benefit from the use of combiners, since the respective operations in their reducers (addition and element-wise sum of associative arrays) are both commutative and associative. However, combiners with the stripes approach have more opportunities to perform local aggregation because the key space is the vocabulary: associative arrays can be merged whenever a word is encountered multiple times by a mapper. In contrast, the key space in the pairs approach is the cross of the vocabulary with itself, which is far larger; counts can be aggregated only when the same co-occurring word pair is observed multiple times by an individual mapper (which is less likely than observing multiple occurrences of a word, as in the stripes case).

For both algorithms, the in-mapper combining optimization discussed in the previous section can also be applied; the modification is sufficiently straightforward that we leave the implementation as an exercise for the reader. However, the above caveats remain: there will be far fewer opportunities for partial aggregation in the pairs approach due to the sparsity of the intermediate key space. The sparsity of the key space also limits the effectiveness of in-memory combining, since the mapper may run out of memory to store partial counts before all documents are processed, necessitating some mechanism to periodically emit key-value pairs (which further limits opportunities to perform partial aggregation). Similarly, for the stripes approach, memory management will also be more complex than in the simple word count example. For common terms, the associative array may grow to be quite large, necessitating some mechanism to periodically flush in-memory structures.

It is important to consider potential scalability bottlenecks of either algorithm. The stripes approach makes the assumption that, at any point in time, each associative array is small enough to fit into memory; otherwise, memory paging will significantly impact performance. The size of the associative array is bounded by the vocabulary size, which is itself unbounded with respect to corpus size (recall the previous discussion of Heaps' Law). Therefore, as the sizes of corpora increase, this will become an increasingly pressing issue: perhaps not for gigabyte-sized corpora, but certainly for the terabyte-sized and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach, on the other hand, does not suffer from this limitation, since it does not need to hold intermediate data in memory.

Given this discussion, which approach is faster? Here, we present previously published results [94] that empirically answered this question. We have implemented both algorithms in Hadoop and applied them to a corpus of 2.27 million documents from the Associated Press Worldstream (APW) totaling 5.7 GB.[8]

[8] This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog number LDC2007T07).

Prior to working with Hadoop, the corpus was first preprocessed as follows: all XML markup was removed, followed by tokenization and stopword removal using standard tools from the Lucene search engine. All tokens were then replaced with unique integers for a more efficient encoding. Figure 3.10 compares the running time of the pairs and stripes approaches on different fractions of the corpus, with a co-occurrence window size of two. These experiments were performed on a Hadoop cluster with 19 slave nodes, each with two single-core processors and two disks.

Results demonstrate that the stripes approach is much faster than the pairs approach: 666 seconds (∼11 minutes) compared to 3758 seconds (∼62 minutes) for the entire corpus (an improvement by a factor of 5.7). The mappers in the pairs approach generated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners, this was reduced to 1.1 billion key-value pairs, which quantifies the amount of intermediate data transferred across the network. In the end, the reducers emitted a total of 142 million final key-value pairs (the number of non-zero cells in the co-occurrence matrix). On the other hand, the mappers in the stripes approach generated 653 million intermediate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value pairs remained. The reducers emitted a total of 1.69 million final key-value pairs (the number of rows in the co-occurrence matrix). As expected, the stripes approach provided more opportunities for combiners to aggregate intermediate results, thus greatly reducing network traffic in the shuffle and sort phase. Figure 3.10 also shows that both algorithms exhibit highly desirable scaling characteristics: linear in the amount of input data. This is confirmed by a linear regression applied to the running time data, which yields an R² value close to one.

An additional series of experiments explored the scalability of the stripes approach along another dimension: the size of the cluster. These experiments were made possible by Amazon's EC2 service, which allows users to rapidly provision clusters of varying sizes for limited durations (for more information, refer back to our discussion of utility computing in Section 1.1). Virtualized computational units in EC2 are called instances, and the user is charged only for the instance-hours consumed. Figure 3.11 (left) shows the running time of the stripes algorithm (on the same corpus, with the same setup as before) on varying cluster sizes, from 20 slave "small" instances all the way up to 80 slave "small" instances (along the x-axis). Running times are shown with solid squares. Figure 3.11 (right) recasts the same results to illustrate scaling characteristics. The circles plot the relative size and speedup of the EC2 experiments, with respect to the 20-instance cluster. These results show highly desirable linear scaling characteristics (i.e., doubling the cluster size makes the job twice as fast). This is confirmed by a linear regression with an R² value close to one.

Viewed abstractly, the pairs and stripes algorithms represent two different approaches to counting co-occurring events from a large number of observations.

Figure 3.10: Running time of the "pairs" and "stripes" algorithms for computing word co-occurrence matrices on different fractions of the APW corpus. These experiments were performed on a Hadoop cluster with 19 slaves, each with two single-core processors and two disks. (Plot not reproduced; linear fits to the running times yield R² values of 0.992 and 0.999.)

Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing Hadoop cluster size (right). (Plots not reproduced; the speedup fit yields an R² value of 0.997.)

This general description captures the gist of many algorithms in fields as diverse as text processing, data mining, and bioinformatics. The pairs approach individually records each co-occurring event, while the stripes approach records all co-occurring events with respect to a conditioning event. For this reason, these two design patterns are broadly useful and frequently observed in a variety of applications.

To conclude, it is worth noting that the pairs and stripes approaches represent endpoints along a continuum of possibilities. A middle ground might be to record a subset of the co-occurring events with respect to a conditioning event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words co-occurring with wi would be divided into b smaller "sub-stripes", associated with b separate keys: (wi, 1), (wi, 2), ..., (wi, b). This would be a reasonable solution to the memory limitations of the stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach; in the case of b = 1, this is equivalent to the standard stripes approach.

3.3 COMPUTING RELATIVE FREQUENCIES

Let us build on the pairs and stripes algorithms presented in the previous section and continue with our running example of constructing the word co-occurrence matrix M for a large corpus. Recall that in this large square n × n matrix, where n = |V| (the vocabulary size), cell mij contains the number of times word wi co-occurs with word wj within a specific context. The drawback of absolute counts is that they do not take into account the fact that some words appear more frequently than others. Word wi may co-occur frequently with wj simply because one of the words is very common. A simple remedy is to convert absolute counts into relative frequencies, f(wj|wi). That is, what proportion of the time does wj appear in the context of wi? This can be computed using the following equation:

    f(wj|wi) = N(wi, wj) / Σw′ N(wi, w′)                                  (3.1)

Here, N(·, ·) indicates the number of times a particular co-occurring word pair is observed in the corpus. We need the count of the joint event (word co-occurrence), divided by what is known as the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).

Computing relative frequencies with the stripes approach is straightforward. In the reducer, counts of all words that co-occur with the conditioning variable (wi in the above example) are available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal (i.e., Σw′ N(wi, w′)), and then divide all the joint counts by the marginal to arrive at the relative frequency for all words. This implementation requires minimal modification to the original stripes algorithm in Figure 3.9.
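The modified stripes reducer is only a few lines: element-wise sum the incoming stripes, total the result to obtain the marginal, then divide each joint count. A sketch in plain Python (the stripe is an in-memory dict, consistent with the memory assumption discussed above):

```python
def reduce_relative_frequency(w, stripes):
    """Element-wise sum the stripes for word w, then normalize by the marginal."""
    Hf = {}
    for H in stripes:                      # merge partial stripes from mappers
        for u, c in H.items():
            Hf[u] = Hf.get(u, 0) + c
    marginal = sum(Hf.values())            # sum over w' of N(w, w')
    return w, {u: c / marginal for u, c in Hf.items()}   # f(u | w)

# Two partial stripes for "dog" arriving from different mappers (toy data).
w, f = reduce_relative_frequency("dog", [{"bark": 3, "cat": 1}, {"bark": 1}])
```

Because the entire row of the matrix is in hand inside one reducer call, no extra coordination is needed: the marginal is simply the row sum.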

(dog. compute the relative frequencies. Note that. Inside the reducer. although it requires the coordination of several mechanisms in MapReduce. we can easily detect if all pairs associated with the word we are conditioning on (wi ) have been encountered. Given this ordering. aardvark) and (dog. for example. the programmer can deﬁne the sort order of keys so that data needed earlier is presented . wj ) as the key and the count as the value. as with before. in essence building the associative array in the stripes approach. modulo the number of reducers. To produce the desired behavior. the reducer receives (wi . which can be explicitly controlled by the programmer. we must deﬁne the sort order of the pair so that keys are ﬁrst sorted by the left word.58 CHAPTER 3. we can buﬀer in memory all the words that co-occur with wi and their counts. To make this work. such an algorithm is indeed possible. we must deﬁne a custom partitioner that only pays attention to the left word. At that point we can go back through the in-memory buﬀer. If it were possible to somehow compute (or otherwise obtain access to) the marginal in the reducer before processing the joint counts. the raw byte representation is used to compute the hash value. MAPREDUCE ALGORITHM DESIGN illustrates the use of complex data structures to coordinate distributed computations in MapReduce. Is there a way to modify the basic pairs approach so that this advantage is retained? As it turns out. The insight lies in properly sequencing data presented to the reducer. As a result. This algorithm will indeed work. there is no guarantee that. zebra) are assigned to the same reducer. Fortunately. one can use the MapReduce execution framework to bring together all the pieces of data required to perform a computation. That is. but it suﬀers from the same drawback as the stripes approach: as the size of the corpus grows. unfortunately. 
and at some point there will not be suﬃcient memory to store all co-occurring words and their counts for the word we are conditioning on. and then by the right word. The notion of “before” and “after” can be captured in the ordering of key-value pairs. This. and then emit those results in the ﬁnal key-value pairs. the reducer can preserve state across multiple keys. There is one more modiﬁcation necessary to make this algorithm work. so does that vocabulary size. For computing the co-occurrence matrix. does not happen automatically: recall that the default partitioner is based on the hash value of the intermediate key. That is. Through appropriate structuring of keys and values. the partitioner should partition based on the hash of the left word only. the advantage of the pairs approach is that it doesn’t suﬀer from any memory bottlenecks. as in the mapper. this algorithm also assumes that each associative array ﬁts into memory. For a complex key. From this alone it is not possible to compute f (wj |wi ) since we do not have the marginal. We must ensure that all pairs with the same left word are sent to the same reducer. How might one compute relative frequencies with the pairs approach? In the pairs approach. the reducer could simply divide the joint counts by the marginal to compute the relative frequencies.

Recall that in the basic pairs algorithm, each mapper emits a key-value pair with the co-occurring word pair as the key. To compute relative frequencies, we modify the mapper so that it additionally emits a "special" key of the form (wi, ∗), with a value of one, that represents the contribution of the word pair to the marginal. Through use of combiners, these partial marginal counts will be aggregated before being sent to the reducers. Alternatively, the in-mapper combining pattern can be used to even more efficiently aggregate marginal counts.

In the reducer, we must make sure that the special key-value pairs representing the partial marginal contributions are processed before the normal key-value pairs representing the joint counts. This is accomplished by defining the sort order of the keys so that pairs with the special symbol of the form (wi, ∗) are ordered before any other key-value pairs where the left word is wi. In addition, as before, we must also properly define the partitioner to pay attention to only the left word in each pair. With the data properly sequenced, the reducer can directly compute the relative frequencies.

A concrete example is shown in Figure 3.12, which lists the sequence of key-value pairs that a reducer might encounter:

    key                 values
    (dog, ∗)            [6327, 8514, ...]     compute marginal: Σw′ N(dog, w′) = 42908
    (dog, aardvark)     [2, 1]                f(aardvark|dog) = 3/42908
    (dog, aardwolf)     [1]                   f(aardwolf|dog) = 1/42908
    ...
    (dog, zebra)        [2, 1, 1, 1]          f(zebra|dog) = 5/42908
    (doge, ∗)           [682, ...]            compute marginal: Σw′ N(doge, w′) = 1267
    ...

Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm for computing relative frequencies. This illustrates the application of the order inversion design pattern.

First, the reducer is presented with the special key (dog, ∗) and a number of values, each of which represents a partial marginal contribution from the map phase (assume here either combiners or in-mapper combining, so the values represent partially aggregated counts). The reducer accumulates these counts to arrive at the marginal, Σw′ N(dog, w′). The reducer holds on to this value as it processes subsequent keys. After (dog, ∗), the reducer will encounter a series of keys representing joint counts; let's say the first of these is the key (dog, aardvark). Associated with this key will be a list of values representing partial joint counts from the map phase (two separate values in this case). Summing these counts will yield the final joint count, i.e., the number of times dog and aardvark co-occur in the entire collection.
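All four mechanisms, the special (wi, ∗) key, the custom sort order, the left-word partitioning, and the state preserved across keys, can be simulated in a few lines. The sort key below orders ∗ before any real word, standing in for Hadoop's custom comparator; the rest is a toy stand-in for the framework, not a real API:

```python
from collections import defaultdict

STAR = "*"

def mapper(doc):
    tokens = doc.split()
    for i, w in enumerate(tokens):
        for j, u in enumerate(tokens):
            if i != j:
                yield (w, u), 1
                yield (w, STAR), 1         # contribution to the marginal

def reduce_partition(sorted_items):
    # Keys arrive sorted so that (w, *) precedes (w, anything); the reducer
    # carries the current marginal as state across keys.
    out, marginal = {}, 0
    for (w, u), values in sorted_items:
        if u == STAR:
            marginal = sum(values)         # reset state for a new left word
        else:
            out[(w, u)] = sum(values) / marginal
    return out

def run(docs):
    # Single simulated reducer; a real job would partition on hash(left word)
    # so each reducer sees complete, correctly ordered runs per left word.
    groups = defaultdict(list)
    for doc in docs:
        for k, v in mapper(doc):
            groups[k].append(v)
    key_order = lambda kv: (kv[0][0], kv[0][1] != STAR, kv[0][1])
    return reduce_partition(sorted(groups.items(), key=key_order))

F = run(["a b b"])
```

Note that the reducer never buffers more than a single integer (the marginal), which is exactly the memory property the order inversion pattern is after.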

At this point, since the reducer already knows the marginal, simple arithmetic suffices to compute the relative frequency. All subsequent joint counts are processed in exactly the same manner. When the reducer encounters the next special key-value pair (doge, ∗), the reducer resets its internal state and starts to accumulate the marginal all over again. Observe that the memory requirement for this algorithm is minimal, since only the marginal (an integer) needs to be stored. No buffering of individual co-occurring word counts is necessary, and therefore we have eliminated the scalability bottleneck of the previous algorithm.

This design pattern, which we call "order inversion", occurs surprisingly often and across applications in many domains. It is so named because through proper coordination, we can access the result of a computation in the reducer (for example, an aggregate statistic) before processing the data needed for that computation. The key insight is to convert the sequencing of computations into a sorting problem. In most cases, an algorithm requires data in some fixed order: by controlling how keys are sorted and how the key space is partitioned, we can present data to the reducer in the order necessary to perform the proper computations. This greatly cuts down on the amount of partial results that the reducer needs to hold in memory.

To summarize, the specific application of the order inversion design pattern for computing relative frequencies requires the following:

• Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its contribution to the marginal.

• Controlling the sort order of the intermediate key so that the key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.

• Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.

• Preserving state across multiple keys in the reducer to first compute the marginal based on the special key-value pairs and then dividing the joint counts by the marginals to arrive at the relative frequencies.

As we will see in Chapter 4, this design pattern is also used in inverted index construction to properly set compression parameters for postings lists.

3.4 SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort phase, which is very convenient if computations inside the reducer rely on sort order (e.g., the order inversion design pattern described in the previous section).

3.4 SECONDARY SORTING

What if in addition to sorting by key, we also need to sort by value? Google's MapReduce implementation provides built-in functionality for (optional) secondary sorting, which guarantees that values arrive in sorted order. Hadoop, unfortunately, does not have this capability built in.

Consider the example of sensor data from a scientific experiment: there are m sensors, each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where rx after each timestamp represents the actual sensor readings (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images):

(t1, m1, r80521)
(t1, m2, r14209)
(t1, m3, r76042)
...
(t2, m1, r21823)
(t2, m2, r66508)
(t2, m3, r98347)
...

Suppose we wish to reconstruct the activity at each individual sensor over time. A MapReduce program to accomplish this might map over the raw data and emit the sensor id as the intermediate key, with the rest of each record as the value:

m1 → (t1, r80521)

This would bring all readings from the same sensor together in the reducer. However, since MapReduce makes no guarantees about the ordering of values associated with the same key, the sensor readings will not likely be in temporal order. The most obvious solution is to buffer all the readings in memory and then sort by timestamp before additional processing. However, it should be apparent by now that any in-memory buffering of data introduces a potential scalability bottleneck. What if we are working with a high frequency sensor or sensor readings over a long period of time? What if the sensor readings themselves are large complex objects? This approach may not scale in these cases—the reducer would run out of memory trying to buffer all values associated with the same key.

This is a common problem, since in many applications we wish to first group together data one way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time). Fortunately, there is a general purpose solution, which we call the "value-to-key conversion" design pattern. The basic idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce execution framework handle the sorting. In the above example, instead of emitting the sensor id as the key, we would emit the sensor id and the timestamp as a composite key:

(m1, t1) → (r80521)
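To make the pattern concrete, the following sketch (plain Python standing in for the Hadoop runtime; the sensor readings are invented) simulates value-to-key conversion: the mapper moves the timestamp into a composite key, sorting is performed on the full composite key, and a partitioner that hashes only on the sensor id keeps all of one sensor's readings together.

```python
from itertools import groupby

# Hypothetical sensor records: (timestamp, sensor id, reading), not in temporal order.
records = [
    (2, "m1", "r21823"), (1, "m2", "r14209"), (1, "m1", "r80521"),
    (2, "m2", "r66508"), (1, "m3", "r76042"), (2, "m3", "r98347"),
]

# Map: value-to-key conversion -- move the timestamp into a composite key.
intermediate = [((sensor, t), reading) for (t, sensor, reading) in records]

# Shuffle: the framework sorts on the full composite key (sensor id, then
# timestamp), while a custom partitioner hashes on the sensor id alone so that
# every key for one sensor reaches the same reducer.
num_reducers = 2
partitions = [[] for _ in range(num_reducers)]
for key, value in sorted(intermediate):
    partitions[hash(key[0]) % num_reducers].append((key, value))

# Reduce: readings for each sensor now arrive already in temporal order.
for part in partitions:
    for sensor, group in groupby(part, key=lambda kv: kv[0][0]):
        print(sensor, [(t, r) for ((_, t), r) in group])
```

Which reducer a given sensor lands on depends on the hash, but within each reducer the framework's sort has already placed that sensor's readings in timestamp order.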

The sensor reading itself now occupies the value. We must define the intermediate key sort order to first sort by the sensor id (the left element in the pair) and then by the timestamp (the right element in the pair). We must also implement a custom partitioner so that all pairs associated with the same sensor are shuffled to the same reducer. Properly orchestrated, the key-value pairs will be presented to the reducer in the correct sorted order:

(m1, t1) → [(r80521)]
(m1, t2) → [(r21823)]
(m1, t3) → [(r146925)]
...

However, note that sensor readings are now split across multiple keys. The reducer will need to preserve state and keep track of when readings associated with the current sensor end and the next sensor begin.9

The basic tradeoff between the two approaches discussed above (buffer and in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck.10 With value-to-key conversion, sorting is offloaded to the MapReduce execution framework. This pattern results in many more keys for the framework to sort, but distributed sorting is a task that the MapReduce runtime excels at since it lies at the heart of the programming model. Note that this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting.

9 Alternatively, Hadoop provides API hooks to define "groups" of intermediate keys that should be processed together in the reducer.
10 Note that, in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer, although one would be duplicating functionality that is already present in the MapReduce execution framework. It makes more sense to take advantage of functionality that is already present with value-to-key conversion.

3.5 RELATIONAL JOINS

One popular application of Hadoop is data-warehousing. In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions to product inventories. Typically, the data is relational in nature, but increasingly data warehouses are used to store semi-structured data (e.g., query logs) as well as unstructured data. Data warehouses form a foundation for business intelligence applications designed to provide decision support. It is widely believed that insights gained by mining historical, current, and prospective data can yield competitive advantages in the marketplace.

Traditionally, data warehouses have been implemented through relational databases, particularly those optimized for a specific workload known as online analytical processing (OLAP). A number of vendors offer parallel databases, but customers find that they often cannot cost-effectively scale to the crushing amounts of data an organization needs to deal with today.

Parallel databases are often quite expensive—on the order of tens of thousands of dollars per terabyte of user data. Over the past few years, Hadoop has gained popularity as a platform for data-warehousing. Hammerbacher [68], for example, discussed Facebook's experiences with scaling up business intelligence applications with Oracle databases, which they ultimately abandoned in favor of a Hadoop-based solution developed in-house called Hive (which is now an open-source project). Pig [114] is a platform for massive data analytics built on Hadoop and capable of handling structured as well as semi-structured data. It was originally developed by Yahoo, but is now also an open-source project.

There is an ongoing debate between advocates of parallel databases and proponents of MapReduce regarding the merits of both approaches for OLAP-type workloads. Dewitt and Stonebraker, two well-known figures in the database community, famously decried MapReduce as "a major step backwards" in a controversial blog post.11 With colleagues, they ran a series of benchmarks that demonstrated the supposed superiority of column-oriented parallel databases over Hadoop [120, 144]. However, see Dean and Ghemawat's counterarguments [47] and recent attempts at hybrid architectures [1]. We shall refrain here from participating in this lively debate, and instead focus on discussing algorithms. We should stress here that even though Hadoop has been applied to process relational data, Hadoop is not a database. Given successful applications of Hadoop to data-warehousing and complex analytical queries that are prevalent in such an environment, it makes sense to examine MapReduce algorithms for manipulating relational data. From an application point of view, it is highly unlikely that an analyst interacting with a data warehouse will ever be called upon to write MapReduce programs (and indeed, Hadoop-based systems such as Hive and Pig present a much higher-level language for interacting with large amounts of data). Nevertheless, it is instructive to understand the algorithms that underlie basic relational operations.

11 http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/

This section focuses specifically on performing relational joins in MapReduce. We present three different strategies for joining two datasets (relations), generically named S and T. Let us suppose that relation S looks something like the following:

(k1, s1, S1)
(k2, s2, S2)
(k3, s3, S3)
...

where k is the key we would like to join on, sn is a unique id for the tuple, and the Sn after sn denotes other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation T looks something like this:

(k1, t1, T1)
(k3, t2, T2)
(k8, t3, T3)
...

where k is the join key, tn is a unique id for the tuple, and the Tn after tn denotes other attributes in the tuple.

To make this task more concrete, we present one realistic scenario: S might represent a collection of user profiles, in which case k could be interpreted as the primary key (i.e., user id). The tuples might contain demographic information such as age, gender, income, etc. The other dataset, T, might represent logs of online activity. Each tuple might correspond to a page view of a particular URL and may contain additional information such as time spent on the page, ad revenue generated, etc. The k in these tuples could be interpreted as the foreign key that associates each individual page view with a user. Joining these two datasets would allow an analyst, for example, to break down online activity in terms of demographics.

3.5.1 REDUCE-SIDE JOIN

The first approach to relational joins is what's known as a reduce-side join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key, and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key—which is exactly what we need to perform the join operation. This approach is known as a parallel sort-merge join in the database community [134]. In more detail, there are three different cases to consider.

The first and simplest is a one-to-one join, where at most one tuple from S and one tuple from T share the same join key (but it may be the case that no tuple from S shares the join key with a tuple from T, or vice versa). In this case, the algorithm sketched above will work fine. The reducer will be presented keys and lists of values along the lines of the following:

k23 → [(s64, S64), (t84, T84)]
k37 → [(s68, S68)]
k59 → [(t97, T97), (t99, T99)]
k61 → [(s81, S81)]
...

Since we've emitted the join key as the intermediate key, we can remove it from the value to save a bit of space.12 If there are two values associated with a key, then we know that one must be from S and the other must be from T.

12 Not very important if the intermediate data is compressed.
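A minimal simulation of the reduce-side join may help (plain Python with invented tuples; the grouping dictionary stands in for the shuffle and sort phase, and the relation tag stands in for inspecting the tuple's origin).

```python
from collections import defaultdict

# Hypothetical relations: (join key, tuple id, other attributes).
S = [("k23", "s64", "S64"), ("k37", "s68", "S68"), ("k61", "s81", "S81")]
T = [("k23", "t84", "T84"), ("k59", "t97", "T97")]

# Map phase: emit the join key as the intermediate key, the tagged tuple as the value.
groups = defaultdict(list)
for k, tid, attrs in S:
    groups[k].append(("S", tid, attrs))
for k, tid, attrs in T:
    groups[k].append(("T", tid, attrs))

# Reduce phase: a key with one tuple from each relation yields a joined record;
# a lone tuple has no match, so the reducer emits nothing for it.
joined = []
for k, values in groups.items():
    s_tuples = [v for v in values if v[0] == "S"]
    t_tuples = [v for v in values if v[0] == "T"]
    if s_tuples and t_tuples:
        joined.append((k, s_tuples[0][1:], t_tuples[0][1:]))

print(joined)
```

Note that the tag is needed precisely because MapReduce makes no ordering guarantee over the values: the reducer cannot assume the S tuple comes first.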

However, recall that in the basic MapReduce programming model, no guarantees are made about value ordering, so the first value might be from S or from T. We can proceed to join the two tuples and perform additional computations (e.g., filter by some other attribute, compute aggregates, etc.). If there is only one value associated with a key, this means that no tuple in the other dataset shares the join key, so the reducer does nothing.

Let us now consider the one-to-many join. Assume that tuples in S have unique join keys (i.e., k is the primary key in S), so that S is the "one" and T is the "many". The above algorithm will still work, but when processing each key in the reducer, we have no idea when the value corresponding to the tuple from S will be encountered, since values are arbitrarily ordered. The easiest solution is to buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T to perform the join. However, as we have seen several times already, this creates a scalability bottleneck since we may not have sufficient memory to hold all the tuples with the same join key.

This is a problem that requires a secondary sort, and the solution lies in the value-to-key conversion design pattern we just presented. In the mapper, instead of simply emitting the join key as the intermediate key, we instead create a composite key consisting of the join key and the tuple id (from either S or T). Two additional changes are required: First, we must define the sort order of the keys to first sort by the join key, and then sort all tuple ids from S before all tuple ids from T. Second, we must define the partitioner to pay attention to only the join key, so that all composite keys with the same join key arrive at the same reducer.

After applying the value-to-key conversion design pattern, the reducer will be presented with keys and values along the lines of the following:

(k82, s105) → [(S105)]
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
...

Since both the join key and the tuple id are present in the intermediate key, we can remove them from the value to save a bit of space.13 Whenever the reducer encounters a new join key, it is guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new join key is encountered). Since the MapReduce execution framework performs the sorting, there is no need to buffer tuples (other than the single one from S). Thus, we have eliminated the scalability bottleneck.

13 Once again, not very important if the intermediate data is compressed.
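The one-to-many case can be sketched the same way (again plain Python with invented tuples; the "0"/"1" relation tag in the composite key is one possible way to force S tuple ids to sort before T tuple ids for the same join key).

```python
# One-to-many reduce-side join via value-to-key conversion (illustrative data).
S = [("k82", "s105", "S105")]                                        # the "one" side
T = [("k82", "t98", "T98"), ("k82", "t101", "T101"), ("k82", "t137", "T137")]

# Map: composite key (join key, relation tag, tuple id); tag "0" for S sorts
# before tag "1" for T, so the S tuple always arrives first for its join key.
intermediate = [((k, "0", tid), attrs) for k, tid, attrs in S]
intermediate += [((k, "1", tid), attrs) for k, tid, attrs in T]

# Shuffle: the framework sorts on the full composite key; a custom partitioner
# would hash only on key[0] so all composites for one join key co-locate.
joined = []
current = None
for (k, tag, tid), attrs in sorted(intermediate):
    if tag == "0":
        current = (k, tid, attrs)                 # hold only the single S tuple
    else:
        joined.append((current, (tid, attrs)))    # stream T tuples past it

print(joined)
```

Only one S tuple is ever buffered, which is exactly the scalability property the pattern provides.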

Finally, let us consider the many-to-many join case. Assuming that S is the smaller dataset, the above algorithm works as well. Consider what happens at the reducer:

(k82, s105) → [(S105)]
(k82, s124) → [(S124)]
...
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
...

All the tuples from S with the same join key will be encountered first, which the reducer can buffer in memory. As the reducer processes each tuple from T, it is crossed with all the tuples from S. Of course, we are assuming that the tuples from S (with the same join key) will fit into memory, which is a limitation of this algorithm (and why we want to control the sort order so that the smaller dataset comes first).

The basic idea behind the reduce-side join is to repartition the two datasets by the join key. The approach isn't particularly efficient since it requires shuffling both datasets across the network. This leads us to the map-side join.

3.5.2 MAP-SIDE JOIN

Suppose we have two datasets that are both sorted by the join key. We can perform a join by scanning through both datasets simultaneously—this is known as a merge join in the database community. We can parallelize this by partitioning and sorting both datasets in the same way. For example, suppose S and T were both divided into ten files, partitioned in the same manner by the join key. Further suppose that in each file, the tuples were sorted by the join key. In this case, we simply need to merge join the first file of S with the first file of T, the second file of S with the second file of T, etc. This can be accomplished in parallel, in the map phase of a MapReduce job—hence, a map-side join. In practice, we map over one of the datasets (the larger one) and inside the mapper read the corresponding part of the other dataset to perform the merge join.14 No reducer is required, unless the programmer wishes to repartition the output or perform further processing.

A map-side join is far more efficient than a reduce-side join since there is no need to shuffle the datasets over the network. But is it realistic to expect that the stringent conditions required for map-side joins are satisfied? In many cases, yes. The reason is that relational joins happen within the broader context of a workflow, which may include multiple steps. Therefore, the datasets that are to be joined may be the output of previous processes (either MapReduce jobs or other code). If the workflow is known in advance and relatively static (both reasonable assumptions in a mature workflow), we can engineer the previous processes to generate output sorted and partitioned in a way that makes efficient map-side joins possible (in MapReduce, by using a custom partitioner and controlling the sort order of key-value pairs).

14 Note that this almost always implies a non-local read.
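The merge join over two sorted partitions can be sketched as follows (plain Python over toy partitions; in an actual map-side join this scan would run inside each mapper against the corresponding file of the other dataset).

```python
# Merge join of two partitions already sorted by join key (illustrative data).
s_part = [("k1", "S1"), ("k3", "S3"), ("k8", "S8")]
t_part = [("k1", "T1"), ("k3", "T3a"), ("k3", "T3b"), ("k9", "T9")]

def merge_join(s, t):
    """Scan both sorted partitions simultaneously, advancing whichever side lags."""
    out, i, j = [], 0, 0
    while i < len(s) and j < len(t):
        if s[i][0] < t[j][0]:
            i += 1
        elif s[i][0] > t[j][0]:
            j += 1
        else:
            k = s[i][0]
            # Cross this s tuple with every t tuple sharing the key.
            jj = j
            while jj < len(t) and t[jj][0] == k:
                out.append((k, s[i][1], t[jj][1]))
                jj += 1
            i += 1
            if i == len(s) or s[i][0] != k:
                j = jj          # done with this key on the t side as well
    return out

print(merge_join(s_part, t_part))
```

Each pass is a single linear scan of both inputs, which is why no shuffle is needed: the sorting and partitioning were paid for by the jobs that produced the data.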

For ad hoc data analysis, reduce-side joins are a more general, albeit less efficient, solution. Consider the case where datasets have multiple keys that one might wish to join on—then no matter how the data is organized, map-side joins will require repartitioning of the data. Alternatively, it is always possible to repartition a dataset using an identity mapper and reducer. But of course, this incurs the cost of shuffling data over the network.

There is a final restriction to bear in mind when using map-side joins with the Hadoop implementation of MapReduce. We assume here that the datasets to be joined were produced by previous MapReduce jobs, so this restriction applies to keys the reducers in those jobs may emit. Hadoop permits reducers to emit keys that are different from the input key whose values they are processing (that is, input and output keys need not be the same, nor even the same type).15 However, if the output key of a reducer is different from the input key, then the output dataset from the reducer will not necessarily be partitioned in a manner consistent with the specified partitioner (because the partitioner applies to the input keys rather than the output keys). Since map-side joins depend on consistent partitioning and sorting of keys, the reducers used to generate data that will participate in a later map-side join must not emit any key but the one they are currently processing.

15 In contrast, recall from Section 2.2 that in Google's implementation, reducers' output keys must be exactly the same as their input keys.

3.5.3 MEMORY-BACKED JOIN

In addition to the two previous approaches to joining relational data that leverage the MapReduce framework to bring together tuples that share a common join key, there is a family of approaches we call memory-backed joins, based on random access probes. The simplest version is applicable when one of the two datasets completely fits in memory on each node. In this situation, we can load the smaller dataset into memory in every mapper, populating an associative array to facilitate random access to tuples based on the join key. The mapper initialization API hook (see Section 3.1.1) can be used for this purpose. Mappers are then applied to the other (larger) dataset, and for each input key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with the same join key. If there is, the join is performed. This is known as a simple hash join by the database community [51].

What if neither dataset fits in memory? The simplest solution is to divide the smaller dataset, let's say S, into n partitions, such that S = S1 ∪ S2 ∪ ... ∪ Sn. We can choose n so that each partition is small enough to fit in memory, and then run n memory-backed hash joins. This, of course, requires streaming through the other dataset n times.
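In miniature, the simple hash join is only a few lines (plain Python with toy tuples; the dictionary stands in for the associative array built during mapper initialization).

```python
# Simple hash join: the smaller relation fits in memory on every node (toy data).
# Built once per mapper, in the initialization hook, keyed by the join key.
S = {"k1": ("s1", "S1"), "k3": ("s3", "S3")}

# The mapper then streams over the larger relation, probing the in-memory table.
T = [("k1", "t1", "T1"), ("k2", "t2", "T2"), ("k3", "t3", "T3")]
joined = [(k, S[k], (tid, attrs)) for k, tid, attrs in T if k in S]

print(joined)
```

The probe is a constant-time dictionary lookup, which is what makes this attractive whenever the memory requirement can be met; the partitioned variant described above simply repeats this loop once per partition of S.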

There is an alternative approach to memory-backed joins for cases where neither dataset fits into memory. A distributed key-value store can be used to hold one dataset in memory across multiple machines while mapping over the other. The mappers would then query this distributed key-value store in parallel and perform joins if the join keys match.16 The open-source caching system memcached can be used for exactly this purpose, and therefore we've dubbed this approach memcached join. For more information, this approach is detailed in a technical report [95].

16 In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests (to amortize latency over many requests) or to rely on asynchronous requests.

3.6 SUMMARY

This chapter provides a guide on the design of MapReduce algorithms. In particular, we present a number of "design patterns" that capture effective solutions to common problems. In summary, they are:

• "In-mapper combining", where the functionality of the combiner is moved into the mapper. Instead of emitting intermediate output for every input key-value pair, the mapper aggregates partial results across multiple input records and only emits intermediate key-value pairs after some amount of local aggregation is performed.

• The related patterns "pairs" and "stripes" for keeping track of joint events from a large number of observations. In the pairs approach, we keep track of each joint event separately, whereas in the stripes approach we keep track of all events that co-occur with the same event. Although the stripes approach is significantly more efficient, it requires memory on the order of the size of the event space, which presents a scalability bottleneck.

• "Order inversion", where the main idea is to convert the sequencing of computations into a sorting problem. Through careful orchestration, we can send the reducer the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation.

• "Value-to-key conversion", which provides a scalable solution for secondary sorting. By moving part of the value into the key, we can exploit the MapReduce execution framework itself for sorting.

Ultimately, controlling synchronization in the MapReduce programming model boils down to effective use of the following techniques:

1. Constructing complex keys and values that bring together data necessary for a computation. This is used in all of the above design patterns.

2. Executing user-specified initialization and termination code in either the mapper or reducer. For example, in-mapper combining depends on emission of intermediate key-value pairs in the map task termination code.

3. Preserving state across multiple inputs in the mapper and reducer. This is used in in-mapper combining, order inversion, and value-to-key conversion.

4. Controlling the sort order of intermediate keys. This is used in order inversion and value-to-key conversion.

5. Controlling the partitioning of the intermediate key space. This is used in order inversion and value-to-key conversion.

This concludes our overview of MapReduce algorithm design. It should be clear by now that although the programming model forces one to express algorithms in terms of a small set of rigidly-defined components, there are many tools at one's disposal to shape the flow of computation. In the next few chapters, we will focus on specific classes of MapReduce algorithms: for inverted indexing in Chapter 4, for graph processing in Chapter 5, and for expectation-maximization in Chapter 6.

CHAPTER 4

Inverted Indexing for Text Retrieval

Web search is the quintessential large-data problem. Given an information need expressed as a short query consisting of a few terms, the system's task is to retrieve relevant web objects (web pages, PDF documents, PowerPoint slides, etc.) and present them to the user. How large is the web? It is difficult to compute exactly, but even a conservative estimate would place the size at several tens of billions of pages, totaling hundreds of terabytes (considering text alone). Furthermore, users demand results quickly from a search engine—query latencies longer than a few hundred milliseconds will try a user's patience. Fulfilling these requirements is quite an engineering feat, considering the amounts of data involved!

Nearly all retrieval engines for full-text search today rely on a data structure called an inverted index, which given a term provides access to the list of documents that contain the term. In information retrieval parlance, objects to be retrieved are generically called "documents" even though in actuality they may be web pages, PDFs, or even fragments of code. Given a user query, the retrieval engine uses the inverted index to score documents that contain the query terms with respect to some ranking model, taking into account features such as term matches, term proximity, attributes of the terms in the document (e.g., bold, appears in title, etc.), as well as the hyperlink structure of the documents (e.g., PageRank [117], which we'll discuss in Chapter 5, or related metrics such as HITS [84] and SALSA [88]).

The web search problem decomposes into three components: gathering web content (crawling), construction of the inverted index (indexing), and ranking documents given a query (retrieval). Crawling and indexing share similar characteristics and requirements, but these are very different from retrieval. Gathering web content and building inverted indexes are for the most part offline problems. Both need to be scalable and efficient, but they do not need to operate in real time. Indexing is usually a batch process that runs periodically: the frequency of refreshes and updates is usually dependent on the design of the crawler. Some sites (e.g., news organizations) update their content quite frequently and need to be visited often; other sites (e.g., government regulations) are relatively static. However, even for rapidly changing sites, it is usually tolerable to have a delay of a few minutes until content is searchable. Furthermore, since the amount of content that changes rapidly is relatively small, running smaller-scale index updates at greater frequencies is usually an adequate solution.1

1 Leaving aside the problem of searching live data streams such as tweets, which requires different techniques and algorithms.

Retrieval, on the other hand, is an online problem that demands sub-second response time. Individual users expect low query latencies, but query throughput is equally important since a retrieval engine must usually serve many users concurrently. Furthermore, query loads are highly variable, depending on the time of day, and can exhibit "spikey" behavior due to special circumstances (e.g., a breaking news event triggers a large number of searches on the same topic). On the other hand, resource consumption for the indexing problem is more predictable.

A comprehensive treatment of web search is beyond the scope of this chapter, and even this entire book. Explicitly recognizing this, we mostly focus on the problem of inverted indexing, the task most amenable to solutions in MapReduce. Since MapReduce is primarily designed for batch-oriented processing, it does not provide an adequate solution for the retrieval problem, an issue we discuss in Section 4.6. This chapter begins by first providing an overview of web crawling (Section 4.1) and introducing the basic structure of an inverted index (Section 4.2). A baseline inverted indexing algorithm in MapReduce is presented in Section 4.3. We point out a scalability bottleneck in that algorithm, which leads to a revised version presented in Section 4.4. Index compression is discussed in Section 4.5, which fills in missing details on building compact index structures. The chapter concludes with a summary and pointers to additional readings.

4.1 WEB CRAWLING

Before building inverted indexes, we must first acquire the document collection over which these indexes are to be built. In academia and for research purposes, this can be relatively straightforward. Standard collections for information retrieval research are widely available for a variety of genres ranging from blogs to newswire text. For researchers who wish to explore web-scale retrieval, there is the ClueWeb09 collection that contains one billion web pages in ten languages (totaling 25 terabytes) crawled by Carnegie Mellon University in early 2009.2 Obtaining access to these standard collections is usually as simple as signing an appropriate data license from the distributor of the collection, paying a reasonable fee, and arranging for receipt of the data.3

For real-world web search, however, one cannot simply assume that the collection is already available. Acquiring web content requires crawling, which is the process of traversing the web by repeatedly following hyperlinks and storing downloaded pages for subsequent processing. Conceptually, the process is quite simple to understand: we start by populating a queue with a "seed" list of pages. The crawler downloads pages in the queue, extracts links from those pages to add to the queue, stores the pages for further processing, and repeats.

2 http://boston.lti.cs.cmu.edu/Data/clueweb09/
3 As an interesting side note, in the 1990s, research collections were distributed via postal mail on CD-ROMs, and later, on DVDs. Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today are so large that the only practical method of distribution is shipping hard drives via postal mail.
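Conceptually the crawl loop is just breadth-first traversal of the link graph. The sketch below (plain Python over an invented in-memory link graph, with no networking, politeness, or failure handling) captures only that skeleton.

```python
from collections import deque

# A toy link graph standing in for the web (all page names are invented).
links = {
    "seed.html": ["a.html", "b.html"],
    "a.html": ["b.html", "c.html"],
    "b.html": [],
    "c.html": ["seed.html"],
}

def crawl(seeds):
    """Breadth-first crawl: download, store, extract links, enqueue, repeat."""
    queue, seen, stored = deque(seeds), set(seeds), []
    while queue:
        page = queue.popleft()
        stored.append(page)               # stand-in for storing the fetched page
        for out in links.get(page, []):   # stand-in for link extraction
            if out not in seen:           # avoid downloading a page twice
                seen.add(out)
                queue.append(out)
    return stored

print(crawl(["seed.html"]))
```

The issues enumerated next—etiquette, prioritization, distribution, recrawling, duplicates, and language—are precisely what separates this toy loop from a production crawler.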

In fact, rudimentary web crawlers can be written in a few hundred lines of code. However, effective and efficient web crawling is far more complex. The following lists a number of issues that real-world crawlers must contend with:

• A web crawler must practice good "etiquette" and not overload web servers. For example, it is common practice to wait a fixed amount of time before repeated requests to the same server. In order to respect these constraints while maintaining good throughput, a crawler typically keeps many execution threads running in parallel and maintains many TCP connections (perhaps hundreds) open at the same time.

• Since a crawler has finite bandwidth and resources, it must prioritize the order in which unvisited pages are downloaded. Such decisions must be made online and in an adversarial environment, in the sense that spammers actively create "link farms" and "spider traps" full of spam pages to trick a crawler into overrepresenting content from a particular site.

• Most real-world web crawlers are distributed systems that run on clusters of machines, often geographically distributed. To avoid downloading a page multiple times and to ensure data consistency, the crawler as a whole needs mechanisms for coordination and load-balancing. It also needs to be robust with respect to machine failures, network outages, and errors of various types.

• Web content changes, but with different frequency depending on both the site and the nature of the content. A web crawler needs to learn these update patterns to ensure that content is reasonably current. Getting the right recrawl frequency is tricky: too frequent means wasted resources, but not frequent enough leads to stale content.

• The web is full of duplicate content. Examples include multiple copies of a popular conference paper, mirrors of frequently-accessed sites such as Wikipedia, and newswire content that is often duplicated. The problem is compounded by the fact that most repetitious pages are not exact duplicates but near duplicates (that is, basically the same page but with different ads, navigation bars, etc.). It is desirable during the crawling process to identify near duplicates and select the best exemplar to index.

• The web is multilingual. There is no guarantee that pages in one language only link to pages in the same language. For example, a professor in Asia may maintain her website in the local language, but contain links to publications in English.

Furthermore, many pages contain a mix of text in different languages. Since document processing techniques (e.g., tokenization, stemming) differ by language, it is important to identify the (dominant) language on a page.

The above discussion is not meant to be an exhaustive enumeration of issues, but rather to give the reader an appreciation of the complexities involved in this intuitively simple task. For more information, see a recent survey on web crawling [113]. Section 4.7 provides pointers to additional readings.

4.2 INVERTED INDEXES

In its basic form, an inverted index consists of postings lists, one associated with each term that appears in the collection.4 A postings list is comprised of individual postings, each of which consists of a document id and a payload—information about occurrences of the term in the document. The simplest payload is... nothing! For simple boolean retrieval, no additional information is needed in the posting other than the document id; the existence of the posting itself indicates the presence of the term in the document. The most common payload, however, is term frequency (tf), or the number of times the term occurs in the document. More complex payloads include positions of every occurrence of the term in the document (to support phrase queries and document scoring based on term proximity), properties of the term (such as if it occurred in the page title or not, to support document ranking based on notions of importance), or even the results of additional linguistic processing (for example, indicating that the term is part of a place name, to support address searches). In the web context, anchor text information (text associated with hyperlinks from other pages to the page in question) is useful in enriching the representation of document content (e.g., [107]); this information is often stored in the index as well.

The structure of an inverted index is illustrated in Figure 4.1. In the example shown there, we see that term1 occurs in {d1, d5, d6, d11, ...}, term2 occurs in {d11, d23, d59, d84, ...}, and term3 occurs in {d1, d4, d11, d19, ...}. In an actual implementation, we assume that documents can be identified by a unique integer ranging from 1 to n, where n is the total number of documents.5 Generally, postings are sorted by document id, although other sort orders are possible as well. The document ids have no inherent semantic meaning, although assignment of numeric ids to documents need not be arbitrary. For example, pages from the same domain may be consecutively numbered. Or, alternatively, pages that are higher in quality (based, for example, on PageRank values) might be assigned smaller numeric values so that they appear toward the front of a postings list.

4 In information retrieval parlance, term is preferred over word since documents are processed (e.g., tokenization and stemming) into basic units that are often not words in the linguistic sense.
5 It is preferable to start numbering the documents at one since it is not possible to code zero with many common compression schemes used in information retrieval.
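A toy construction of such an index, with term frequencies as payloads, might look as follows (plain Python; whitespace splitting stands in for real tokenization and stemming, and the documents are invented).

```python
from collections import defaultdict

# Toy collection; document ids start at 1 (see footnote 5).
docs = {1: "the dog barks", 2: "the cat and the dog", 3: "a cat naps"}

# Build postings lists: term -> [(document id, term frequency), ...],
# sorted by document id because docs are processed in id order.
index = defaultdict(list)
for doc_id in sorted(docs):
    tf = defaultdict(int)
    for term in docs[doc_id].split():   # stand-in for tokenization/stemming
        tf[term] += 1
    for term, freq in tf.items():
        index[term].append((doc_id, freq))

print(index["the"])
print(index["cat"])
```

Richer payloads (positions, term properties) would simply replace the integer frequency in each posting with a larger record, at a corresponding cost in index size.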

Given a query, retrieval involves fetching postings lists associated with query terms and traversing the postings to compute the result set. In the simplest case, boolean retrieval involves set operations (union for boolean OR and intersection for boolean AND) on postings lists, which can be accomplished very efficiently since the postings are sorted by document id. In the general case, however, query–document scores must be computed. Partial document scores are stored in structures called accumulators. At the end (i.e., once all postings have been processed), the top k documents are then extracted to yield a ranked list of results for the user. Of course, there are many optimization strategies for query evaluation (both approximate and exact) that reduce the number of postings a retrieval engine must examine.

Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides quick access to document ids that contain a term.

The size of an inverted index varies, depending on the payload stored in each posting. If only term frequency is stored, a well-optimized inverted index can be a tenth of the size of the original document collection. An inverted index that stores positional information would easily be several times larger than one that does not. Generally, it is possible to hold the entire vocabulary (i.e., the dictionary of all the terms) in memory, especially with techniques such as front-coding [156]. However, with the exception of well-resourced, commercial web search engines,⁶ postings lists are usually too large to store in memory and must be held on disk, usually in compressed form (more details in Section 4.5). Query evaluation, therefore, necessarily involves random disk access and "decoding" of the postings. One important aspect of the retrieval problem is to organize disk operations such that random seeks are minimized. Either way, an auxiliary data structure is necessary to maintain the mapping from integer document ids to some other more meaningful handle, such as a URL.

⁶ Google keeps indexes in memory.
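To make the postings traversal concrete, the following is a small illustrative Python sketch (ours, not the book's) of boolean AND over two docid-sorted postings lists. Because both lists are sorted by document id, intersection is a single linear merge rather than a nested scan; the example data is hypothetical.

```python
def intersect(p1, p2):
    """AND of two docid-sorted postings lists via a linear merge."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        d1, d2 = p1[i][0], p2[j][0]
        if d1 == d2:
            result.append(d1)       # docid present in both lists
            i, j = i + 1, j + 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return result

fish = [(1, 2), (2, 2)]   # (docid, tf) pairs: "fish" occurs in d1 and d2
red = [(2, 1), (3, 1)]    # "red" occurs in d2 and d3
print(intersect(fish, red))   # -> [2]
```

Boolean OR works the same way as a merge that keeps every docid seen in either list; both run in time linear in the combined length of the postings.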

Once again, this brief discussion glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval; our goal is to provide the reader with an overview of the important issues. Section 4.7 provides references to additional readings.

4.3 INVERTED INDEXING: BASELINE IMPLEMENTATION

MapReduce was designed from the very beginning to produce the various data structures involved in web search, including inverted indexes and the web graph. We begin with the basic inverted indexing algorithm shown in Figure 4.2.

1: class Mapper
2:   procedure Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(term t, posting ⟨n, H{t}⟩)

1: class Reducer
2:   procedure Reduce(term t, postings [⟨n1, f1⟩, ⟨n2, f2⟩, ...])
3:     P ← new List
4:     for all posting ⟨a, f⟩ ∈ postings [⟨n1, f1⟩, ⟨n2, f2⟩, ...] do
5:       Append(P, ⟨a, f⟩)
6:     Sort(P)
7:     Emit(term t, postings P)

Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Mappers emit postings keyed by terms, the execution framework groups postings by term, and the reducers write postings lists to disk.

Input to the mapper consists of document ids (keys) paired with the actual content (values). Individual documents are processed in parallel by the mappers. First, each document is analyzed and broken down into its component terms. The processing pipeline differs depending on the application and type of document, but for web pages typically involves stripping out HTML tags and other elements such as JavaScript code, tokenizing, case folding, removing stopwords (common words such as 'the', 'a', 'of', etc.), and stemming (removing affixes from words so that 'dogs' becomes 'dog'). Once the document has been analyzed, term frequencies are computed by iterating over all the terms and keeping track of counts. Lines 4 and 5 in the pseudo-code reflect the process of computing term frequencies, but hide the details of document processing.

After this histogram has been built, the mapper then iterates over all terms. For each term, a pair consisting of the document id and the term frequency is created. Each pair, denoted by ⟨n, H{t}⟩ in the pseudo-code, represents an individual posting. The mapper then emits, in line 7 of the mapper pseudo-code, an intermediate key-value pair with the term as the key and the posting as the value. Although as presented here only the term frequency is stored in the posting, this algorithm can be easily augmented to store additional information (e.g., term positions) in the payload.

In the shuffle and sort phase, the MapReduce runtime essentially performs a large, distributed group by of the postings by term. Without any additional effort by the programmer, the execution framework brings together all the postings that belong in the same postings list. This tremendously simplifies the task of the reducer, which simply needs to gather together all the postings and write them to disk. The reducer begins by initializing an empty list and then appends all postings associated with the same key (term) to the list. The postings are then sorted by document id, and the entire postings list is emitted as a value, with the term as the key. Typically, the postings list is first compressed, but we leave this aside for now (see Section 4.4 for more details). The final key-value pairs are written to disk and comprise the inverted index. Since each reducer writes its output in a separate file in the distributed file system, our final index will be split across r files, where r is the number of reducers. There is no need to further consolidate these files. Separately, we must also build an index to the postings lists themselves for the retrieval engine: this is typically in the form of mappings from term to (file, byte offset) pairs, so that given a term, the retrieval engine can fetch its postings list by opening the appropriate file and seeking to the correct byte offset position in that file.

Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example consisting of three documents, three mappers, and two reducers. Intermediate key-value pairs (from the mappers) and the final key-value pairs comprising the inverted index (from the reducers) are shown in the boxes with dotted lines. Postings are shown as pairs of boxes, with the document id on the left and the term frequency on the right.

The MapReduce programming model provides a very concise expression of the inverted indexing algorithm. Its implementation is similarly concise: the basic algorithm can be implemented in as few as a couple dozen lines of code in Hadoop (with minimal document processing). Such an implementation can be completed as a week-long programming assignment in a course for advanced undergraduates or first-year graduate students [83, 93]. In a non-MapReduce indexer, a significant fraction of the code is devoted to grouping postings by term, given constraints imposed by memory and disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the programmer does not need to worry about any of these issues—most of the heavy lifting is performed by the execution framework.
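The whole flow can be mimicked on a single machine. The following Python sketch (our illustration with hypothetical helper names; Hadoop itself is Java) simulates map, shuffle-and-sort, and reduce for the toy collection used in Figure 4.3, with whitespace tokenization standing in for the full document-processing pipeline.

```python
from collections import Counter, defaultdict

def map_fn(docid, doc):
    """Emit (term, (docid, tf)) pairs, like the mapper in Figure 4.2."""
    for term, tf in Counter(doc.split()).items():
        yield term, (docid, tf)

def shuffle(pairs):
    """Stand-in for MapReduce's shuffle and sort: group postings by term."""
    groups = defaultdict(list)
    for term, posting in pairs:
        groups[term].append(posting)
    return groups

def reduce_fn(term, postings):
    """Sort each postings list by docid and 'write' it out."""
    return term, sorted(postings)

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}
pairs = [p for docid, doc in docs.items() for p in map_fn(docid, doc)]
index = dict(reduce_fn(t, ps) for t, ps in shuffle(pairs).items())
print(index["fish"])   # -> [(1, 2), (2, 2)]
```

In a real Hadoop job the grouping is done by the framework's partitioner and sort, and each reducer would write its share of the index to a separate file.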

Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three mappers and two reducers. The toy collection consists of doc 1 ("one fish, two fish"), doc 2 ("red fish, blue fish"), and doc 3 ("one red bird"). Postings are shown as pairs of boxes (docid, tf).

4.4 INVERTED INDEXING: REVISED IMPLEMENTATION

The inverted indexing algorithm presented in the previous section serves as a reasonable baseline. However, there is a significant scalability bottleneck: the algorithm assumes that there is sufficient memory to hold all postings associated with the same term. Since the basic MapReduce execution framework makes no guarantees about the ordering of values associated with the same key, the reducer first buffers all postings (line 5 of the reducer pseudo-code in Figure 4.2) and then performs an in-memory sort before writing the postings to disk.⁷ For efficient retrieval, postings need to be sorted by document id. However, as collections become larger, postings lists grow longer, and at some point in time, reducers will run out of memory.

There is a simple solution to this problem. Since the execution framework guarantees that keys arrive at each reducer in sorted order, one way to overcome the scalability bottleneck is to let the MapReduce runtime do the sorting for us. Instead of emitting key-value pairs of the following type:

(term t, posting ⟨docid, f⟩)

we emit intermediate key-value pairs of this type instead:

(tuple ⟨t, docid⟩, tf f)

In other words, the key is a tuple containing the term and the document id, while the value is the term frequency. This is exactly the value-to-key conversion design pattern introduced in Section 3.4. With this modification, the programming model ensures that the postings arrive in the correct order. This, combined with the fact that reducers can hold state across multiple keys, allows postings lists to be created with minimal memory usage. As a detail, remember that we must define a custom partitioner to ensure that all tuples with the same term are shuffled to the same reducer.

The revised MapReduce inverted indexing algorithm is shown in Figure 4.4. The mapper remains unchanged for the most part, other than differences in the intermediate key-value pairs. The Reduce method is called for each key (i.e., ⟨t, n⟩), and by design, there will only be one value associated with each key. For each key-value pair, a posting can be directly added to the postings list. Since the postings are guaranteed to arrive in sorted order by document id, they can be incrementally coded in compressed form—thus ensuring a small memory footprint. Finally, when all postings associated with the same term have been processed (i.e., t ≠ tprev), the entire postings list is emitted. The final postings list must be written out in the Close method. As with the baseline algorithm, payloads can be easily changed: by simply replacing the intermediate value f (term frequency) with whatever else is desired (e.g., term positional information).

There is one more detail we must address when building inverted indexes. Since almost all retrieval models take into account document length when computing query–document scores, this information must also be extracted. Although it is straightforward to express this computation as another MapReduce job, this task can actually be folded into the inverted indexing process. When processing the terms in each document, the document length is known, and can be written out as "side data" directly to HDFS. We can take advantage of the ability for a mapper to hold state across the processing of multiple documents in the following manner: an in-memory associative array is created to store document lengths, which is populated as each document is processed.⁸ When the mapper finishes processing input records, document lengths are written out to HDFS (i.e., in the Close method). This approach is essentially a variant of the in-mapper combining pattern. Document length data ends up in m different files, where m is the number of mappers; these files are then consolidated into a more compact representation.

⁷ See the similar discussion in Section 3.4: in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer.
⁸ In general, there is no worry about insufficient memory to hold these data.
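The reducer side of the value-to-key pattern can be sketched in a few lines of Python (ours, not the book's code). Sorting the (term, docid) keys simulates what the MapReduce framework guarantees, so the reducer can append each posting as it arrives, with no per-term buffering or re-sorting.

```python
def reduce_stream(sorted_pairs):
    """sorted_pairs: iterable of ((term, docid), tf), sorted by key."""
    t_prev, postings = None, []
    for (term, docid), tf in sorted_pairs:
        if term != t_prev and t_prev is not None:
            yield t_prev, postings        # Emit inside Reduce
            postings = []                 # P.Reset()
        postings.append((docid, tf))      # P.Add: already in docid order
        t_prev = term
    if t_prev is not None:
        yield t_prev, postings            # Emit inside Close

pairs = [(("fish", 1), 2), (("fish", 2), 2), (("one", 1), 1), (("one", 3), 1)]
print(dict(reduce_stream(sorted(pairs))))
# -> {'fish': [(1, 2), (2, 2)], 'one': [(1, 1), (3, 1)]}
```

In a real implementation the `postings.append` line is where each posting would be incrementally compressed, which is what keeps the memory footprint small.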

Alternatively, document length information can be emitted in special key-value pairs by the mapper. One must then write a custom partitioner so that these special key-value pairs are shuffled to a single reducer, which will be responsible for writing out the length data separate from the postings lists.

1: class Mapper
2:   method Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(tuple ⟨t, n⟩, tf H{t})

1: class Reducer
2:   method Initialize
3:     tprev ← ∅
4:     P ← new PostingsList
5:   method Reduce(tuple ⟨t, n⟩, tf [f])
6:     if t ≠ tprev ∧ tprev ≠ ∅ then
7:       Emit(term tprev, postings P)
8:       P.Reset()
9:     P.Add(⟨n, f⟩)
10:    tprev ← t
11:  method Close
12:    Emit(term t, postings P)

Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they arrive sorted by document id in the reducer.

4.5 INDEX COMPRESSION

We return to the question of how postings are actually compressed and stored on disk. This chapter devotes a substantial amount of space to this topic because index compression is one of the main differences between a "toy" indexer and one that works on real-world collections; otherwise, MapReduce inverted indexing algorithms are pretty straightforward. Let us consider the canonical case where each posting consists of a document id and the term frequency. A naïve implementation might represent the first as a 32-bit integer⁹ and the second as a 16-bit integer. This naïve implementation would require six bytes per posting. Using this scheme, the entire inverted index would be about as large as the collection itself. Fortunately, we can do significantly better.

⁹ However, note that 2³² − 1 is "only" 4,294,967,295, which is much less than even the most conservative estimate of the size of the web.

The first trick is to encode differences between document ids as opposed to the document ids themselves. Since the postings are sorted by document ids, the differences (called d-gaps) must be positive integers greater than zero. For example, a postings list might be encoded as follows:

[(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), ...]

where each posting is represented by a pair in parentheses. Note that all brackets, parentheses, and commas are only included to enhance readability; in reality the postings would be represented as a long stream of integers. The above postings list, represented with d-gaps, would be:

[(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), ...]

Of course, we must actually encode the first document id. We haven't lost any information, since the original document ids can be easily reconstructed from the d-gaps. However, it's not obvious that we've reduced the space requirements either, since the largest possible d-gap is one less than the number of documents in the collection. This is where the second trick comes in, which is to represent the d-gaps in a way such that it takes less space for smaller numbers. Similarly, we want to apply the same techniques to compress the term frequencies, since for the most part they are also small values. But to understand how this is done, we need to take a slight detour into compression techniques.

To start, it is important to understand that all compression techniques represent a time–space tradeoff. That is, we reduce the amount of space on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding and decoding data. Therefore, it is possible that compression reduces size but also slows processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up with disk bandwidth), we can achieve the best of both worlds: smaller and faster. Compression, in general, can be characterized as either lossless or lossy: it's fairly obvious that lossless compression is required in this context.

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and 2³² − 1, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously, encoding d-gaps this way doesn't yield any reductions in size. A simple approach to compression is to only use as many bytes as is necessary to represent the integer. This is known as variable-length integer coding (varInt for short) and accomplished by using the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero elsewhere. Therefore, we have 7 bits per byte for coding the value, which means that 0 ≤ n < 2⁷ can be expressed with 1 byte, 2⁷ ≤ n < 2¹⁴ with 2 bytes, 2¹⁴ ≤ n < 2²¹ with 3 bytes, and 2²¹ ≤ n < 2²⁸ with 4 bytes. This scheme can be extended to code arbitrarily-large integers (i.e., beyond 4 bytes). As a concrete example, the two numbers:

127, 128

would be coded as such:

1 1111111, 0 0000001 1 0000000

The above code contains two code words, the first consisting of 1 byte, and the second consisting of 2 bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are byte-aligned because the code words always fall along byte boundaries, so there is never any ambiguity about where one code word ends and the next begins. However, the downside of varInt coding is that decoding involves lots of bit operations (masks, shifts), which slows down processing. Furthermore, the continuation bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps).

A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.¹⁰ The insight is to code groups of four integers at a time. Each group begins with a prefix byte, divided into four 2-bit values that specify the byte length of each of the following integers. For example, the following prefix byte:

00,00,01,10

indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Using this scheme, each group of four integers would consume anywhere between 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder on how to process subsequent bytes to recover the coded integers. The advantage of this group varInt coding scheme is that values can be decoded with fewer branch mispredicts and bitwise operations. Experiments reported by Dean suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.

In most architectures, accessing entire machine words is more efficient than fetching all its bytes separately. Therefore, it makes sense to store postings in increments of 16-bit, 32-bit, or 64-bit machine words.

¹⁰ http://research.google.com/people/jeff/WSDM09-keynote.pdf
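To make the byte-level mechanics concrete, here is a short Python sketch (ours, not the book's) of varInt coding applied to d-gaps, following the convention above: most significant 7-bit group first, with the continuation bit set to one only in the last byte.

```python
def varint_encode(n):
    """Code one non-negative integer, most significant 7-bit group first."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()
    # continuation bit: 0 on all but the last byte, 1 on the last
    return bytes(groups[:-1]) + bytes([0x80 | groups[-1]])

def varint_decode(data):
    vals, cur = [], 0
    for b in data:
        cur = (cur << 7) | (b & 0x7F)
        if b & 0x80:                 # high bit set: this code word ends here
            vals.append(cur)
            cur = 0
    return vals

def encode_postings(docids):
    """Encode a sorted docid list as varInt-coded d-gaps."""
    out, prev = bytearray(), 0
    for d in docids:
        out += varint_encode(d - prev)
        prev = d
    return bytes(out)

print(varint_encode(128).hex())                    # -> 0180  (0 0000001, 1 0000000)
print(varint_decode(encode_postings([5, 7, 12])))  # -> [5, 2, 5]  (the d-gaps)
```

Decoding never needs a delimiter because the continuation bit marks the end of every code word, which is exactly the byte-aligned property discussed above.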

Anh and Moffat [8] presented several word-aligned coding methods, one of which is called Simple-9, based on 32-bit words. In this coding scheme, four bits in each 32-bit word are reserved as a selector; the remaining 28 bits are used to code actual integer values. There are a variety of ways these 28 bits can be divided to code one or more integers: 28 bits can be used to code one 28-bit integer, two 14-bit integers, three 9-bit integers (with one bit unused), etc., all the way up to twenty-eight 1-bit integers. In fact, there are nine different ways the 28 bits can be divided into equal parts (hence the name of the technique), some with leftover unused bits. This is stored in the selector bits. Therefore, decoding involves reading a 32-bit word, examining the selector to see how the remaining 28 bits are packed, and then appropriately decoding each integer. Coding works in the opposite way: the algorithm scans ahead to see how many integers can be squeezed into 28 bits, packs those integers, and sets the selector bits appropriately.

4.5.2 BIT-ALIGNED CODES

The advantage of byte-aligned and word-aligned codes is that they can be coded and decoded quickly. The downside, however, is that they must consume multiples of eight bits, even when fewer bits might suffice (the Simple-9 scheme gets around this by packing multiple integers into a 32-bit word, but even then, bits are often wasted). In bit-aligned codes, on the other hand, code words can occupy any number of bits, meaning that boundaries can fall anywhere. In practice, coding and decoding bit-aligned codes require processing bytes and appropriately shifting or masking bits (usually more involved than varInt and group varInt coding).

One additional challenge with bit-aligned codes is that we need a mechanism to delimit code words, i.e., to tell where the last ends and the next begins, since there are no byte boundaries to guide us. To address this issue, most bit-aligned codes are so-called prefix codes (confusingly, they are also called prefix-free codes), in which no valid code word is a prefix of any other valid code word. For example, coding 0 ≤ x < 3 with {0, 1, 01} is not a valid prefix code, since 0 is a prefix of 01, and so we can't tell if 01 is two code words or one. On the other hand, {00, 01, 1} is a valid prefix code, such that a sequence of bits:

0001101001010100

can be unambiguously segmented into:

00 01 1 01 00 1 01 01 00

and decoded without any additional delimiters.

One of the simplest prefix codes is the unary code. An integer x > 0 is coded as x − 1 one bits followed by a zero bit. Note that unary codes do not allow the representation of zero, which is fine since d-gaps and term frequencies should never be zero. With unary code we can code x in x bits. Unary codes of the first ten positive integers are shown in Figure 4.5. Unary codes are rarely used by themselves, but form a component of other coding schemes, since unary code, although economical for small values, becomes inefficient for even moderately large values.

    x    unary        γ          Golomb b=5    Golomb b=10
    1    0            0          0:00          0:000
    2    10           10:0       0:01          0:001
    3    110          10:1       0:10          0:010
    4    1110         110:00     0:110         0:011
    5    11110        110:01     0:111         0:100
    6    111110       110:10     10:00         0:101
    7    1111110      110:11     10:01         0:1100
    8    11111110     1110:000   10:10         0:1101
    9    111111110    1110:001   10:110        0:1110
    10   1111111110   1110:010   10:111        0:1111

Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.

Elias γ code is an efficient coding scheme that is widely used in practice. An integer x > 0 is broken into two components, 1 + ⌊log₂ x⌋ (= n, the length) and x − 2^⌊log₂ x⌋ (= r, the remainder).¹¹ The unary component n specifies the number of bits required to code x, and is coded in unary code; the binary component codes the remainder r in n − 1 bits.¹² As an example, consider x = 10: 1 + ⌊log₂ 10⌋ = 4, so the unary component is 1110. The binary component codes x − 2³ = 2 in 4 − 1 = 3 bits, which is 010. Putting both together, we arrive at 1110:010. The extra colon is inserted only for readability; it's not part of the final code. The γ codes of the first ten positive integers are shown in Figure 4.5. Working in reverse, it is easy to unambiguously decode a bit stream of γ codes: first, we read a unary code cu, which is a prefix code. This tells us that the binary portion is written in cu − 1 bits, which we then read as cb. We can then reconstruct x as 2^(cu − 1) + cb. For x < 16, γ codes occupy less than a full byte, which makes them more compact than variable-length integer codes. Since term frequencies for the most part are relatively small, γ codes make sense for them and can yield substantial space savings.

¹¹ As a note, some sources describe slightly different formulations of the same coding scheme; here, we adopt the conventions used in the classic IR text Managing Gigabytes [156].
¹² Note that ⌊x⌋ is the floor function, which maps x to the largest integer not greater than x. This is the default behavior in many programming languages when casting from a floating-point type to an integer type.
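The γ encode/decode procedure just described can be sketched in Python (our illustration, following the Managing Gigabytes convention used in Figure 4.5; the colon separators are omitted since they are not part of the code).

```python
def gamma_encode(x):
    """Elias gamma code for x > 0: length in unary, remainder in binary."""
    assert x > 0
    n = x.bit_length()                  # n = 1 + floor(log2 x)
    unary = "1" * (n - 1) + "0"         # code n in unary
    binary = bin(x)[3:]                 # drop '0b' and the leading 1 bit,
    return unary + binary               # i.e., x - 2^(n-1) in n-1 bits

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    vals, i = [], 0
    while i < len(bits):
        n = 1
        while bits[i] == "1":           # read the unary length code cu
            n += 1
            i += 1
        i += 1                          # skip the terminating 0
        r = bits[i:i + n - 1]           # read cu - 1 binary digits
        i += n - 1
        vals.append((1 << (n - 1)) + (int(r, 2) if r else 0))
    return vals

print(gamma_encode(10))                                  # -> 1110010
print(gamma_decode(gamma_encode(5) + gamma_encode(1)))   # -> [5, 1]
```

Because the unary prefix is itself a prefix code, the decoder always knows how many binary digits follow, so no delimiters are needed between code words.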


A variation on γ code is δ code, where the n portion of the γ code is coded in γ code itself (as opposed to unary code). For smaller values γ codes are more compact, but for larger values, δ codes take less space.

Unary and γ codes are parameterless, but even better compression can be achieved with parameterized codes. A good example of this is Golomb code. For some parameter b, an integer x > 0 is coded in two parts: first, we compute q = ⌊(x − 1)/b⌋ and code q + 1 in unary; then, we code the remainder r = x − qb − 1 in truncated binary. This is accomplished as follows: if b is a power of two, then truncated binary is exactly the same as normal binary, requiring ⌊log₂ b⌋ bits. Otherwise, we code the first 2^(⌊log₂ b⌋+1) − b values of r in ⌊log₂ b⌋ bits and code the rest of the values of r by coding r + 2^(⌊log₂ b⌋+1) − b in ordinary binary representation using ⌊log₂ b⌋ + 1 bits. In this case, r is coded in either ⌊log₂ b⌋ or ⌊log₂ b⌋ + 1 bits, and unlike ordinary binary coding, truncated binary codes are prefix codes. As an example, if b = 5, then r can take the values {0, 1, 2, 3, 4}, which would be coded with the following code words: {00, 01, 10, 110, 111}. For reference, Golomb codes of the first ten positive integers are shown in Figure 4.5 for b = 5 and b = 10. A special case of Golomb code is worth noting: if b is a power of two, then coding and decoding can be handled more efficiently (needing only bit shifts and bit masks, as opposed to multiplication and division). These are known as Rice codes.

Researchers have shown that Golomb compression works well for d-gaps, and is optimal with the following parameter setting:

    b ≈ 0.69 × N/df    (4.1)

where df is the document frequency of the term, and N is the number of documents in the collection.¹³ Putting everything together, one popular approach for postings compression is to represent d-gaps with Golomb codes and term frequencies with γ codes [156, 162]. If positional information is desired, we can use the same trick to code differences between term positions using γ codes.

4.5.3 POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be properly compressed. As we can see from the previous section, there is a wide range of choices that represent different tradeoffs between compression ratio and decoding speed. Actual performance also depends on characteristics of the collection, which, among other factors, determine the distribution of d-gaps.
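The truncated-binary rule in the Golomb description above is the only fiddly part; the following Python sketch (ours, assuming b > 1) reproduces the code words of Figure 4.5.

```python
from math import floor, log2

def golomb_encode(x, b):
    """Golomb code for x > 0 with parameter b > 1, as described above."""
    assert x > 0 and b > 1
    q = (x - 1) // b
    r = x - q * b - 1
    k = floor(log2(b))
    threshold = (1 << (k + 1)) - b      # first 2^(k+1) - b remainders get k bits
    if r < threshold:
        rest = format(r, "b").zfill(k)
    else:                               # remaining values get k + 1 bits
        rest = format(r + threshold, "b").zfill(k + 1)
    return "1" * q + "0" + rest         # q + 1 in unary, then truncated binary

print(golomb_encode(9, 5))    # -> 10110  (10:110 in Figure 4.5)
print(golomb_encode(10, 10))  # -> 01111  (0:1111 in Figure 4.5)
```

When b is a power of two, `threshold` equals b, so every remainder takes exactly k plain binary bits, which is the Rice-code special case noted above.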

¹³ For details as to why this is the case, we refer the reader elsewhere [156], but here's the intuition: under reasonable assumptions, the appearance of postings can be modeled as a sequence of independent Bernoulli trials, which implies a certain distribution of d-gaps. From this we can derive an optimal setting of b.

Böttcher et al. [30] recently compared the performance of various compression techniques on coding document ids. In terms of the amount of compression that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best, followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of raw decoding speed, the order was almost the reverse: group varInt was the fastest, followed by varInt.¹⁴ Simple-9 was substantially slower, and the bit-aligned codes were even slower than that. Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes being the slowest (about ten times slower than group varInt).

Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that this represents a space-efficient encoding, at the cost of slower decoding compared to alternatives. Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of Golomb codes serves a pedagogical purpose, to illustrate how one might set compression parameters. Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the document collection and the number of postings for a particular postings list (i.e., the document frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The df of a term, however, is not known until all the postings have been processed—and unfortunately, the parameter must be known before any posting is coded. At first glance, this seems like a chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in memory) would suffer from the memory bottleneck we've been trying to avoid in the first place.
To get around this problem, we need to somehow inform the reducer of a term's df before any of its postings arrive. This can be solved with the order inversion design pattern introduced in Section 3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition to emitting intermediate key-value pairs of the following form:

(tuple ⟨t, docid⟩, tf f)

we also emit special intermediate key-value pairs like this:

(tuple ⟨t, ∗⟩, df e)

to keep track of document frequencies associated with each term. In practice, we can accomplish this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an in-memory associative array that keeps track of how many documents a term has been observed in (i.e., the local document frequency of the term for

¹⁴ However, this study found less speed difference between group varInt and basic varInt than Dean's analysis, presumably due to the different distribution of d-gaps in the collections they were examining.


the subset of documents processed by the mapper). Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the partial df as the value. To ensure that these special keys arrive first, we define the sort order of the tuple so that the special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial df values originating from each mapper. Summing all these partial contributions will yield the term's df, which can then be used to set the Golomb compression parameter b. This allows the postings to be incrementally compressed as they are encountered in the reducer—memory bottlenecks are eliminated since we do not need to buffer postings in memory. Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation. For computing relative frequencies, that bit of information was the marginal. In this case, it's the document frequency.
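The key-ordering trick can be sketched in Python (our illustration; the `SPECIAL` sentinel is a hypothetical stand-in for the symbol ∗, chosen so it sorts before any real docid). The reducer learns each term's df, and hence the Golomb parameter b, before the first posting of that term arrives.

```python
SPECIAL = -1   # stands in for *, sorts before every real docid

def set_params(sorted_pairs, n_docs):
    """Consume reducer input in key order; learn df before any posting."""
    params = {}
    for (term, docid), value in sorted_pairs:
        if docid == SPECIAL:
            df = sum(value)                                   # sum partial dfs
            params[term] = max(1, round(0.69 * n_docs / df))  # Equation 4.1
        else:
            # postings for `term` arrive only after params[term] is set,
            # so each one could be Golomb-coded immediately here
            assert term in params
    return params

pairs = [
    (("fish", SPECIAL), [1, 1]),   # partial dfs from two mappers
    (("fish", 1), 2), (("fish", 2), 2),
    (("red", SPECIAL), [1]),
    (("red", 3), 1),
]
print(set_params(sorted(pairs), n_docs=3))   # -> {'fish': 1, 'red': 2}
```

`sorted()` here plays the role of the framework's sort with the custom key ordering; in Hadoop the same effect requires defining the comparator and partitioner described in the text.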

4.6 WHAT ABOUT RETRIEVAL?

Thus far, we have briefly discussed web crawling and focused mostly on MapReduce algorithms for inverted indexing. What about retrieval? It should be fairly obvious that MapReduce, which was designed for large batch operations, is a poor solution for retrieval. Since users demand sub-second response times, every aspect of retrieval must be optimized for low latency, which is exactly the opposite tradeoff made in MapReduce. Recall the basic retrieval problem: we must look up postings lists corresponding to query terms, systematically traverse those postings lists to compute query–document scores, and then return the top k results to the user. Looking up postings implies random disk seeks, since for the most part postings are too large to fit into memory (leaving aside caching and other special cases for now). Unfortunately, random access is not a forte of the distributed file system underlying MapReduce: such operations require multiple round-trip network exchanges (and associated latencies). In HDFS, a client must first obtain the location of the desired data block from the namenode before the appropriate datanode can be contacted for the actual data. Of course, access will typically require a random disk seek on the datanode itself.

It should be fairly obvious that serving the search needs of a large number of users, each of whom demand sub-second response times, is beyond the capabilities of any single machine. The only solution is to distribute retrieval across a large number of machines, which necessitates breaking up the index in some manner. There are two main partitioning strategies for distributed retrieval: document partitioning and term partitioning. Under document partitioning, the entire collection is broken up into multiple smaller sub-collections, each of which is assigned to a server. In other words, each server holds the complete index for a subset of the entire collection. This corresponds to partitioning vertically in Figure 4.6. With term partitioning, on the other hand, each server is responsible for a subset of the terms for the entire collection. That is, a server holds the postings for all documents in the collection for a subset of terms. This corresponds to partitioning horizontally in Figure 4.6.

Figure 4.6: Term–document matrix for a toy collection (nine documents, nine terms) illustrating different partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document partitioning, whereas partitioning horizontally (a, b, c) corresponds to term partitioning.

Document and term partitioning require different retrieval strategies and represent different tradeoffs. Retrieval under document partitioning involves a query broker, which forwards the user's query to all partition servers, merges partial results from each, and then returns the final results to the user. With this architecture, searching the entire collection requires that the query be processed by every partition server. However, since each partition operates independently and traverses postings in parallel, document partitioning typically yields shorter query latencies (compared to a single monolithic index with much longer postings lists).

Retrieval under term partitioning, on the other hand, requires a very different strategy. Suppose the user's query Q contains three terms, q1, q2, and q3. Under the pipelined query evaluation strategy, the broker begins by forwarding the query to the server that holds the postings for q1 (usually the least frequent term). The server traverses the appropriate postings list and computes partial query–document scores, stored in the accumulators. The accumulators are then passed to the server that holds the postings associated with q2 for additional processing, and then to the server for q3, before final results are passed back to the broker and returned to the user. Although this query evaluation strategy may not substantially reduce the latency of any particular query, it can theoretically increase a system's throughput due to the far smaller number of total disk seeks required for each user query (compared to document partitioning). However, load-balancing is tricky in a pipelined term-partitioned architecture due to skew in the distribution of query terms, which can create "hot spots" on servers that hold the postings for frequently-occurring query terms.

In general, studies have shown that document partitioning is a better strategy overall [109], and this is the strategy adopted by Google [16]. Furthermore, it is known that Google maintains its indexes in memory (although this is certainly not the common case for search engines in general). One key advantage of document partitioning is that result quality degrades gracefully with machine failures. Partition servers that are offline will simply fail to deliver results for their subsets of the collection. With sufficient partitions, users might not even be aware that documents are missing. For most queries, the web contains more relevant documents than any user has time to digest: users of course care about getting relevant documents (sometimes, they are happy with a single relevant document), but they are generally less discriminating when it comes to which relevant documents appear in their results (out of the set of all relevant documents). Note that partitions may be unavailable due to reasons other than machine failure: cycling through different partitions is a very simple and non-disruptive strategy for index updates.

Working in a document-partitioned architecture, there are a variety of approaches to dividing up the web into smaller pieces. Proper partitioning of the collection can address one major weakness of this architecture, which is that every partition server is involved in every user query. Along one dimension, it is desirable to partition by document quality using one or more classifiers; see [124] for a recent survey on web page classification. Partitioning by document quality supports a multi-phase search strategy: the system examines partitions containing high quality documents first, and only backs off to partitions containing lower quality documents if necessary. This reduces the number of servers that need to be contacted for a user query. Along an orthogonal dimension, it is desirable to partition documents by content (perhaps also guided by the distribution of user queries from logs), so that each partition is "well separated" from the others in terms of topical coverage. This also reduces the number of machines that need to be involved in serving a user's query: the broker can direct queries only to the partitions that are likely to contain relevant documents, as opposed to forwarding the user query to all the partitions.
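To make the document-partitioning strategy concrete, here is a toy, single-process sketch of a query broker that fans a query out to partition "servers" and merges their partial top-k lists. The data layout and the additive term-frequency scoring are invented for illustration; real engines use far more sophisticated scoring and network protocols.

```python
# Toy document-partitioned retrieval: each partition scores only its own
# sub-collection; the broker merges the partial top-k results.
import heapq

def partition_search(index, query, k):
    """index: {term: {docid: tf}} for one partition's sub-collection."""
    scores = {}
    for term in query:
        for docid, tf in index.get(term, {}).items():
            scores[docid] = scores.get(docid, 0) + tf  # toy scoring: sum of tfs
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])

def broker(partitions, query, k):
    """Forward the query to every partition (they could run in parallel),
    then merge the partial result lists into a global top k."""
    partial = []
    for index in partitions:
        partial.extend(partition_search(index, query, k))
    return heapq.nlargest(k, partial, key=lambda x: x[1])
```

Note how the failure mode matches the text: if one partition is dropped from the list, the broker still returns results, just without that partition's documents.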

On a large scale, reliability of service is provided by replication, both in terms of multiple machines serving the same partition within a single datacenter, but also replication across geographically-distributed datacenters. This creates at least two query routing problems: since it makes sense to serve clients from the closest datacenter, a service must route queries to the appropriate location. Within a single datacenter, the system needs to properly balance load across replicas.

There are two final components of real-world search engines that are worth discussing. First, recall that postings only store document ids. Therefore, raw retrieval results consist of a ranked list of semantically meaningless document ids. It is typically the responsibility of document servers, functionally distinct from the partition servers holding the indexes, to generate meaningful output for user presentation. Abstractly, a document server takes as input a query and a document id, and computes an appropriate result entry, typically comprising the title and URL of the page, a snippet of the source document showing the user's query terms in context, and additional metadata about the document. Second, query evaluation can benefit immensely from caching, of individual postings (assuming that the index is not already in memory) and even results of entire queries [13]. This is made possible by the Zipfian distribution of queries, with very frequent queries at the head of the distribution dominating the total number of queries. Search engines take advantage of this with cache servers, which are functionally distinct from all of the components discussed above.

4.7 SUMMARY AND ADDITIONAL READINGS

Web search is a complex problem that breaks down into three conceptually-distinct components. First, the documents collection must be gathered (by crawling the web). Next, inverted indexes and other auxiliary data structures must be built from the documents. Both of these can be considered offline problems. Finally, index structures must be accessed and processed in response to user queries to generate search results. This last task is an online problem that demands both low latency and high throughput. Of these three, this chapter primarily focused on building inverted indexes, the problem most suitable for MapReduce. After all, inverted indexing is nothing but a very large distributed sort and group by operation! We began with a baseline implementation of an inverted indexing algorithm, but quickly noticed a scalability bottleneck that stemmed from having to buffer postings in memory. Application of the value-to-key conversion design pattern (Section 3.4) addressed the issue by offloading the task of sorting postings by document id to the MapReduce execution framework. We also surveyed various techniques for integer compression, which yield postings lists that are both more compact and faster to process. As a specific example, one could use Golomb codes for compressing d-gaps and γ codes for term frequencies. We showed how the order inversion design pattern introduced in Section 3.3 for computing relative frequencies can be used to properly set compression parameters.

Additional Readings. Our brief discussion of web search glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval. Here, we provide a few entry points into the literature. A survey article by Zobel and Moffat [162] is an excellent starting point on indexing and retrieval algorithms. Another by Baeza-Yates et al. [11] overviews many important issues in distributed retrieval. A keynote talk at the WSDM 2009 conference by Jeff Dean revealed a lot of information about the evolution of the Google search architecture.15 Finally, a number of general information retrieval textbooks have been recently published [101, 42, 30]. Of these three, the one by Büttcher et al. [30] is noteworthy in having detailed experimental evaluations that compare the performance (both effectiveness and efficiency) of a wide range of algorithms and techniques. While outdated in many other respects, the textbook Managing Gigabytes [156] remains an excellent source for index compression techniques. Finally, ACM SIGIR is an annual conference and the most prestigious venue for academic information retrieval research; proceedings from those events are perhaps the best starting point for those wishing to keep abreast of publicly-documented developments in the field.

15 http://research.google.com/people/jeff/WSDM09-keynote.pdf

CHAPTER 5

Graph Algorithms

Graphs are ubiquitous in modern society: examples encountered by almost everyone on a daily basis include the hyperlink structure of the web (simply known as the web graph), social networks (manifest in the flow of email, phone call patterns, connections on social networking sites, etc.), and transportation networks (roads, bus routes, flights, etc.). Our very own existence is dependent on an intricate metabolic and regulatory network, which can be characterized as a large, complex graph involving interactions between genes, proteins, and other cellular products. In general, graphs can be characterized by nodes (or vertices) and links (or edges) that connect pairs of nodes.1 These connections can be directed or undirected. In some graphs, there may be an edge from a node to itself, resulting in a self loop; in others, such edges are disallowed. We assume that both nodes and links may be annotated with additional metadata: as a simple example, in a social network where nodes represent individuals, there might be demographic information (e.g., age, gender, location) attached to the nodes and type information attached to the links (e.g., indicating type of relationship such as "friend" or "spouse").

This chapter focuses on graph algorithms in MapReduce. Although most of the content has nothing to do with text processing per se, documents frequently exist in the context of some underlying network, making graph analysis an important component of many text processing applications. Perhaps the best known example is PageRank, a measure of web page quality based on the structure of hyperlinks, which is used in ranking results for web search. As one of the first applications of MapReduce, PageRank exemplifies a large class of graph algorithms that can be concisely captured in the programming model. We will discuss PageRank in detail later this chapter.

Mathematicians have always been fascinated with graphs, dating back to Euler's paper on the Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs have been extensively studied, and today much is known about their properties. Far more than theoretical curiosities, theorems and algorithms on graphs can be applied to solve many real-world problems:

• Graph search and path planning. Search algorithms on graphs are invoked millions of times a day, whenever anyone searches for directions on the web. Similar algorithms are also involved in friend recommendations and expert-finding in social networks. Path planning problems involving everything from network packets to delivery trucks represent another large class of graph search problems.

1 Throughout this chapter, we use node interchangeably with vertex and similarly with link and edge.

• Graph clustering. Can a large graph be divided into components that are relatively disjoint (for example, as measured by inter-component links [59])? Among other applications, this task is useful for identifying communities in social networks (of interest to sociologists who wish to understand how human relationships form and evolve) and for partitioning large graphs (of interest to computer scientists who seek to better parallelize graph processing). See [158] for a survey.

• Minimum spanning trees. A minimum spanning tree for a graph G with weighted edges is a tree that contains all vertices of the graph and a subset of edges that minimizes the sum of edge weights. A real-world example of this problem is a telecommunications company that wishes to lay optical fiber to span a number of destinations at the lowest possible cost (where weights denote costs). This approach has also been applied to wide variety of problems, including social networks and the migration of Polynesian islanders [64].

• Bipartite graph matching. A bipartite graph is one whose vertices can be divided into two disjoint sets. Matching problems on such graphs can be used to model job seekers looking for employment or singles looking for dates.

• Maximum flow. In a weighted directed graph with two special nodes called the source and the sink, the max flow problem involves computing the amount of "traffic" that can be sent from source to sink given various flow capacities defined by edge weights. Transportation companies (airlines, shipping, etc.) and network operators grapple with complex versions of these problems on a daily basis.

• Identifying "special" nodes. There are many ways to define what special means, including metrics based on node in-degree, average distance to other nodes, and relationship to cluster structure. These special nodes are important to investigators attempting to break up terrorist cells, epidemiologists modeling the spread of diseases, advertisers trying to promote products, and many others.

A common feature of these problems is the scale of the datasets on which the algorithms must operate: for example, the hyperlink structure of the web, which contains billions of pages, or social networks that contain hundreds of millions of individuals. Clearly, algorithms that run on a single machine and depend on the entire graph residing in memory are not scalable. We'd like to put MapReduce to work on these challenges.2 This chapter is organized as follows: we begin in Section 5.1 with an introduction to graph representations, and then explore two classic graph algorithms in MapReduce: parallel breadth-first search (Section 5.2) and PageRank (Section 5.3).

2 As a side note, Google recently published a short description of a system called Pregel [98] for large-scale graph algorithms, based on Valiant's Bulk Synchronous Parallel model [148]; a longer description is anticipated in a forthcoming paper [99].

Section 5. In the case of graphs with weighted edges. perhaps.2) and PageRank (Section 5. the matrix cells contain edge weights. parallel breadth-ﬁrst search (Section 5.3 For example. each cell contains either a one (indicating an edge). the diagonal might be populated. n5] [n4] [n5] [n1. only half the matrix is used (e.4 discusses a number of general issue that aﬀect graph processing with MapReduce. With undirected graphs. . n4] [n3. but one common deﬁnition is that a sparse graph has O(n) edges.1.1 GRAPH REPRESENTATIONS One common way to represent graphs is with an adjacency matrix.5. 5. Figure 5. such a representation is far from ideal for computer scientists concerned with eﬃcient algorithmic implementations. where n is the number of vertices. Although mathematicians prefer the adjacency matrix representation of graphs for easy manipulation with linear algebra.g.1: A simple directed graph (left) represented as an adjacency matrix (middle) and with adjacency lists (right). there is no precise deﬁnition of sparseness agreed upon by all. cells above the diagonal).. The same is true for the hyperlink structure of the web: each individual web page links to a minuscule portion of all the pages on the 3 Unfortunately. otherwise.1 provides an example of a simple directed graph (left) and its adjacency matrix representation (middle). where the number of actual edges is far smaller than the number of possible edges. or a zero (indicating none). there are n(n − 1) possible “friendships” (where n may be on the order of hundreds of millions). For graphs that allow self loops (a directed edge from a node to itself). in a social network of n individuals. Most of the applications discussed in the chapter introduction involve sparse graphs. Before concluding with a summary and pointing out additional readings. but still far smaller than hundreds of millions). otherwise. n3] 93 n4 adjacency matrix adjacency lists Figure 5.3). the diagonal remains empty. 
A graph with n nodes can be represented as an n × n square matrix M . However. where a value in cell mij indicates an edge from node ni to node nj . even the most gregarious will have relatively few friends compared to the size of the network (thousands. n2. GRAPH REPRESENTATIONS n2 n1 n1 n1 n2 n3 n5 n3 n4 n5 0 0 0 0 1 n2 1 0 0 0 1 n3 0 1 0 0 1 n4 1 0 1 0 0 n5 0 1 0 1 0 n1 n2 n3 n4 n5 [n2.

As a result, most of the cells in the adjacency matrix are zero. Indeed, the major problem with an adjacency matrix representation for sparse graphs is its O(n2) space requirement: most of the cells are, by definition, zero. Most computational implementations of graph algorithms therefore operate over adjacency lists, in which a node is associated with neighbors that can be reached via outgoing edges. Figure 5.1 also shows the adjacency list representation of the graph under consideration (on the right). For example, since n1 is connected by directed edges to n2 and n4, those two nodes will be on the adjacency list of n1. There are two options for encoding undirected graphs: one could simply encode each edge twice (if ni and nj are connected, each appears on each other's adjacency list). Alternatively, one could order the nodes (arbitrarily or otherwise) and encode edges only on the adjacency list of the node that comes first in the ordering (i.e., if i < j, then nj is on the adjacency list of ni, but not the other way around).

Note that certain graph operations are easier on adjacency matrices than on adjacency lists. In the first, operations on incoming links for each node translate into a column scan on the matrix, whereas operations on outgoing links for each node translate into a row scan. With adjacency lists, it is natural to operate on outgoing links, but computing anything that requires knowledge of the incoming links of a node is difficult. However, as we shall see, the shuffle and sort mechanism in MapReduce provides an easy way to group edges by their destination nodes, thus allowing us to compute over incoming edges in the reducer.4 This property of the execution framework can also be used to invert the edges of a directed graph, by mapping over the nodes' adjacency lists and emitting key–value pairs with the destination node id as the key and the source node id as the value. In this chapter, we assume processing of sparse graphs, although we will return to this issue in Section 5.4.

5.2 PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source shortest path problem, where the task is to find shortest paths from a source node to all other nodes in the graph (or alternatively, edges can be associated with costs or weights, in which case the task is to compute lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate algorithm courses, where students are taught the solution using Dijkstra's algorithm. However, this famous algorithm assumes sequential processing: how would we solve this problem in parallel, and more specifically, with MapReduce?

4 This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a particular page. It is common practice to enrich a web page's standard textual representation with all of the anchor text associated with its incoming hyperlinks (e.g., [107]).
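The edge-inversion trick described above amounts to a map step followed by a group-by. The sketch below simulates it in one process, with an in-memory dictionary standing in for MapReduce's shuffle/sort (which groups emitted pairs by key for free):

```python
# Invert a directed graph: map over adjacency lists, emit (destination,
# source) pairs, and group by destination. The grouping mimics shuffle/sort.
from collections import defaultdict

def invert(adj_list):
    grouped = defaultdict(list)           # plays the role of shuffle/sort
    for src, outs in adj_list.items():    # "map" over the adjacency lists
        for dst in outs:
            grouped[dst].append(src)      # emit (dst, src)
    return {n: sorted(v) for n, v in grouped.items()}
```

After inversion, each node's list holds its incoming rather than outgoing edges, which is precisely what anchor text inversion needs.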

1: Dijkstra(G, w, s)
2:   d[s] ← 0
3:   for all vertex v ∈ V do
4:     d[v] ← ∞
5:   Q ← {V }
6:   while Q ≠ ∅ do
7:     u ← ExtractMin(Q)
8:     for all vertex v ∈ u.AdjacencyList do
9:       if d[v] > d[u] + w(u, v) then
10:        d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra's algorithm, which is based on maintaining a global priority queue of nodes with priorities equal to their distances from the source node. At each iteration, the algorithm expands the node with the shortest distance and updates distances to all reachable nodes.

As a refresher and also to serve as a point of comparison, Dijkstra's algorithm is shown in Figure 5.2, adapted from Cormen, Leiserson, and Rivest's classic algorithms textbook [41] (often simply known as CLR). The input to the algorithm is a directed, connected graph G = (V, E) represented with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the source node s. The algorithm begins by first setting distances to all vertices d[v], v ∈ V to ∞, except for the source node, whose distance to itself is zero. The algorithm maintains Q, a global priority queue of vertices with priorities equal to their distance values d.

Dijkstra's algorithm operates by iteratively selecting the node with the lowest current distance from the priority queue (initially, this is the source node). At each iteration, the algorithm "expands" that node by traversing the adjacency list of the selected node to see if any of those nodes can be reached with a path of a shorter distance. The algorithm terminates when the priority queue Q is empty, or equivalently, when all nodes have been considered. Note that the algorithm as presented in Figure 5.2 only computes the shortest distances. The actual paths can be recovered by storing "backpointers" for every node indicating a fragment of the shortest path.

A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also adapted from CLR). We start out in (a) with n1 having a distance of zero (since it's the source) and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected for expansion. Nodes we have already considered for expansion are shown in black. Expanding n3, we see in (c) that the distance to n2 has decreased because we've found a shorter path.
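Figure 5.2's pseudo-code translates almost line-for-line into a runnable sketch. The version below substitutes a binary heap with lazy deletion of stale entries for a priority queue with a DecreaseKey operation; that substitution is an implementation choice of this sketch, not something prescribed by the figure.

```python
# Runnable rendering of Figure 5.2: Dijkstra's algorithm with a binary heap.
import heapq

def dijkstra(graph, s):
    """graph: {node: [(neighbor, weight), ...]} with weights >= 0."""
    d = {v: float("inf") for v in graph}
    d[s] = 0
    pq = [(0, s)]                         # priority queue Q
    while pq:
        du, u = heapq.heappop(pq)         # ExtractMin
        if du > d[u]:
            continue                      # stale entry: skip (lazy deletion)
        for v, w in graph[u]:             # expand u's adjacency list
            if d[v] > d[u] + w:
                d[v] = d[u] + w
                heapq.heappush(pq, (d[v], v))
    return d
```

Run on the five-node example of Figure 5.3 (as adapted from CLR), this reproduces the final distances shown in part (f).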

The nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with the end state shown in (f), where we've discovered the shortest distance to all nodes.

Figure 5.3: Example of Dijkstra's algorithm applied to a simple graph with five nodes, with n1 as the source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm at each iteration, with the current distance inside the node. Nodes with thicker borders are those being expanded; nodes that have already been expanded are shown in black.

The key to Dijkstra's algorithm is the priority queue that maintains a globally-sorted list of nodes by current distance. This is not possible in MapReduce, as the programming model does not provide a mechanism for exchanging global data. Instead, we adopt a brute force approach known as parallel breadth-first search. First, as a simplification let us assume that all edges have unit distance (modeling, for example, hyperlinks on the web). This makes the algorithm easier to understand, but we'll relax this restriction later.

The intuition behind the algorithm is this: the distance of all nodes connected directly to the source node is one; the distance of all nodes directly connected to those is two; and so on. Imagine water rippling away from a rock dropped into a pond: that's a good image of how parallel breadth-first search works. However, what if there are multiple paths to the same node? Suppose we wish to compute the shortest distance

to node n. The shortest path must go through one of the nodes in M that contains an outgoing edge to n: we need to examine all m ∈ M to find ms, the node with the shortest distance. The shortest distance to n is the distance to ms plus one.

Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided in Figure 5.4. As with Dijkstra's algorithm, we assume a connected, directed graph represented as adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node, and initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote the node id (an integer) and N to denote the node's corresponding data structure (adjacency list and current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key contains the node id of the neighbor, and the value is the current distance to the node plus one. This says: if we can reach node n with a distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1. After shuffle and sort, reducers will receive keys corresponding to the destination node ids and distances corresponding to all paths leading to that node. The reducer will select the shortest of these distances and then update the distance in the node data structure.

It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration corresponds to a MapReduce job. The first time we run the algorithm, we "discover" all nodes that are connected to the source. The second iteration, we discover all nodes connected to those, and so on. Each iteration of the algorithm expands the "search frontier" by one hop, and, eventually, all nodes will be discovered with their shortest distances (assuming a fully-connected graph). Before we discuss termination of the algorithm, there is one more detail required to make the parallel breadth-first search algorithm work. We need to "pass along" the graph structure from one iteration to the next. This is accomplished by emitting the node data structure itself, with the node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum distance in the node data structure before emitting it as the final value. The final output is now ready to serve as input to the next iteration.5

So how many iterations are necessary to compute the shortest distance to all nodes? The answer is the diameter of the graph, or the greatest distance between any pair of nodes. This number is surprisingly small for many real-world problems: the saying "six degrees of separation" suggests that everyone on the planet is connected to everyone else by at most six steps (the people a person knows are one step away, people that they know are two steps away, etc.). If this is indeed true, then parallel breadth-first search on the global social network would take at most six MapReduce iterations.

5 Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a complex data structure representing a node. The best way to achieve this in Hadoop is to create a wrapper object with an indicator variable specifying what the content is.
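One iteration of this algorithm can be simulated in a single process, with a dictionary of emitted key-value lists standing in for the shuffle/sort. The (distance, adjacency_list) tuple below is this sketch's stand-in for the wrapper object mentioned in footnote 5; a None adjacency list marks a plain distance value.

```python
# Toy simulation of one parallel BFS iteration: mappers emit (neighbor, d+1)
# plus the node structure itself; reducers keep the minimum distance and
# recover the adjacency list.
from collections import defaultdict

INF = float("inf")

def bfs_iteration(nodes):
    """nodes: {nid: (distance, adjacency_list)} -> same shape, one hop later."""
    emitted = defaultdict(list)
    for nid, (d, adj) in nodes.items():
        emitted[nid].append((d, adj))             # pass along graph structure
        if d != INF:
            for m in adj:
                emitted[m].append((d + 1, None))  # distance to reachable node
    result = {}                                   # the "reduce" phase
    for nid, values in emitted.items():
        dmin, graph = INF, []
        for d, adj in values:
            if adj is not None:
                graph = adj                       # recover graph structure
            dmin = min(dmin, d)                   # look for shorter distance
        result[nid] = (dmin, graph)
    return result
```

Iterating the function expands the search frontier by one hop per call, exactly as the text describes for successive MapReduce jobs.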

e. Typically.AdjacencyList do Emit(nid m.4: Pseudo-code for parallel breath-ﬁrst search in MapReduce: the mappers emit distances to reachable nodes. [d1 . GRAPH ALGORITHMS 1: 2: 3: 4: 5: 6: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: class Mapper method Map(nid n. as the name suggests. For more serious academic studies of “small world” phenomena in networks. which submits a MapReduce job to iterate the algorithm. there is not a shorter path that goes through a node that hasn’t been discovered). repeats. . d + 1) Emit distances to reachable nodes class Reducer method Reduce(nid m. we iterate the algorithm until there are no more node distances that are ∞. and since all edge distances are one. Hadoop provides a lightweight API for constructs called “counters”. all discovered nodes are guaranteed to have the shortest distances (i. number of corrupt records. and if not.e. Counters can be deﬁned to count the number of nodes that have distances of ∞: at the end of the job. . we refer the reader to a number of publications [61. or anything that the programmer desires. the driver program can access the ﬁnal counter value and check to see if another iteration is necessary.g. Each iteration (one MapReduce job) of the algorithm expands the “search frontier” by one hop. d2 .Distance ← dmin Emit(nid m.. N ) Pass along graph structure for all nodeid m ∈ N. Since the graph is connected. all nodes are reachable.]) dmin ← ∞ M ←∅ for all d ∈ counts [d1 . checks to see if a termination condition has been met. . while the reducers select the minimum of those distances for each destination node. . can be used for counting events that occur during execution. . 2]. The actual checking of the termination condition must occur outside of MapReduce.Distance Emit(nid n. execution of an iterative MapReduce algorithm requires a nonMapReduce “driver” program. node M ) Recover graph structure Look for shorter distance Update shortest distance Figure 5. . d2 . 152. In practical terms. which. 
number of times a certain condition is met. 62..] do if IsNode(d) then M ←d else if d < dmin then dmin ← d M. node N ) d ← N.98 CHAPTER 5. .
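The mapper/reducer pair of Figure 5.4 can be simulated on a single machine to see one iteration in action. The sketch below is illustrative only: a Python dict stands in for the distributed file system, grouping by key stands in for the shuffle-and-sort phase, and the reducer is slightly defensive in that it keeps the smaller of the node's current distance and the candidate distances. Unit edge distances are assumed.

```python
from collections import defaultdict

INF = float("inf")

def map_fn(nid, node):
    # pass along the graph structure itself (Figure 5.4, line 4)
    yield nid, node
    if node["dist"] < INF:                  # only discovered nodes emit distances
        for m in node["adj"]:
            yield m, node["dist"] + 1       # emit distances to reachable nodes

def reduce_fn(nid, values):
    d_min, node = INF, None
    for v in values:
        if isinstance(v, dict):             # recover graph structure
            node = v
        elif v < d_min:                     # look for a shorter distance
            d_min = v
    node["dist"] = min(node["dist"], d_min) # update shortest distance
    return nid, node

def bfs_iteration(graph):
    grouped = defaultdict(list)             # stands in for shuffle and sort
    for nid, node in graph.items():
        for k, v in map_fn(nid, node):
            grouped[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

graph = {
    1: {"adj": [2, 3], "dist": 0},          # the source node
    2: {"adj": [3], "dist": INF},
    3: {"adj": [4], "dist": INF},
    4: {"adj": [], "dist": INF},
}
g1 = bfs_iteration(graph)                   # discovers nodes 2 and 3
g2 = bfs_iteration(g1)                      # discovers node 4
```

Each call to `bfs_iteration` corresponds to one MapReduce job, and the search frontier advances one hop per call.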

Up until now, we have been assuming that all edges are unit distance. Let us relax that restriction and see what changes are required in the parallel breadth-first search algorithm. The adjacency lists, which were previously lists of node ids, must now encode the edge distances as well. In line 6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now emit d + w, where w is the edge distance. No other changes to the algorithm are required, but the termination behavior is very different.

Also note that, as with Dijkstra's algorithm in the form presented earlier, the parallel breadth-first search algorithm only finds the shortest distances, not the actual shortest paths. Storing “backpointers” at each node, as with Dijkstra's algorithm, will work, but may not be efficient, since the graph needs to be traversed again to reconstruct the path segments. A simpler approach is to emit paths along with distances in the mapper, so that each node will have its shortest path easily accessible at all times. The additional space requirements for shuffling these data from mappers to reducers are relatively modest, since for the most part paths (i.e., sequences of node ids) are relatively short.

Figure 5.5: In the single source shortest path problem with arbitrary edge distances, the shortest path from source s to node r may go outside the current search frontier, in which case we will not find the shortest distance to r until the search frontier expands to cover q.

To illustrate, consider the graph fragment in Figure 5.5, where s is the source node, and in this iteration, we just “discovered” node r for the very first time. Assume for the sake of argument that we've already discovered the shortest distance to node p, and that the shortest distance to r so far goes through p. This, however, does not guarantee that we've discovered the shortest distance to r, since there may exist a path going through q that we haven't encountered yet (because it lies outside the search frontier; note that the same argument does not apply to the unit edge distance case, where the shortest path cannot lie outside the search frontier, since any such path would necessarily be longer). However, as the search frontier expands, we'll eventually cover q and all other nodes along the path from p to q to r—which means that with sufficient iterations, we will discover the shortest distance to r. But how do we know that we've found the shortest distance to p? Well, if the shortest path to p lies within the search frontier, we would have already discovered it.

And if it doesn't, the above argument applies, in which case we can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with sufficient iterations, we'll eventually discover all the shortest distances.

So exactly how many iterations does “eventually” mean? In the worst case, we might need as many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct graphs that will elicit this worst-case behavior: Figure 5.6 provides an example, with n1 as the source. The parallel breadth-first search algorithm would not discover that the shortest path from n1 to n6 goes through n3, n4, and n5 until the fifth iteration; three more iterations are necessary to cover the rest of the graph, so eight iterations are required to discover shortest distances to all nodes from n1. Fortunately, for most real-world graphs, such extreme cases are rare, and the number of iterations necessary to discover all shortest distances is quite close to the diameter of the graph, as in the unit edge distance case.

Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search.

Similarly, how do we know when to stop iterating in the case of arbitrary edge distances? The algorithm can terminate when shortest distances at every node no longer change. Once again, we can use counters to keep track of such events: every time we encounter a shorter distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the driver program reads the counter value and determines if another iteration is necessary.

Compared to Dijkstra's algorithm on a single processor, parallel breadth-first search in MapReduce can be characterized as a brute force approach that “wastes” a lot of time performing computations whose results are discarded. At each iteration, the algorithm attempts to recompute distances to all nodes, but in reality only useful work is done along the search frontier: inside the search frontier, the algorithm is simply repeating previous computations (unless it discovers an instance of the situation described in Figure 5.5, in which case updated distances will propagate inside the search frontier). Outside the search frontier, the algorithm hasn't discovered any paths to nodes there yet, so no meaningful work is done.
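The arbitrary-edge-distance variant and its counter-based termination can be sketched end-to-end. The code below is a single-machine illustration, not the Hadoop counter API: each simulated job returns an updated graph plus a counter of how many shorter distances were found, and the driver loops until that counter is zero.

```python
from collections import defaultdict

INF = float("inf")

def map_fn(nid, node):
    yield nid, node                              # pass along graph structure
    if node["dist"] < INF:
        for m, w in node["adj"]:                 # adjacency list: (neighbor, weight)
            yield m, node["dist"] + w            # emit d + w rather than d + 1

def reduce_fn(nid, values, counter):
    d_min, node = INF, None
    for v in values:
        if isinstance(v, dict):
            node = v                             # recover graph structure
        elif v < d_min:
            d_min = v
    if d_min < node["dist"]:                     # a shorter distance was found:
        node["dist"] = d_min
        counter["updated"] += 1                  # increment the counter
    return nid, node

def driver(graph):
    iterations = 0
    while True:                                  # one simulated MapReduce job per loop
        grouped = defaultdict(list)              # stands in for shuffle and sort
        for nid, node in graph.items():
            for k, v in map_fn(nid, node):
                grouped[k].append(v)
        counter = {"updated": 0}
        graph = dict(reduce_fn(k, vs, counter) for k, vs in grouped.items())
        iterations += 1
        if counter["updated"] == 0:              # termination condition met
            return graph, iterations

g = {
    "s": {"adj": [("p", 1), ("q", 10)], "dist": 0},
    "p": {"adj": [("q", 1)], "dist": INF},
    "q": {"adj": [("r", 1)], "dist": INF},
    "r": {"adj": [], "dist": INF},
}
g, n_iter = driver(g)
# r is first reached via the heavy s -> q edge, but later iterations
# find the shorter route s -> p -> q -> r with total distance 3
```

Note that one extra “no change” iteration is always needed so that the counter can confirm that the distances have stabilized.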

Dijkstra's algorithm, on the other hand, is far more efficient: every time a node is explored, we're guaranteed to have already found the shortest path to it. However, this is made possible by maintaining a global data structure (a priority queue) that holds nodes sorted by distance—and this is not possible in MapReduce, because the programming model does not provide support for global data that is mutable and accessible by the mappers and reducers. These inefficiencies represent the cost of parallelization.

The parallel breadth-first search algorithm is instructive in that it represents the prototypical structure of a large class of graph algorithms in MapReduce. They share in the following characteristics:

• The graph structure is represented with adjacency lists, which are part of some larger node data structure that may contain additional information (variables to store intermediate output, features of the nodes). In many cases, features are attached to edges as well (e.g., edge weights).

• The MapReduce algorithm maps over the node data structures and performs a computation that is a function of features of the node, intermediate output attached to each node, and features of the adjacency list (outgoing edges and their features). In other words, computations can only involve a node's internal state and its local graph structure. The results of these computations are emitted as values, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency lists). Conceptually, we can think of this as “passing” the results of the computation along outgoing edges. In the reducer, the algorithm receives all partial results that have the same destination node, and performs another computation (usually, some form of aggregation).

• In addition to computations, the graph itself is also passed from the mapper to the reducer. In the reducer, the data structure corresponding to each node is updated and written back to disk.

• Graph algorithms in MapReduce are generally iterative, where the output of the previous iteration serves as input to the next iteration. The process is controlled by a non-MapReduce driver program that checks for termination.

For parallel breadth-first search, the mapper computation is the current distance plus edge distance (emitting distances to neighbors), while the reducer computation is the Min function (selecting the shortest path). As we will see in the next section, the MapReduce algorithm for PageRank works in much the same way.

5.3 PAGERANK

PageRank [117] is a measure of web page quality based on the structure of the hyperlink graph. Although it is only one of thousands of features that is taken into account in Google's search algorithm, it is perhaps one of the best known and most studied. A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer visits a page, randomly clicks a link on that page, and repeats ad infinitum. PageRank is a measure of how frequently a page would be encountered by our tireless web surfer. More precisely, PageRank is a probability distribution over nodes in the graph representing the likelihood that a random walk over the link structure will arrive at a particular node. Nodes that have high in-degrees tend to have high PageRank values, as well as nodes that are linked to by other nodes with high PageRank values. This behavior makes intuitive sense: if PageRank is a measure of page quality, we would expect high-quality pages to contain “endorsements” from many other pages in the form of hyperlinks. Similarly, if a high-quality page links to another page, then the second page is likely to be high quality also. PageRank represents one particular approach to inferring the quality of a web page based on hyperlink structure; two other popular algorithms, not covered here, are SALSA [88] and HITS [84] (also known as “hubs and authorities”).

The complete formulation of PageRank includes an additional component. As it turns out, our web surfer doesn't just randomly click links. Before the surfer decides where to go next, a biased coin is flipped—heads, the surfer clicks on a random link on the page as usual; tails, the surfer ignores the links on the page and randomly “jumps” or “teleports” to a completely different page. But enough about random web surfing. Formally, the PageRank P of a page n is defined as follows:

    P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m)/C(m)    (5.1)

where |G| is the total number of nodes (pages) in the graph, α is the random jump factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of node m (the number of links on page m). The random jump factor α is sometimes called the “teleportation” factor; alternatively, (1 − α) is referred to as the “damping” factor.

Let us break down each component of the formula in detail. First, note that PageRank is defined recursively—this gives rise to an iterative algorithm we will detail in a bit. A web page n receives PageRank “contributions” from all pages that link to it, L(n). Let us consider a page m from the set of pages L(n): a random surfer at m will arrive at n with probability 1/C(m), since a link is selected at random from all outgoing links. Since the PageRank value of m is the probability that the random surfer will be at m, the probability of arriving at n from m is P(m)/C(m). To compute the PageRank of n, we need to sum contributions from all pages that link to n.
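Because Equation (5.1) is recursive, it can be computed by simple fixed-point iteration. The sketch below makes that concrete on a single machine; it is illustrative only and assumes that every node has at least one outgoing link (no dangling nodes).

```python
# Direct application of Equation (5.1), iterated to a fixed point.
# Assumes no dangling nodes (every page has at least one outgoing link).

def pagerank(links, alpha=0.15, iterations=50):
    # links: node -> list of nodes it links to (its adjacency list)
    nodes = list(links)
    G = len(nodes)
    p = {n: 1.0 / G for n in nodes}          # start from a uniform distribution
    for _ in range(iterations):
        new_p = {}
        for n in nodes:
            # sum P(m)/C(m) over all pages m in L(n) that link to n
            s = sum(p[m] / len(links[m]) for m in nodes if n in links[m])
            new_p[n] = alpha * (1.0 / G) + (1 - alpha) * s
        p = new_p
    return p

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

At every iteration the values remain a valid probability distribution, which is exactly the mass-conservation property exploited by the MapReduce formulation below.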

However, we also need to take into account the random jump: there is a 1/|G| chance of landing at any particular page, where |G| is the number of nodes in the graph. Of course, the two contributions need to be combined: with probability α the random surfer executes a random jump, and with probability 1 − α the random surfer follows a hyperlink. This is the summation in the second half of the equation.

Note that PageRank assumes a community of honest users who are not trying to “game” the measure. This is, of course, not true in the real world, where an adversarial relationship exists between search engine companies and a host of other organizations and individuals (marketers, spammers, activists, etc.) who are trying to manipulate search results—to promote a cause, product, or service, or, in some cases, to trap and intentionally deceive users (see, for example, [12, 63]). A simple example is a so-called “spider trap”, an infinite chain of pages (e.g., generated by CGI) that all link to a single page (thereby artificially inflating its PageRank). For this reason, PageRank is only one of thousands of features used in ranking web pages.

The fact that PageRank is recursively defined translates into an iterative algorithm which is quite similar in basic structure to parallel breadth-first search. We start by presenting an informal sketch. At the beginning of each iteration, a node passes its PageRank contributions to other nodes that it is connected to. Since PageRank is a probability distribution, we can think of this as spreading probability mass to neighbors via outgoing links. To conclude the iteration, each node sums up all PageRank contributions that have been passed to it and computes an updated PageRank score. We can think of this as gathering probability mass passed to a node via its incoming links. This algorithm iterates until PageRank values don't change anymore.

Figure 5.7 shows a toy example that illustrates two iterations of the algorithm. As a simplification, we ignore the random jump factor for now (i.e., α = 0) and further assume that there are no dangling nodes (i.e., nodes with no outgoing edges). The algorithm begins by initializing a uniform distribution of PageRank values across nodes. In the beginning of the first iteration (top, left), partial PageRank contributions are sent from each node to its neighbors connected via outgoing links. For example, n1 sends 0.1 PageRank mass to n2 and 0.1 PageRank mass to n4. This makes sense in terms of the random surfer model: if the surfer is at n1 with a probability of 0.2, then the surfer could end up either in n2 or n4 with a probability of 0.1 each. The same occurs for all the other nodes in the graph: note that n5 must split its PageRank mass three ways, since it has three neighbors, and n4 receives all the mass belonging to n3 because n3 isn't connected to any other node. The end of the first iteration is shown in the top right: each node sums up PageRank contributions from its neighbors. Note that since n1 has only one incoming link, from n3, its updated PageRank value is smaller than before, i.e., it “passed along” more PageRank mass than it received. The exact same

process repeats, and the second iteration in our toy example is illustrated by the bottom two graphs.

Figure 5.7: PageRank toy example showing two iterations, top and bottom. Left graphs show PageRank values at the beginning of each iteration and how much PageRank mass is passed to each neighbor. Right graphs show updated PageRank values at the end of each iteration.

The algorithm maps over the nodes, and for each node computes how much PageRank mass needs to be distributed to its neighbors (i.e., nodes on the adjacency list). Each piece of the PageRank mass is emitted as the value, keyed by the node ids of the neighbors. Conceptually, we can think of this as passing PageRank mass along outgoing edges. In the shuffle and sort phase, the MapReduce execution framework groups values (pieces of PageRank mass) passed along the graph edges by destination node (i.e., all edges that point to the same node). In the reducer, PageRank mass contributions from all incoming edges are summed to arrive at the updated PageRank value for each node. At the beginning of each iteration, the PageRank values of all nodes sum to one. PageRank mass is preserved by the algorithm, guaranteeing that we continue to have a valid probability distribution at the end of each iteration. Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is simplified in that we continue to ignore the random jump factor and assume no dangling nodes (complications that we will return to later). An illustration of the running algorithm is shown in Figure 5.9 for the first iteration of the toy graph in Figure 5.7.

1: class Mapper
2:   method Map(nid n, node N)
3:     p ← N.PageRank/|N.AdjacencyList|
4:     Emit(nid n, N)                            ▷ Pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, p)                          ▷ Pass PageRank mass to neighbors

1: class Reducer
2:   method Reduce(nid m, [p1, p2, . . .])
3:     s ← 0; M ← ∅
4:     for all p ∈ [p1, p2, . . .] do
5:       if IsNode(p) then
6:         M ← p                                 ▷ Recover graph structure
7:       else
8:         s ← s + p                             ▷ Sum incoming PageRank contributions
9:     M.PageRank ← s
10:    Emit(nid m, node M)

Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the random jump factor). In the map phase we evenly divide up each node's PageRank mass and pass each piece along outgoing edges to neighbors. In the reduce phase PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the algorithm.
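As with breadth-first search, Figure 5.8 can be simulated in a few lines of Python. The sketch below is illustrative (grouping by key stands in for shuffle and sort) and, like the figure, ignores the random jump factor and dangling nodes.

```python
from collections import defaultdict

def map_fn(nid, node):
    yield nid, node                              # pass along graph structure
    share = node["pagerank"] / len(node["adj"])  # evenly divide the node's mass
    for m in node["adj"]:
        yield m, share                           # pass PageRank mass to neighbors

def reduce_fn(nid, values):
    s, node = 0.0, None
    for v in values:
        if isinstance(v, dict):
            node = v                             # recover graph structure
        else:
            s += v                               # sum incoming PageRank contributions
    node["pagerank"] = s
    return nid, node

def pagerank_iteration(graph):
    grouped = defaultdict(list)                  # stands in for shuffle and sort
    for nid, node in graph.items():
        for k, v in map_fn(nid, node):
            grouped[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

g = {
    "a": {"adj": ["b"], "pagerank": 1 / 3},
    "b": {"adj": ["a", "c"], "pagerank": 1 / 3},
    "c": {"adj": ["a"], "pagerank": 1 / 3},
}
g = pagerank_iteration(g)
```

Because every unit of mass emitted by a mapper is summed by exactly one reducer, the values still form a probability distribution after the iteration—the mass-conservation property discussed above.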

As with the parallel breadth-first search algorithm, the graph structure itself must be passed from iteration to iteration: each node data structure is emitted in the mapper and written back out to disk in the reducer. All PageRank mass emitted by the mappers is accounted for in the reducer: since we begin with the sum of PageRank values across all nodes equal to one, the sum of all the updated PageRank values should remain a valid probability distribution.

Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the first iteration in Figure 5.7. The size of each box is proportional to its PageRank value. During the map phase, PageRank mass is distributed evenly to nodes on each node's adjacency list (shown at the very top). Intermediate values are keyed by node (shown inside the boxes). In the reduce phase, all partial PageRank contributions are summed together to arrive at updated values.

Having discussed the simplified PageRank algorithm in MapReduce, let us now take into account the random jump factor and dangling nodes: as it turns out, both are treated similarly. Dangling nodes are nodes in the graph that have no outgoing edges, i.e., their adjacency lists are empty. In the hyperlink graph of the web, these might correspond to pages in a crawl that have not been downloaded yet. If we simply run the algorithm in Figure 5.8 on graphs with dangling nodes, the total PageRank mass will not be conserved, since no key-value pairs will be emitted when a dangling node is encountered in the mappers. The proper treatment of PageRank mass “lost” at the dangling nodes is to redistribute it across all nodes in the graph evenly (cf. [22]). There are many ways to determine the missing PageRank mass. One simple approach is to instrument the algorithm in Figure 5.8 with counters: whenever the mapper processes a node with an empty adjacency list, it keeps track of the node's PageRank value in the counter. At the end of the iteration, we can access the counter to find out how much PageRank mass was lost at the dangling nodes.
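This counter-based accounting can be sketched as follows. The code is a hypothetical single-machine illustration, not the Hadoop counter API: the mapper adds a dangling node's PageRank to a counter instead of emitting any mass for it, and the driver reads the total after the job completes.

```python
# Illustrative sketch (not the Hadoop API): while mapping over nodes,
# accumulate the PageRank of any node with an empty adjacency list in a
# counter, so the driver can later learn how much mass went missing.

def map_with_counter(graph):
    emitted, counter = [], {"dangling_mass": 0.0}
    for nid, node in graph.items():
        emitted.append((nid, node))              # pass along graph structure
        if not node["adj"]:                      # dangling node: no outgoing edges
            counter["dangling_mass"] += node["pagerank"]
        else:
            share = node["pagerank"] / len(node["adj"])
            for m in node["adj"]:
                emitted.append((m, share))       # pass mass to neighbors
    return emitted, counter

g = {
    "a": {"adj": ["b"], "pagerank": 0.5},
    "b": {"adj": [], "pagerank": 0.5},           # a dangling node
}
_, counter = map_with_counter(g)
# the driver reads counter["dangling_mass"] after the job completes
```

The total in the counter is exactly the mass m that must be redistributed evenly across all nodes in the second job.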

(In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast as an integer.) Another approach is to reserve a special key for storing PageRank mass from dangling nodes: when the mapper encounters a dangling node, its PageRank mass is emitted with the special key, and the reducer must be modified to contain special logic for handling the missing PageRank mass. Yet another approach is to write out the missing PageRank mass as “side data” for each map task (using the in-mapper combining technique for aggregation); a final pass in the driver program is then needed to sum the mass across all map tasks. Either way, we arrive at the amount of PageRank mass lost at the dangling nodes—this must then be redistributed evenly across all nodes.

This redistribution process can be accomplished by mapping over all nodes again; at the same time, we can take into account the random jump factor. For each node, its current PageRank value p is updated to the final PageRank value p′ according to the following formula:

    p′ = α (1/|G|) + (1 − α) (m/|G| + p)    (5.2)

where m is the missing PageRank mass and |G| is the number of nodes in the entire graph. We add the PageRank mass from link traversal (p, computed from before) to the share of the lost PageRank mass that is distributed to each node (m/|G|). Finally, we take into account the random jump factor: with probability α the random surfer arrives via jumping, and with probability 1 − α the random surfer arrives via incoming links. Note that this MapReduce job requires no reducers. Putting everything together, one iteration of PageRank requires two MapReduce jobs: the first to distribute PageRank mass along graph edges, and the second to take care of dangling nodes and the random jump factor. At the end of each iteration, we end up with exactly the same data structure as the beginning, which is a requirement for the iterative algorithm to work. Also, the PageRank values of all nodes sum up to one, which ensures a valid probability distribution.

Typically, PageRank is iterated until convergence, i.e., when the PageRank values of nodes no longer change (within some tolerance, to take into account, for example, floating point precision errors). Therefore, at the end of each iteration, the PageRank driver program must check to see if convergence has been reached. Alternative stopping criteria include running a fixed number of iterations (useful if one wishes to bound algorithm running time) or stopping when the ranks of the PageRank values no longer change. The latter is useful for some applications that only care about comparing the PageRank of two arbitrary pages and do not need the actual PageRank values: rank stability is obtained faster than the actual convergence of values.
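Equation (5.2) itself is a one-liner; the map-only second job simply applies it to every node. A minimal illustrative sketch:

```python
# Second pass of each PageRank iteration: fold the missing mass m from
# dangling nodes and the random jump factor into every node's score,
# following Equation (5.2). Conceptually a map-only job (no reducers).

def adjust(p, m, G, alpha=0.15):
    return alpha * (1.0 / G) + (1 - alpha) * (m / G + p)

# Hypothetical example: 4 nodes, 0.2 units of mass lost at dangling nodes
scores = {"a": 0.4, "b": 0.2, "c": 0.15, "d": 0.05}
m = 1.0 - sum(scores.values())                 # the missing PageRank mass
adjusted = {n: adjust(p, m, len(scores)) for n, p in scores.items()}
```

Since the α terms sum to α and the (1 − α) terms sum to (1 − α)(m + Σp) = 1 − α, the adjusted values again form a valid probability distribution.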

In absolute terms, how many iterations are necessary for PageRank to converge? This is a difficult question to answer precisely since it depends on many factors, but generally, fewer than one might expect. In the original PageRank paper [117], convergence on a graph with 322 million edges was reached in 52 iterations (see also Bianchini et al. [22] for additional discussion). On today's web, the answer is not very meaningful due to the adversarial nature of web search as previously discussed—the web is full of spam and populated with sites that are actively trying to “game” PageRank and related hyperlink-based metrics. As a result, running PageRank in its unmodified form as presented here would yield unexpected and undesirable results. Of course, strategies developed by web search companies to combat link spam are proprietary (and closely-guarded secrets, for obvious reasons)—but undoubtedly these algorithmic modifications impact convergence behavior. A full discussion of the escalating “arms race” between search engine companies and those that seek to promote their sites is beyond the scope of this book; for the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great starting points into the literature.

5.4 ISSUES WITH GRAPH PROCESSING

The biggest difference between MapReduce graph algorithms and single-machine graph algorithms is that with the latter, it is usually possible to maintain global data structures in memory for fast, random access. For example, Dijkstra's algorithm uses a global priority queue that guides the expansion of nodes. This, of course, is not possible with MapReduce—the programming model does not provide any built-in mechanism for communicating global state. Since the most natural representation of large sparse graphs is with adjacency lists, communication can only occur from a node to the nodes it links to, or to a node from nodes linked to it. This restriction gives rise to the structure of many graph algorithms in MapReduce: local computation is performed on each node, the results of which are “passed” to its neighbors. With multiple iterations, convergence on the global graph is possible. For example, if one wishes to propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G′, where there would exist a link from node ni to nj if nj was reachable within two link traversals of ni in the original graph G (of course, it is perfectly reasonable to compute derived graph structures like this in a pre-processing step). The passing of partial results along a graph edge is accomplished by the shuffling and sorting provided by the MapReduce execution framework. The amount of intermediate data generated is on the order of the number of edges, which explains why all the algorithms we have discussed assume sparse graphs. For dense graphs, MapReduce running time would be dominated by copying intermediate data across the network, which in the worst case is O(n²) in the number of nodes in the graph. Since MapReduce clusters are

Finally. we can arrange the data such that nodes in the same component are handled by the same map task—thus maximizing opportunities for combiners to perform local aggregation. Combiners and the in-mapper combining pattern described in Section 3. But the graph may be so large that we can’t partition it except with MapReduce algorithms! Fortunately. For large graphs. For example. MapReduce algorithms are often impractical on large. This way. Another good example is to partition the web graph by the language of the page (since pages in one language tend to link mostly to other pages in that language) or by domain name (since inter-domain links are typically much denser than intra-domain links). used in the two algorithms. Unfortunately. the probability of any particular node is often so small that it underﬂows standard ﬂoating point representations. addition of probabilities is also necessary. in many cases there are simple solutions around this problem in the form of “cheap” partitioning heuristics based on reordering the data [106]. for example. When probabilities are stored as logs. the product of two values is simply their sum. when summing PageRank contribution for a node. This can be implemented with reasonable precision as follows: . in a social network. ISSUES WITH GRAPH PROCESSING 109 designed around commodity networks (e.1 can be used to decrease the running time of graph iterations. dense graphs.g. A very common solution to this problem is to represent probabilities using their logarithms. gigabit Ethernet). combiners are not very useful. respectively. there is a practical consideration to keep in mind when implementing graph algorithms that estimate probability distributions over nodes (such as PageRank). we might sort nodes representing users by zip code. It would be desirable to partition a large graph to facilitate eﬃcient processing by MapReduce. However. 
as opposed to by last name—based on the observation that friends tend to live close to each other. This implies that it would be desirable to partition large graphs into smaller components where there are many intracomponent links and fewer inter-component links.. are both associative and commutative. Sorting by an even more cohesive property such as school would be even better (if available): the probability of any two random students from the same school knowing each other is much higher than two random students from diﬀerent schools. It is straightforward to use combiners for both parallel breadth-ﬁrst search and PageRank since Min and sum. However. Resorting records using MapReduce is both easy to do and a relatively cheap operation—however. whether the eﬃciencies gained by this crude form of partitioning are worth the extra time taken in performing the resort is an empirical question that will depend on the actual graph structure and algorithm. this sometimes creates a chick-and-egg problem.4.5. combiners are only eﬀective to the extent that there are opportunities for partial aggregation—unless there are nodes pointed to by multiple nodes being processed by an individual map task.

    a ⊕ b = b + log(1 + e^(a−b))   if a < b
    a ⊕ b = a + log(1 + e^(b−a))   if a ≥ b

Furthermore, many math libraries include a log1p function which computes log(1 + x) with higher precision than the naïve implementation would have when x is very small (as is often the case when working with probabilities). Its use may further improve the accuracy of implementations that use log probabilities.

5.5 SUMMARY AND ADDITIONAL READINGS

This chapter covers graph algorithms in MapReduce, discussing in detail parallel breadth-first search and PageRank. Both are instances of a large class of iterative algorithms that share the following characteristics:

• The graph structure is represented with adjacency lists.

• Algorithms map over nodes and pass partial results to nodes on their adjacency lists. Partial results are aggregated for each node in the reducer.

• The graph structure itself is passed from the mapper to the reducer, such that the output is in the same form as the input.

• Algorithms are iterative and under the control of a non-MapReduce driver program, which checks for termination at the end of each iteration.

The MapReduce programming model does not provide a mechanism to maintain global data structures accessible and mutable by all the mappers and reducers (although maintaining globally-synchronized state may be possible with the assistance of other tools, e.g., a distributed database). One implication of this is that communication between pairs of arbitrary nodes is difficult to accomplish. Instead, information typically propagates along graph edges—which gives rise to the structure of algorithms discussed above.

Additional Readings. The ubiquity of large graphs translates into substantial interest in scalable graph algorithms using MapReduce in industry, academia, and beyond—much beyond what has been covered in this chapter. For additional material, we refer readers to the following: Kang et al. [80] presented an approach to estimating the diameter of large graphs using MapReduce and a library for graph mining [81]. Cohen [39] discussed a number of algorithms for processing undirected graphs, with social network analysis in mind. Rao and Yarowsky [128] described an implementation of label propagation, a standard algorithm for semi-supervised machine learning, on graphs derived from textual data. Schatz [132] tackled the problem of DNA sequence

alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to forget that parallel graph algorithms have been studied by computer scientists for several decades, particularly in the PRAM model [77, 60]. It is not clear, however, to what extent well-known PRAM algorithms translate naturally into the MapReduce framework.

g. However. Next. translations of text into multiple languages are created by authors wishing to reach an audience speaking diﬀerent languages. annotate. not because they are generating training data for a data-driven machine translation system).g. they often do so catastrophically. when these systems fail. In the last 20 years. called training data. but they are not distinguished in any way. where the “rules” for processing the input are inferred automatically from large corpora of examples. signiﬁcant quantities of training data may even exist for independent reasons (e. rule-based systems suﬀer from a number of serious problems. they tend to deal with the complexities found in real data more robustly than rule-based systems do.112 CHAPTER 6 EM Algorithms for Text Processing Until the end of the 1980s. The basic strategy of the data-driven approach is to start with a processing algorithm capable of capturing how any instance of the kinds of inputs (e. the syntactic structure of the sentence or a classiﬁcation of the email as spam). Furthermore. For some applications.. developing training data tends to be far less expensive than developing rules. usually in a deterministic way.. This rule-based approach can be appealing: a system’s behavior can generally be understood and predicted precisely. At this stage. and developing systems that can deal with inputs from diverse domains is very labor intensive. The learning process. when errors surface. they can be corrected by writing new rules or reﬁning old ones. which often involves iterative algorithms. an active area of research. sentences or emails) can relate to any instance of the kinds of outputs that the ﬁnal system should produce (e. text processing systems tended to rely on large numbers of manually written rules to analyze. instantiating the content of rule templates. 
a learning algorithm is applied which reﬁnes this process based on the training data—generally attempting to make the model perform as well as possible at predicting the examples in the training data.. They are brittle with respect to the natural variation found in language. Data-driven approaches have turned out to have several beneﬁts over rule-based approaches to system development. This is known as machine learning. unable to oﬀer even a “best guess” as to what the desired analysis of the input might be. and. the system can be thought of as having the potential to produce any output for any input. Since data-driven systems can be trained using examples of the kind that they will eventually be used to process. the rule-based approach has largely been abandoned in favor of more data-driven methods. Second. These advantages come at the cost of systems that often behave internally quite diﬀerently than a human- .g. or determining parameter settings for a given model. and transform text input. typically consists of activities like ranking rules.

In this chapter we will focus on one particularly simple training criterion for parameter estimation. correcting errors that the trained system makes can be quite challenging. is beyond the scope of this chapter. y ∈ X × Y or a conditional model Pr(y|x). which probabilistically relate inputs from an input set X (e. It should be kept in mind that the sets X and Y may still be countably inﬁnite. maximum likelihood estimation. which assigns a probability to every y ∈ Y. or.g.2 The parameters of a statistical model are the values used to compute the probability of some event described by the model. y) which assigns a probability to every pair x. which is the space of possible annotations or analyses that the system should predict. For example. One very common strategy is to select y according to the following criterion: y ∗ = arg max Pr(y|x) y∈Y this chapter. we might have Y = {Spam. the probabilities cannot be represented directly. given some x. NotSpam} and X be the set of all possible email messages. where the model space is inﬁnite. Inference in and learning of so-called nonparameteric models. X might be the set of Arabic sentences and Y the set of English sentences. which deﬁne a joint distribution over sequences of inputs and sequences of annotations. For a problem where X and Y are very small. we will consider discrete models only. For machine translation. called expectation maximization (EM).). sentences. but distinct challenges in statistical textprocessing.113 engineered system. to create a statistical spam detection system. but in this chapter we focus on statistical models. and must be computed algorithmically. As an example of such models. The second challenge is parameter estimation or learning. However. we introduce hidden Markov models (HMMs). using the model to select an annotation y.. 2 We restrict our discussion in this chapter to models with ﬁnite numbers of parameters and where the learning process refers to setting those parameters. 
and their presentation is simpler than models with continuous densities. As a result. 1 In . The ﬁrst is model selection. Data-driven information processing systems can be constructed using a variety of mathematical techniques. which says to select the parameters that make the training data most probable under the model. which are always observable. This entails selecting a representation of a joint or conditional distribution over the desired X and Y. They tend to be suﬃcient for text processing. to annotations from a set Y. etc. which involves the application of a optimization algorithm and training criterion to select the parameters of the model to optimize the model’s performance (with respect to the given training criterion) on the training data.1 There are three closely related. documents. This model may take the form of either a joint model Pr(x. which have an inﬁnite number of parameters and have become important statistical models for text processing in recent years. and one learning algorithm that attempts to meet this criterion. given an x ∈ X . The ﬁnal challenge for statistical modeling is the problem of decoding. one could imagine representing these probabilities in look-up tables. for something like email classiﬁcation or machine translation.
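The decoding criterion above can be made concrete with a small sketch. The messages and probabilities below are invented for illustration (they are not from the text); the two look-up tables are a conditional model Pr(y|x) and a matching joint model Pr(x, y), and the check confirms that maximizing Pr(x, y) gives the same decision as maximizing Pr(y|x), since the normalizer Σ_{y'} Pr(x, y') does not depend on y:

```python
# Decoding y* = arg max_y Pr(y|x) for a toy spam detector.
# All messages and probabilities below are invented for illustration.
labels = ["Spam", "NotSpam"]

# A conditional model Pr(y|x) stored as a look-up table.
cond = {
    "cheap pills now": {"Spam": 0.9, "NotSpam": 0.1},
    "lunch at noon?":  {"Spam": 0.2, "NotSpam": 0.8},
}

# A joint model Pr(x, y) stored as a look-up table.
joint = {
    ("cheap pills now", "Spam"): 0.27, ("cheap pills now", "NotSpam"): 0.03,
    ("lunch at noon?", "Spam"):  0.14, ("lunch at noon?", "NotSpam"):  0.56,
}

def decode_conditional(x):
    return max(labels, key=lambda y: cond[x][y])

def decode_joint(x):
    # arg max_y Pr(y|x) = arg max_y Pr(x,y) / sum_y' Pr(x,y'),
    # and the normalizer is constant in y, so Pr(x,y) alone suffices.
    return max(labels, key=lambda y: joint[(x, y)])

for x in cond:
    assert decode_conditional(x) == decode_joint(x)
```

Here the joint table was chosen so that Pr(x, y) / Σ_{y'} Pr(x, y') matches the conditional table exactly, which is why the two decoders agree.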

In a conditional (or direct) model, this is a straightforward search for the best y under the model. In a joint model, the search is also straightforward, on account of the definition of conditional probability:

y* = arg max_{y ∈ Y} Pr(y|x) = arg max_{y ∈ Y} Pr(x, y) / Σ_{y'} Pr(x, y')

The specific form that the search takes will depend on how the model is represented. Our focus in this chapter will primarily be on the second problem: learning parameters for models, but we will touch on the third problem as well.

Machine learning is often categorized as either supervised or unsupervised. Supervised learning of statistical models simply means that the model parameters are estimated from training data consisting of pairs of inputs and annotations, that is Z = ⟨⟨x1, y1⟩, ⟨x2, y2⟩, . . .⟩ where ⟨xi, yi⟩ ∈ X × Y and yi is the gold standard (i.e., correct) annotation of xi. While supervised models often attain quite good performance, they are often uneconomical to use, since the training data requires each object that is to be classified (to pick a specific task), xi, to be paired with its correct label, yi. In many cases, these gold standard training labels must be generated by a process of expert annotation, meaning that each xi must be manually labeled by a trained individual. Even when the annotation task is quite simple for people to carry out (e.g., in the case of spam detection), the number of potential examples that could be classified (representing a subset of X, which may of course be infinite in size) will far exceed the amount of data that can be annotated. As the annotation task becomes more complicated (e.g., when predicting more complex structures such as sequences of labels or when the annotation task requires specialized expertise), annotation becomes far more challenging.

Unsupervised learning, on the other hand, requires only that the training data consist of a representative collection of objects that should be annotated, that is Z = ⟨x1, x2, . . .⟩, but without any example annotations. While it may at first seem counterintuitive that meaningful annotations can be learned without any examples of the desired annotations being given, the learning criteria and model structure (which crucially define the space of possible annotations Y and the process by which annotations relate to observable inputs) make it possible to induce annotations by relying on regularities in the unclassified training instances. While a thorough discussion of unsupervised learning is beyond the scope of this book, we focus on a particular class of algorithms, expectation maximization (EM) algorithms, that can be used to learn the parameters of a joint model Pr(x, y) from incomplete data (i.e., data where some of the variables in the model cannot be observed; in the case of unsupervised learning, the yi's are unobserved). Expectation maximization algorithms fit naturally into the MapReduce paradigm, and are used to solve a number of problems of interest in text processing. Furthermore, these algorithms can be quite computationally expensive, since they generally require repeated evaluations of the training data. MapReduce therefore provides an opportunity not only to scale to larger amounts of data, but also to improve efficiency bottlenecks at scales where non-parallel solutions could be utilized.

This chapter is organized as follows. In Section 6.1, we describe maximum likelihood estimation for statistical models, show how this is generalized to models where not all variables are observable, and then introduce expectation maximization (EM). We describe hidden Markov models (HMMs), a very versatile class of models that uses EM for parameter estimation, in Section 6.2. Section 6.3 discusses how EM algorithms can be expressed in MapReduce, and then in Section 6.4 we look at a case study of word alignment for statistical machine translation. Section 6.5 examines similar algorithms that are appropriate for supervised learning tasks. This chapter concludes with a summary and pointers to additional readings.

6.1 EXPECTATION MAXIMIZATION

Expectation maximization (EM) algorithms [49] are a family of iterative optimization algorithms for learning probability distributions from incomplete data. They are extensively used in statistical natural language processing, where one seeks to infer latent linguistic structure from unannotated text. To name just a few applications, EM algorithms have been used to find part-of-speech sequences, constituency and dependency trees, alignments between texts in different languages, and alignments between acoustic signals and their transcriptions, as well as for numerous other clustering and structure discovery problems.

Expectation maximization generalizes the principle of maximum likelihood estimation to the case where the values of some variables are unobserved (specifically, those characterizing the latent structure that is sought).

6.1.1 MAXIMUM LIKELIHOOD ESTIMATION

Maximum likelihood estimation (MLE) is a criterion for fitting the parameters θ of a statistical model to some given data x. Specifically, it says to select the parameter settings θ* such that the likelihood of observing the training data given the model is maximized:

θ* = arg max_θ Pr(X = x; θ)    (6.1)

To illustrate, consider the simple marble game shown in Figure 6.1. In this game, a marble is released at the position indicated by the black dot, and it bounces down into one of the cups at the bottom of the board, being diverted to the left or right by the peg (indicated by a triangle) in the center. Our task is to construct a model that predicts which cup the ball will drop into. A "rule-based" approach might be to take

exact measurements of the board and construct a physical model that we can use to predict the behavior of the ball. Given sophisticated enough measurements, this could certainly lead to a very accurate model. However, the construction of this model would be quite time consuming and difficult.

A statistical approach, on the other hand, might be to assume that the behavior of the marble in this game can be modeled using a Bernoulli random variable Y with parameter p, which indicates the probability that the marble will go to the right when it hits the peg. That is, the value of the random variable indicates whether path 0 or 1 is taken. We also define a random variable X whose value is the label of the cup that the marble ends up in; note that X is deterministically related to Y, so an observation of X is equivalent to an observation of Y.

Figure 6.1: A simple marble game where a released marble takes one of two possible paths. This game can be modeled using a Bernoulli random variable with parameter p.

To estimate the parameter p of the statistical model of our game, we need some training data, so we drop 10 marbles into the game, which end up in cups x = ⟨b, b, b, a, b, b, b, b, b, a⟩. What is the maximum likelihood estimate of p given this data? By assuming that our samples are independent and identically distributed (i.i.d.), we can write the likelihood of our data as follows:3

Pr(x; p) = Π_{j=1}^{10} p^δ(xj, b) · (1 − p)^δ(xj, a) = p^8 · (1 − p)^2

Since log is a monotonically increasing function, maximizing log Pr(x; p) will give us the desired result. We can do this by differentiating with respect to p and finding where the resulting expression equals 0:

d log Pr(x; p) / dp = 0
d[8 · log p + 2 · log(1 − p)] / dp = 0
8/p − 2/(1 − p) = 0

Solving for p yields 0.8, which is the intuitive result. More generally, it is straightforward to show that in N trials where N0 marbles followed path 0 to cup a, and N1 marbles followed path 1 to cup b, the maximum likelihood estimate of p is N1/(N0 + N1).

3 In this equation, δ is the Kronecker delta function, which evaluates to 1 where its arguments are equal and 0 otherwise.

While this model only makes use of an approximation of the true physical process at work when the marble interacts with the game board, it is an empirical question whether the model works well enough in practice to be useful. While a Bernoulli trial is an extreme approximation of the physical process, if insufficient resources were invested in building a physical model, the approximation may perform better than the more complicated "rule-based" model. This sort of dynamic is found often in text processing problems: given enough data, astonishingly simple models can outperform complex knowledge-intensive models that attempt to simulate complicated processes.

6.1.2 A LATENT VARIABLE MARBLE GAME

To see where latent variables might come into play in modeling, consider a more complicated variant of our marble game shown in Figure 6.2. This version consists of three pegs that influence the marble's path, and the marble may end up in one of three cups. Note that both paths 1 and 2 lead to cup b.

To construct a statistical model of this game, we again assume that the behavior of a marble interacting with a peg can be modeled with a Bernoulli random variable. Since there are three pegs, we have three random variables with parameters θ = ⟨p0, p1, p2⟩, corresponding to the probabilities that the marble will go to the right at the top, left, and right pegs. We further define a random variable X taking on values from {a, b, c}, indicating what cup the marble ends in, and Y, taking on values from {0, 1, 2, 3}, indicating which path was taken. Note that the full joint distribution Pr(X = x, Y = y) is determined by θ.

How should the parameters θ be estimated? If it were possible to observe the paths taken by marbles as they were dropped into the game, it would be trivial to estimate the parameters for our model using the maximum likelihood estimator: we would simply need to count the number of times the marble bounced left or right at each peg. Additionally, if Nx counts the number of times a marble took path x in N trials, this is:

p0 = (N2 + N3) / N        p1 = N1 / (N0 + N1)        p2 = N3 / (N2 + N3)

Figure 6.2: A more complicated marble game where the released marble takes one of four possible paths. We assume that we can only observe which cup the marble ends up in, not the specific path taken.
However, we wish to consider the case where the paths taken are unobservable (imagine an opaque sheet covering the center of the game board), but where we can see what cup a marble ends in. In other words, we want to consider the case where we have partial data. This is exactly the problem encountered in unsupervised learning: there is a statistical model describing the relationship between two sets of variables (X's and Y's), and there is data available from just one of them. Such models are quite useful in text processing, where latent variables may describe latent linguistic structures of the observed variables, such as parse trees or part-of-speech tags, or alignment structures relating sets of observed variables (see Section 6.4).

6.1.3 MLE WITH LATENT VARIABLES

Formally, we consider the problem of estimating parameters for statistical models of the form Pr(X, Y; θ) which describe not only an observable variable X but a latent, or hidden, variable Y. In these models, since only the values of the random variable X are observable, we define our optimization criterion to be the maximization of the marginal likelihood, that is, summing over all settings of the latent variable Y, which takes on values from a set designated Y:4

4 For this description, we assume that the variables in our model take on discrete values. Not only does this simplify exposition, but discrete models are widely used in text processing.

Pr(X = x) = Σ_{y ∈ Y} Pr(X = x, Y = y; θ)

For a vector of training observations x = ⟨x1, x2, . . .⟩, we again assume that the samples are i.i.d.:

Pr(x; θ) = Π_{j=1}^{|x|} Σ_{y ∈ Y} Pr(X = xj, Y = y; θ)

Thus, the maximum (marginal) likelihood estimate of the model parameters θ* given a vector of i.i.d. observations x becomes:

θ* = arg max_θ Π_{j=1}^{|x|} Σ_{y ∈ Y} Pr(X = xj, Y = y; θ)
Unfortunately, in many cases, this maximum cannot be computed analytically, but the iterative hill-climbing approach of expectation maximization can be used instead.

6.1.4 EXPECTATION MAXIMIZATION

Expectation maximization (EM) is an iterative algorithm that finds a successive series of parameter estimates θ(0), θ(1), . . . that improve the marginal likelihood of the training data. That is, EM guarantees:

Π_{j=1}^{|x|} Σ_{y ∈ Y} Pr(X = xj, Y = y; θ(i+1)) ≥ Π_{j=1}^{|x|} Σ_{y ∈ Y} Pr(X = xj, Y = y; θ(i))

The algorithm starts with some initial set of parameters θ(0) and then updates them using two steps: expectation (E-step), which computes the posterior distribution over the latent variables given the observable data x and a set of parameters θ(i) ,5 and maximization (M-step), which computes new parameters θ(i+1) maximizing the expected log likelihood of the joint distribution with respect to the distribution computed in the E-step. The process then repeats with these new parameters. The algorithm terminates when the likelihood remains unchanged.6 In more detail, the steps are as follows:

5 The term 'expectation' is used since the values computed in terms of the posterior distribution Pr(y|x; θ(i)) that are required to solve the M-step have the form of an expectation (with respect to this distribution).

6 The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be the global maximum.


E-step. Compute the posterior probability of each possible hidden variable assignment y ∈ Y for each x ∈ X under the current parameter settings, weighted by the relative frequency with which x occurs in x. Call this q(X = x, Y = y; θ(i)) and note that it defines a joint probability distribution over X × Y, in that Σ_{(x,y) ∈ X×Y} q(x, y) = 1:

q(x, y; θ(i)) = f(x|x) · Pr(Y = y|X = x; θ(i)) = f(x|x) · Pr(x, y; θ(i)) / Σ_{y'} Pr(x, y'; θ(i))

M-step. Compute new parameter settings that maximize the expected log of the probability of the joint distribution under the q-distribution that was computed in the E-step:

θ(i+1) = arg max_{θ'} E_{q(X=x, Y=y; θ(i))} log Pr(X = x, Y = y; θ')
       = arg max_{θ'} Σ_{(x,y) ∈ X×Y} q(X = x, Y = y; θ(i)) · log Pr(X = x, Y = y; θ')

We omit the proof that the model with parameters θ(i+1) will have equal or greater marginal likelihood on the training data than the model with parameters θ(i), but this is provably true [78]. The process then repeats with these new parameters. Before continuing, we note that the effective application of expectation maximization requires that both the E-step and the M-step consist of tractable computations. Specifically, summing over the space of hidden variable assignments must not be intractable. Depending on the independence assumptions made in the model, this may be achieved through techniques such as dynamic programming. However, some models may require intractable computations.

6.1.5 AN EM EXAMPLE

Let's look at how to estimate the parameters of our latent variable marble game from Section 6.1.2 using EM. We assume training data x consisting of N = |x| observations of X, with Na, Nb, and Nc indicating the number of marbles ending in cups a, b, and c. We start with some parameters θ(0) = ⟨p0(0), p1(0), p2(0)⟩ that have been randomly initialized to values between 0 and 1.

E-step. We need to compute the distribution q(X = x, Y = y; θ(i)), as defined above. We first note that the relative frequency f(x|x) is:

f(x|x) = Nx / N
Next, we observe that Pr(Y = 0|X = a) = 1 and Pr(Y = 3|X = c) = 1, since cups a and c fully determine the value of the path variable Y. The posterior probabilities of paths 1 and 2 are only non-zero when X is b:

Pr(1|b; θ(i)) = (1 − p0(i)) p1(i) / [(1 − p0(i)) p1(i) + p0(i) (1 − p2(i))]

Pr(2|b; θ(i)) = p0(i) (1 − p2(i)) / [(1 − p0(i)) p1(i) + p0(i) (1 − p2(i))]

Except for the four cases just described, Pr(Y = y|X = x) is zero for all other values of x and y (regardless of the value of the parameters). The non-zero terms in the expectation are as follows:

x    y    q(X = x, Y = y; θ(i))         log Pr(X = x, Y = y; θ)
a    0    Na/N                          log(1 − p0) + log(1 − p1)
b    1    Nb/N · Pr(1|b; θ(i))          log(1 − p0) + log p1
b    2    Nb/N · Pr(2|b; θ(i))          log p0 + log(1 − p2)
c    3    Nc/N                          log p0 + log p2

M-step. We now need to maximize the expectation of log Pr(X, Y; θ) (which will be a function in terms of the three parameter variables) under the q-distribution we computed in the E-step. Multiplying across each row and adding from top to bottom yields the expectation we wish to maximize. Each parameter can be optimized independently using differentiation. The resulting optimal values are expressed in terms of the counts in x and θ(i):

p0 = (Pr(2|b; θ(i)) · Nb + Nc) / N
p1 = (Pr(1|b; θ(i)) · Nb) / (Na + Pr(1|b; θ(i)) · Nb)
p2 = Nc / (Pr(2|b; θ(i)) · Nb + Nc)

It is worth noting that the form of these expressions is quite similar to the fully observed maximum likelihood estimates; however, rather than depending on exact path counts, the statistics used are the expected path counts, given x and parameters θ(i). At this point, the values computed at the end of the M-step would serve as new parameters for another iteration of EM. Typically, EM requires several iterations to converge, and for most models it may not find a global optimum. Since EM only finds a locally optimal solution, the final parameter values depend on the values chosen for θ(0). However, the example we have presented here is quite simple, and the model converges to a global optimum after a single iteration.
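The E-step and M-step just described combine into one full EM iteration, sketched below. The observed cup counts and starting parameters are hypothetical; the final assertion checks the EM guarantee that the marginal likelihood of the training data never decreases:

```python
import math

# One EM iteration for the latent marble game, using the expected-count
# updates derived above. Counts and starting parameters are hypothetical.
def marginal_log_likelihood(Na, Nb, Nc, p0, p1, p2):
    pa = (1 - p0) * (1 - p1)                    # Pr(cup a), path 0
    pb = (1 - p0) * p1 + p0 * (1 - p2)          # Pr(cup b), paths 1 and 2
    pc = p0 * p2                                # Pr(cup c), path 3
    return Na * math.log(pa) + Nb * math.log(pb) + Nc * math.log(pc)

def em_step(Na, Nb, Nc, p0, p1, p2):
    # E-step: posterior over the two hidden paths that reach cup b.
    denom = (1 - p0) * p1 + p0 * (1 - p2)
    pr1 = (1 - p0) * p1 / denom                 # Pr(path 1 | cup b)
    pr2 = p0 * (1 - p2) / denom                 # Pr(path 2 | cup b)
    # M-step: plug expected path counts into the observed-case estimates.
    N = Na + Nb + Nc
    return ((pr2 * Nb + Nc) / N,
            pr1 * Nb / (Na + pr1 * Nb),
            Nc / (pr2 * Nb + Nc))

Na, Nb, Nc = 30, 50, 20
theta = (0.3, 0.6, 0.4)
before = marginal_log_likelihood(Na, Nb, Nc, *theta)
theta = em_step(Na, Nb, Nc, *theta)
after = marginal_log_likelihood(Na, Nb, Nc, *theta)
assert after >= before - 1e-9   # EM never decreases the marginal likelihood
```

Run once with these numbers, the updated parameters already reproduce the empirical cup frequencies (0.3, 0.5, 0.2), mirroring the single-iteration convergence noted above.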

6.2 HIDDEN MARKOV MODELS

To give a more substantial and useful example of models whose parameters may be estimated using EM, we turn to hidden Markov models (HMMs). HMMs are models of data that are ordered sequentially (temporally, from left to right, etc.), such as words in a sentence, base pairs in a gene, or letters in a word. These simple but powerful models have been used in applications as diverse as speech recognition [78], information extraction [139], gene finding [143], part of speech tagging [44], stock market forecasting [70], text retrieval [108], and word alignment of parallel (translated) texts [150] (more in Section 6.4).

In an HMM, the data being modeled is posited to have been generated from an underlying Markov process, which is a stochastic process consisting of a finite set of states where the probability of entering a state at time t + 1 depends only on the state of the process at time t [130]. Alternatively, one can view a Markov process as a probabilistic variant of a finite state machine, where transitions are taken probabilistically. As another point of comparison, the PageRank algorithm considered in the previous chapter (Section 5.3) can be understood as a Markov process: the probability of following any link on a particular page is independent of the path taken to reach that page. The states of this Markov process are, however, not directly observable (i.e., hidden). Instead, at each time step, an observable token (e.g., a word, base pair, or letter) is emitted according to a probability distribution conditioned on the identity of the state that the underlying process is in.

A hidden Markov model M is defined as a tuple ⟨S, O, θ⟩. S is a finite set of states, which generate symbols from a finite observation vocabulary O. Following convention, we assume that variables q, r, and s refer to states in S, and that o refers to symbols in the observation vocabulary O. The model is parameterized by the tuple θ = ⟨A, B, π⟩, consisting of an |S| × |S| matrix A of transition probabilities, where Aq(r) gives the probability of transitioning from state q to state r; an |S| × |O| matrix B of emission probabilities, where Bq(o) gives the probability that symbol o will be emitted from state q; and an |S|-dimensional vector π, where πq is the probability that the process starts in state q.7 These matrices may be dense, but for many applications sparse parameterizations are useful. We further stipulate that Aq(r) ≥ 0, Bq(o) ≥ 0, and πq ≥ 0 for all q, r, and o, as well as that:

Σ_{r ∈ S} Aq(r) = 1 ∀q        Σ_{o ∈ O} Bq(o) = 1 ∀q        Σ_{q ∈ S} πq = 1

A sequence of observations of length τ is generated as follows:

Step 0. Let t = 1 and select an initial state q according to the distribution π.

Step 1. An observation symbol from O is emitted according to the distribution Bq.

7 This is only one possible definition of an HMM, but it is one that is useful for many text processing problems. In alternative definitions, initial and final states may be handled differently, observations may be emitted during the transition between states, or continuous-valued observations may be emitted (for example, from a Gaussian distribution).

Step 2. A new q is drawn according to the distribution Aq.

Step 3. t is incremented, and if t ≤ τ, the process repeats from Step 1.

Since all events generated by this process are conditionally independent, the joint probability of a sequence of observations and the state sequence used to generate it is the product of the individual event probabilities.

Figure 6.3 shows a simple example of a hidden Markov model for part-of-speech tagging, which is the task of assigning to each word in an input sentence its grammatical category (one of the first steps in analyzing textual content). States S = {det, adj, nn, v} correspond to the parts of speech (determiner, adjective, noun, and verb), and observations O = {the, a, green, . . .} are a subset of English words. This example illustrates a key intuition behind many applications of HMMs: states correspond to equivalence classes or clusterings of observations, and a single observation type may be associated with several clusters (in this example, the word wash can be generated by an nn or a v state, since wash can be either a noun or a verb).

6.2.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:8

1. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the probability that M generated the data (summing over all possible state sequences Y)?

Pr(x) = Σ_{y ∈ Y} Pr(x, y; θ)

2. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the most likely sequence of states that generated the data?

y* = arg max_{y ∈ Y} Pr(x, y; θ)

3. Given a set of states S, an observation vocabulary O, and a series of i.i.d. observation sequences ⟨x1, x2, . . . , xℓ⟩, what are the parameters θ = ⟨A, B, π⟩ that maximize the likelihood of the training data?

θ* = arg max_θ Π_{i=1}^{ℓ} Σ_{y ∈ Y} Pr(xi, y; θ)

8 The organization of this section is based in part on ideas from Lawrence Rabiner's HMM tutorial [125].

3 green big old might 0.1 0.5 0.4 0.1 0.19 0.3 0.1 DET NN V Emission probabilities: DET ADJ NN V the a 0.1 0.1 0.5 0.7 0. In the example outputs.01 Example outputs: John might wash NN V V the big green person loves old plants DET ADJ ADJ NN V ADJ NN plants washes books books books NN V V NN V Figure 6.2 0.4 0.1 book plants people person John wash 0. .2 0.2 0. EM ALGORITHMS FOR TEXT PROCESSING Initial probabilities: DET ADJ NN V 0.7 0.5 0.3: An example HMM that relates part-of-speech tags to vocabulary items in an English-like language.2 0. Possible (probability > 0) transitions for the Markov process are shown graphically.1 0.3 0.1 0. the state sequences corresponding to the emissions are written beneath the emitted symbols.1 might wash washes loves reads books 0.2 0 0.3 0.1 0.7 0.2 0.4 0.1 Transition probabilities: DET ADJ NN DET ADJ NN V V ADJ 0 0 0 0.124 CHAPTER 6.3 0.2 0.

Using our definition of an HMM, the answers to the first two questions are in principle quite trivial to compute: by iterating over all state sequences Y, the probability that each generated x can be computed by looking up and multiplying the relevant probabilities in A, B, and π, and then summing the result or taking the maximum. The third question, fortunately, can be answered using EM.

6.2.2 THE FORWARD ALGORITHM

Given some observation sequence, for example x = ⟨John, might, wash⟩, Question 1 asks what is the probability that this sequence was generated by an HMM M = ⟨S, O, θ⟩. We assume a model M = ⟨S, O, θ⟩ as defined above; for the purposes of illustration, we assume that M is defined as shown in Figure 6.3. There are two ways to compute the probability of x having been generated by M. The first is to compute the sum over the joint probability of x and every possible labeling y ∈ {⟨det, det, det⟩, ⟨det, det, adj⟩, . . .}. Unfortunately, we will quickly run into trouble if we try to use this naive strategy, since there are |S|^τ distinct state sequences of length τ, making exhaustive enumeration computationally intractable, even with all the distributed computing power MapReduce makes available. The second, fortunately, is much more efficient: as we hinted at in the previous section, we can use dynamic programming algorithms to answer all of the above questions without summing over exponentially many sequences.

We can make use of what is known as the forward algorithm to compute the desired probability in polynomial time. This algorithm works by recursively computing the answer to a related question: what is the probability that the process is in state q at time t and has generated x1, x2, . . . , xt? Call this probability αt(q). Thus, αt(q) is a two-dimensional matrix (of size |x| × |S|), called a trellis. It is easy to see that the values of α1(q) can be computed as the product of two independent probabilities: the probability of starting in state q and the probability of state q generating x1:

α1(q) = πq · Bq(x1)

From this, it's not hard to see that the values of α2(r) for every r can be computed in terms of the |S| values in α1(·) and the observation x2:

α2(r) = Br(x2) · Σ_{q ∈ S} α1(q) · Aq(r)

This works because there are |S| different ways to get to state r at time t = 2: starting from state 1, 2, . . . , |S| and transitioning to state r. Furthermore, because the behavior

the na¨ approach to solving this problem is to enumerate all possible ıve labels and ﬁnd the one with the highest joint probability. we deﬁne bpt (q). having generated x1 .2. Summing over the ﬁnal column also yields 0. 6. Summing over all y . . might. xt . The lower panel shows the forward trellis.4 illustrates the two possibilities. The probability of the full sequence is the probability of being in time |x| and in any state.126 CHAPTER 6. the second question we might want to ask of M is: what is the most likely sequence of states that generated the observations? As with the previous question. x2 . Since we wish to be able to reconstruct the sequence of states. wash . . This is known as the Viterbi algorithm. The upper panel shows the na¨ ıve exhaustive approach. examining the chart of probabilities in the upper panel of Figure 6. with the forward algorithm. the same result. . . a more eﬃcient answer to Question 2 can be computed using the same intuition in the forward algorithm: determine the best state sequence for a short sequence and extend this to easily compute the best sequence for longer ones. to be the state used in this sequence at time t − 1. αt (r) can always be computed in terms of the |S| values in αt−1 (·) and the observation xt : αt (r) = Br (xt ) · αt−1 (q) · Aq (r) q∈S We have now shown how to compute the probability of being in any state q at any time t. Continuing with the example observation sequence x = John. xt . the “backpointer”. consisting of 4 × 3 cells. y ). . there are two ways of computing the probability that a sequence of observations x was generated by M: exhaustive enumeration with summing and the forward algorithm.00018. v is the most likely sequence of states under our example HMM. .3 THE VITERBI ALGORITHM Given an observation sequence x. the marginal probability of x is found to be 0. Figure 6. 
EM ALGORITHMS FOR TEXT PROCESSING of a Markov process is determined only by the state it is in at some time (not by how it got to that state). The base case for the recursion is as follows (the state index of −1 is used as a placeholder since there is no previous best state at time t = 1): . to be the most probable sequence of states ending in state q at time t and generating observations x1 . so the answer to Question 1 can be computed simply by summing over α values at time |x| for all states: Pr(x.00018. . x2 . θ) = q∈S α|x| (q) In summary. However. We deﬁne γt (q). enumerating all 43 possible labels y of x and computing their joint probability Pr(x. the Viterbi probability.4 shows that y∗ = nn. v. .
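The forward recursion can be sketched directly in code. The two-state HMM below is a toy model with invented parameters (it is not the part-of-speech HMM of Figure 6.3); the functions are a direct transcription of the equations above:

```python
# Toy two-state HMM; all parameters are invented for illustration.
STATES = ("HOT", "COLD")
PI = {"HOT": 0.6, "COLD": 0.4}                         # pi_q
A = {"HOT": {"HOT": 0.7, "COLD": 0.3},                 # A_q(r)
     "COLD": {"HOT": 0.4, "COLD": 0.6}}
B = {"HOT": {"a": 0.5, "b": 0.5},                      # B_q(o)
     "COLD": {"a": 0.2, "b": 0.8}}

def forward(x):
    """Compute alpha[t][q] = Pr(x_1 .. x_t, y_t = q)."""
    # Base case: alpha_1(q) = pi_q * B_q(x_1).
    alpha = [{q: PI[q] * B[q][x[0]] for q in STATES}]
    # Recursion: alpha_t(r) = B_r(x_t) * sum_q alpha_{t-1}(q) * A_q(r).
    for t in range(1, len(x)):
        alpha.append({r: B[r][x[t]] * sum(alpha[t - 1][q] * A[q][r]
                                          for q in STATES)
                      for r in STATES})
    return alpha

def sequence_probability(x):
    """Pr(x; theta): sum over the final column of the forward trellis."""
    return sum(forward(x)[-1].values())
```

Summing the joint probabilities of all |S|^τ explicit state sequences yields the same value, but at exponential cost; the trellis computes it in time proportional to |S|² · τ.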

Figure 6.4: Computing the probability of the observation sequence ⟨John, might, wash⟩ under the HMM given in Figure 6.3, by explicitly summing over all possible sequence labels (upper panel) and using the forward algorithm (lower panel).

6.2.3 THE VITERBI ALGORITHM

Given an observation sequence x, the second question we might want to ask of M is: what is the most likely sequence of states that generated the observations? As with the previous question, the naïve approach to solving this problem is to enumerate all possible labels and find the one with the highest joint probability. For example, examining the chart of probabilities in the upper panel of Figure 6.4 shows that y* = ⟨nn, v, v⟩ is the most likely sequence of states for the observation sequence ⟨John, might, wash⟩ under our example HMM.

However, a more efficient answer to Question 2 can be computed using the same intuition used in the forward algorithm: determine the best state sequence for a short sequence, and extend this to easily compute the best sequence for longer ones. This is known as the Viterbi algorithm. We define γt(q), the Viterbi probability, to be the probability of the most probable sequence of states ending in state q at time t and generating observations x1, x2, . . . , xt. Since we wish to be able to reconstruct the sequence of states, we also define bpt(q), the "backpointer", to be the state used in this sequence at time t − 1. The base case for the recursion is as follows (the state index of −1 is used as a placeholder, since there is no previous best state at time t = 1):

γ1(q) = πq · Bq(x1)
bp1(q) = −1

The recursion is similar to that of the forward algorithm, except that rather than summing over previous states, the maximum value over all possible trajectories into state r at time t is computed. Note that the backpointer simply records the index of the originating state; a separate computation is not necessary.

γt(r) = maxq∈S γt−1(q) · Aq(r) · Br(xt)
bpt(r) = arg maxq∈S γt−1(q) · Aq(r) · Br(xt)

To compute the best sequence of states, y*, the state with the highest probability path at time |x| is selected, and then the backpointers are followed, recursively, to construct the rest of the sequence:

y*|x| = arg maxq∈S γ|x|(q)
y*t−1 = bpt(y*t)

Figure 6.5 illustrates a Viterbi trellis, consisting of 4 × 3 cells, including the backpointers used to compute the most likely state sequence.

Figure 6.5: Computing the most likely state sequence that generated ⟨John, might, wash⟩ under the HMM given in Figure 6.3. The most likely state sequence is highlighted in bold, and could be recovered programmatically by following backpointers from the maximal probability cell in the last column to the first column (thicker arrows).
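In code, the Viterbi recursion differs from the forward recursion only in the max/argmax and the backpointer bookkeeping. The two-state HMM below is a toy model with invented parameters, used purely for illustration:

```python
# Toy two-state HMM; all parameters are invented for illustration.
STATES = ("HOT", "COLD")
PI = {"HOT": 0.6, "COLD": 0.4}
A = {"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}}
B = {"HOT": {"a": 0.5, "b": 0.5}, "COLD": {"a": 0.2, "b": 0.8}}

def viterbi(x):
    """Return (most likely state sequence for x, its joint probability)."""
    # Base case: gamma_1(q) = pi_q * B_q(x_1); no previous best state yet.
    gamma = [{q: PI[q] * B[q][x[0]] for q in STATES}]
    bp = [{q: None for q in STATES}]
    for t in range(1, len(x)):
        gamma.append({})
        bp.append({})
        for r in STATES:
            # Max (rather than sum) over trajectories entering r at time t.
            best_q = max(STATES, key=lambda q: gamma[t - 1][q] * A[q][r])
            gamma[t][r] = gamma[t - 1][best_q] * A[best_q][r] * B[r][x[t]]
            bp[t][r] = best_q  # backpointer: just the originating state
    # Select the best final state, then follow backpointers to reconstruct.
    last = max(STATES, key=lambda q: gamma[-1][q])
    path = [last]
    for t in range(len(x) - 1, 0, -1):
        path.append(bp[t][path[-1]])
    path.reverse()
    return path, gamma[-1][last]
```

Exhaustively scoring all |S|^τ label sequences gives the same argmax; the trellis finds it in time proportional to |S|² · τ.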

6.2. HIDDEN MARKOV MODELS



Figure 6.6: A “fully observable” HMM training instance. The output sequence is at the top of the ﬁgure, and the corresponding states and transitions are shown in the trellis below.

6.2.4 PARAMETER ESTIMATION FOR HMMS

We now turn to Question 3: given a set of states S and an observation vocabulary O, what are the parameters θ* = ⟨A, B, π⟩ that maximize the likelihood of a set of training examples, x1, x2, . . . , xℓ?9 Since our model is constructed in terms of variables whose values we cannot observe (the state sequence) in the training data, we may train it to optimize the marginal likelihood (summing over all state sequences) of x using EM. Deriving the EM update equations requires only the application of the techniques presented earlier in this chapter and some differential calculus. However, since the formalism is cumbersome, we will skip a detailed derivation; readers interested in more information can find it in the relevant citations [78, 125].

In order to make the update equations as intuitive as possible, consider a fully observable HMM, that is, one where both the emissions and the state sequence are observable in all training instances. In this case, a training instance can be depicted as shown in Figure 6.6. When this is the case, such as when we have a corpus of sentences in which all words have already been tagged with their parts of speech, the maximum likelihood estimate for the parameters can be computed in terms of the counts of the number of times the process transitions from state q to state r in all training instances, T(q → r); the number of times that state q emits symbol o, O(q ↑ o); and the number of times the process starts in state q, I(q). In this example, the process starts in state nn; there is one nn → v transition and one v → v transition. The nn state emits John in the first time step, and the v state emits might and wash in the second and third time steps, respectively. We also define N(q) to be the number of times the process enters state q. The maximum likelihood estimates of the parameters in the fully observable case are:

9 Since an HMM models sequences, its training data consists of a collection of example sequences.


CHAPTER 6. EM ALGORITHMS FOR TEXT PROCESSING

πq = I(q) / Σr I(r)        Aq(r) = T(q → r) / N(q) = T(q → r) / Σr′ T(q → r′)        Bq(o) = O(q ↑ o) / N(q) = O(q ↑ o) / Σo′ O(q ↑ o′)        (6.2)

For example, to compute the emission parameters from state nn, we simply need to keep track of the number of times the process is in state nn and what symbol it generates at each of these times. Transition probabilities are computed similarly: to compute, for example, the distribution Adet(·), that is, the probabilities of transitioning away from state det, we count the number of times the process is in state det, and keep track of what state the process transitioned into at the next time step. This counting and normalizing can be accomplished using the exact same counting and relative frequency algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter estimation is not a new algorithm at all, but one we have seen before.

How should the model parameters be estimated when the state sequence is not provided? It turns out that the update equations have a satisfying form: the optimal parameter values for iteration i + 1 are expressed in terms of the expectations of the counts referenced in the fully observed case, according to the posterior distribution over the latent variables given the observations x and the parameters θ(i):

πq = E[I(q)] / Σr E[I(r)]        Aq(r) = E[T(q → r)] / E[N(q)]        Bq(o) = E[O(q ↑ o)] / E[N(q)]        (6.3)
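In code, the fully observable case (Equation 6.2) amounts to counting and normalizing. A minimal sketch, using the training instance depicted in Figure 6.6 (John/nn might/v wash/v) as sample data:

```python
from collections import defaultdict

def mle_estimate(tagged_sequences):
    """MLE for a fully observable HMM from (state, symbol) sequences.

    Counts I(q) starts, T(q -> r) transitions, and O(q, o) emissions,
    then normalizes each set of counts into a distribution (Eq. 6.2).
    """
    I = defaultdict(float)
    T = defaultdict(lambda: defaultdict(float))
    O = defaultdict(lambda: defaultdict(float))
    for seq in tagged_sequences:
        I[seq[0][0]] += 1                        # initial state count
        for q, o in seq:
            O[q][o] += 1                         # emission count O(q, o)
        for (q, _), (r, _) in zip(seq, seq[1:]):
            T[q][r] += 1                         # transition count T(q -> r)
    pi = {q: c / sum(I.values()) for q, c in I.items()}
    A = {q: {r: c / sum(row.values()) for r, c in row.items()}
         for q, row in T.items()}
    B = {q: {o: c / sum(row.values()) for o, c in row.items()}
         for q, row in O.items()}
    return pi, A, B
```

For the single instance above, this reproduces the counts described in the text: one start in nn, one nn → v and one v → v transition, one emission of John from nn, and emissions of might and wash from v.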

Because of the independence assumptions made in the HMM, the update equations consist of 2 · |S| + 1 independent optimization problems, just as was the case with the 'observable' HMM. Solving for the initial state distribution, π, is one problem; there are |S| problems solving for the transition distributions Aq(·) from each state q, and |S| problems solving for the emission distributions Bq(·) from each state q. Furthermore, we note that the following must hold:

E[N(q)] = Σr∈S E[T(q → r)] = Σo∈O E[O(q ↑ o)]

As a result, the optimization problems (i.e., Equations 6.2) require completely independent sets of statistics, which we will utilize later to facilitate efficient parallelization in MapReduce.

How can the expectations in Equation 6.3 be understood? In the fully observed training case, between every time step, there is exactly one transition taken, and the source and destination states are observable. By progressing through the Markov chain, we can let each transition count as '1', and we can accumulate the total number of times each kind of transition was taken (by each kind, we simply mean the number of times that one state follows another; for example, the number of times nn follows det). These statistics can then in turn be used to compute the MLE for an 'observable' HMM, as described above. However, when the transition sequence is not observable (as is most often the case), we can instead imagine that at each time step, every possible transition (there are |S|^2 of them, and typically |S| is quite small) is taken, with a particular probability. The probability used is the posterior probability of the transition, given the model and an observation sequence (we describe how to compute this value below). By summing over all the time steps in the training data, and using this probability as the 'count' (rather than '1' as in the observable case), we compute the expected count of the number of times a particular transition was taken, given the training sequence. Furthermore, since the training instances are statistically independent, the value of the expectations can be computed by processing each training instance independently and summing the results. Similarly, for the necessary emission counts (the number of times each symbol in O was generated by each state in S), we assume that any state could have generated the observation. We must therefore compute the probability of being in every state at each time point, which is then the size of the emission 'count'. By summing over all time steps we compute the expected count of the number of times that a particular state generated a particular symbol. These two sets of expectations, which are written formally here, are sufficient to execute the M-step.

E[O(q ↑ o)] = Σi=1..|x| Pr(yi = q | x; θ) · δ(xi, o)        (6.4)

E[T(q → r)] = Σi=1..|x|−1 Pr(yi = q, yi+1 = r | x; θ)        (6.5)
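Equations 6.4 and 6.5 translate directly into an accumulation loop. In the sketch below, `state_posterior` and `transition_posterior` are hypothetical callables standing in for the posterior probabilities; how to compute them from forward and backward probabilities is described in the passage that follows. Indices run from 0 rather than 1, following Python convention:

```python
def expected_counts(x, states, state_posterior, transition_posterior):
    """Accumulate E[O(q, o)] (Eq. 6.4) and E[T(q -> r)] (Eq. 6.5) for one x.

    Hypothetical helper callables (assumptions, not defined here):
      state_posterior(i, q)         -> Pr(y_i = q | x; theta)
      transition_posterior(i, q, r) -> Pr(y_i = q, y_{i+1} = r | x; theta)
    """
    e_O, e_T = {}, {}
    for i, o in enumerate(x):
        for q in states:
            # delta(x_i, o): only the symbol actually observed at time i
            # receives the posterior 'count'.
            e_O[(q, o)] = e_O.get((q, o), 0.0) + state_posterior(i, q)
    for i in range(len(x) - 1):        # one fewer transition than emissions
        for q in states:
            for r in states:
                e_T[(q, r)] = (e_T.get((q, r), 0.0)
                               + transition_posterior(i, q, r))
    return e_O, e_T
```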

Posterior probabilities. The expectations necessary for computing the M-step in HMM training are sums of probabilities that a particular transition is taken, given an observation sequence, and that some state emits some observation symbol, given an observation sequence. These are referred to as posterior probabilities, indicating that they are the probability of some event whose distribution we have a prior belief about, after additional evidence has been taken into consideration (here, the model parameters characterize our prior beliefs, and the observation sequence is the evidence). Both posterior probabilities can be computed by combining the forward probabilities, αt(·), which give the probability of reaching some state at time t, by any path, and generating the observations x1, x2, . . . , xt, with backward probabilities, βt(·), which give the probability of starting in some state at time t and generating the rest of the sequence xt+1, xt+2, . . . , x|x|, using any sequence of states to do so. The algorithm for computing the backward probabilities is given a bit later. Once the forward and backward probabilities have been computed, the state transition posterior probabilities and the emission posterior probabilities can be written as follows:

Pr(yi = q | x; θ) = αi(q) · βi(q)        (6.6)
Pr(yi = q, yi+1 = r | x; θ) = αi(q) · Aq(r) · Br(xi+1) · βi+1(r)        (6.7)

Equation 6.6 is the probability of being in state q at time i, given x, and the correctness of the expression should be clear from the definitions of forward and backward probabilities. The intuition for Equation 6.7, the probability of taking a particular transition at a particular time, is also not complicated: it is the product of four conditionally independent probabilities: the probability of getting to state q at time i (having generated the first part of the sequence), the probability of taking transition q → r (which is specified in the parameters, θ), the probability of generating observation xi+1 from state r (also specified in θ), and the probability of generating the rest of the sequence, along any path. A visualization of the quantities used in computing this probability is shown in Figure 6.7. In this illustration, we assume an HMM with S = {s1, s2, s3} and O = {a, b, c}. The shaded area on the left corresponds to the forward probability α2(s2), and the shaded area on the right corresponds to the backward probability β3(s2), given the observation sequence a b b c b.

Figure 6.7: Using forward and backward probabilities to compute the posterior probability of the dashed transition.

The backward algorithm. Like the forward and Viterbi algorithms introduced above to answer Questions 1 and 2, the backward algorithm uses dynamic programming to incrementally compute βt(·). Its base case starts at time |x|, and is defined as follows:

a few practical considerations: HMMs have a non-convex likelihood surface (meaning that it has the equivalent of many hills and valleys in the number of dimensions corresponding to the number of parameters in the model). EM will never change this in future iterations. . EM training is only guaranteed to ﬁnd a local maximum. only a handful of iterations are necessary. Additionally. For each x in the training data. and the quality of the learned model may vary considerably.3. which yields θ(i+1) . depending on the initial parameters that are used. and the number of times each state transitions into each other state. π.6. using the parameter settings of the current iteration. completing the E-step. θ(i) . HIDDEN MARKOV MODELS 133 β|x| (q) = 1 To understand the intuition for this base case. hundreds may be required. the backward algorithm is computed from right to left and makes no reference to the start probabilities. The forward and backward probabilities are in turn used to compute the expected number of times the underlying Markov process enters into each state. 6. if some parameter is assigned a probability of 0 (either as an initial value or during one of the M-step parameter updates). For some applications. Strategies for optimal selection of initial parameters depend on the phenomena being modeled. O. The process then repeats from the E-step using the new parameters.5 FORWARD-BACKWARD TRAINING: SUMMARY In the preceding section. The M-step involves normalizing the expected counts computed in the E-step using the calculations in Equation 6. whereas for others. the forward and backward probabilities are computed using the algorithms given above (for this reason. As a result. and since there is nothing left to generate after time |x|. The number of iterations required for convergence depends on the quality of the initial parameters. 
we have shown how to compute all quantities needed to ﬁnd the parameter settings θ(i+1) using EM training with a hidden Markov model M = S. keep in mind that since the backward probabilities βt (·) are the probability of generating the remainder of the sequence after time t (as well as being in some state). These expectations are summed over all training instances. this training algorithm is often referred to as the forward-backward algorithm).2. and the complexity of the model. Finally. the number of times each state generates each output symbol type. The recursion is deﬁned as follows: βt (q) = r∈S βt+1 (r) · Aq (r) · Br (xt+1 ) Unlike the forward and Viterbi algorithms. the probability must be 1. θ(i) . To recap: each training instance x is processed independently.2.

134 CHAPTER 6. EM ALGORITHMS FOR TEXT PROCESSING

This can be useful, since it provides a way of constraining the structure of the Markov model; however, one must be aware of this behavior.

Another pitfall to avoid when implementing HMMs is arithmetic underflow. HMMs typically define a massive number of sequences, and so the probability of any one of them is often vanishingly small; so small, in fact, that these probabilities often underflow standard floating point representations. A very common solution to this problem is to represent probabilities using their logarithms. Note that expected counts do not typically have this problem and can be represented using normal floating point numbers. See Section 5.4 for additional discussion on working with log probabilities.

6.3 EM IN MAPREDUCE

Expectation maximization algorithms fit quite naturally into the MapReduce programming model. Although the model being optimized determines the details of the required computations, MapReduce implementations of EM algorithms share a number of characteristics:

• Each iteration of EM is one MapReduce job.

• A controlling process (i.e., driver program) spawns the MapReduce jobs and keeps track of the number of iterations and convergence criteria.

• Model parameters θ(i), which are static for the duration of the MapReduce job, are loaded by each mapper from HDFS or other data provider (e.g., a distributed key-value store).

• Mappers map over independent training instances, computing partial latent variable posteriors (or summary statistics, such as expected counts).

• Reducers sum together the required training statistics and solve one or more of the M-step optimization problems.

• Combiners, which sum together the training statistics, are often quite effective at reducing the amount of data that must be written to disk.

The degree of parallelization that can be attained depends on the statistical independence assumed in the model and in the derived quantities required to solve the optimization problems in the M-step. Since parameters are estimated from a collection of samples that are assumed to be i.i.d., the E-step can generally be parallelized effectively, because every training instance can be processed independently of the others. In the limit, in fact, each independent training instance could be processed by a separate mapper!10

10 Although the wisdom of doing this is questionable, given that the startup costs associated with individual map tasks in Hadoop may be considerable.
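The underflow pitfall and the log-probability fix can be seen directly in a short Python sketch (the probability values are invented for illustration):

```python
import math

# Multiplying many small probabilities underflows IEEE doubles;
# summing their logarithms stays comfortably representable.
probs = [1e-5] * 100            # e.g., 100 low-probability steps of one sequence
product = 1.0
for p in probs:
    product *= p                # 10^-500 is below the smallest double: becomes 0.0

log_prob = sum(math.log(p) for p in probs)   # about -1151.29, no underflow

def log_add(a, b):
    """Stable log(exp(a) + exp(b)), for summing probabilities stored as logs."""
    if a < b:
        a, b = b, a
    return a + math.log1p(math.exp(b - a))
```

A function like `log_add` is what a log-space implementation of the forward and backward recursions uses in place of ordinary addition, while log-space multiplication is simply addition of logs.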

6.3.1 HMM TRAINING IN MAPREDUCE

As we would expect, the training of hidden Markov models parallelizes well in MapReduce. The process can be summarized as follows: in each iteration, mappers process training instances, emitting expected event counts computed using the forward-backward algorithm introduced in Section 6.2. Reducers aggregate the expected counts, completing the E-step, and then generate parameter estimates for the next iteration using the update equations given earlier in Section 6.2.

This parallelization strategy is effective for several reasons. First, the majority of the computational effort in HMM training is the running of the forward and backward algorithms. Since there is no limit on the number of mappers that may be run, the full computational resources of a cluster may be brought to bear to solve this problem. Second, the M-step of an HMM training iteration with |S| states in the model consists of 2 · |S| + 1 independent optimization problems that require non-overlapping sets of statistics; this may be exploited with as many as 2 · |S| + 1 reducers running in parallel. While each optimization problem is computationally trivial, being able to reduce in parallel helps avoid the data bottleneck that would limit performance if only a single reducer is used.

In general, reducers must aggregate the statistics necessary to solve the optimization problems as required by the model. The degree to which these may be solved independently depends on the structure of the model, and this constrains the number of reducers that may be used. Fortunately, many common models (such as HMMs) require solving several independent optimization problems in the M-step; in this situation, a number of reducers may be run in parallel. Still, it is possible that in the worst case, the M-step optimization problem will not decompose into independent subproblems, making it necessary to use a single reducer.

The quantities that are required to solve the M-step optimization problems are quite similar to the relative frequency estimation example discussed in Section 3.3; in this case, rather than counts of observed events, we aggregate expected counts of events. As a result of the similarity, we can employ the stripes representation for aggregating sets of related values, as described in Section 3.2. (A pairs approach that requires less memory at the cost of slower performance is also feasible.)

The pseudo-code for the HMM training mapper is given in Figure 6.8. The input consists of key-value pairs with a unique id as the key and a training instance (e.g., a sentence) as the value. For each training instance, 2 · |S| + 1 stripes are emitted with unique keys, and every training instance emits the same set of keys. Each unique key corresponds to one of the independent optimization problems that will be solved in the M-step.

class Mapper
    method Initialize(integer iteration)
        ⟨S, O⟩ ← ReadModel
        θ ← ⟨A, B, π⟩ ← ReadModelParams(iteration)
    method Map(sample id, sequence x)
        α ← Forward(x, θ)                                  ▷ cf. Section 6.2.2
        β ← Backward(x, θ)                                 ▷ cf. Section 6.2.4
        I ← new AssociativeArray                           ▷ Initial state expectations
        for all q ∈ S do                                   ▷ Loop over states
            I{q} ← α1 (q) · β1 (q)
        O ← new AssociativeArray of AssociativeArray       ▷ Emissions
        for t = 1 to |x| do                                ▷ Loop over observations
            for all q ∈ S do                               ▷ Loop over states
                O{q}{xt } ← O{q}{xt } + αt (q) · βt (q)
        T ← new AssociativeArray of AssociativeArray       ▷ Transitions
        for t = 1 to |x| − 1 do                            ▷ Loop over observations
            for all q ∈ S do                               ▷ Loop over states
                for all r ∈ S do                           ▷ Loop over states
                    T {q}{r} ← T {q}{r} + αt (q) · Aq (r) · Br (xt+1 ) · βt+1 (r)
        Emit(string 'initial ', stripe I)
        for all q ∈ S do                                   ▷ Loop over states
            Emit(string 'emit from ' + q, stripe O{q})
            Emit(string 'transit from ' + q, stripe T {q})

Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers map over training instances (i.e., sequences of observations xi ) and generate the expected counts of initial states, emissions, and transitions taken to generate the sequence.
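One way to realize the Map method of Figure 6.8 in Python is sketched below. The toy two-state parameters are invented stand-ins for what ReadModel and ReadModelParams would load, and emission is simulated by returning a list of key-stripe pairs; this is an illustrative sketch, not the book's implementation.

```python
# Per-instance mapper computation: run forward-backward, then build the
# initial-state, emission, and transition stripes for one training instance.

def forward(x, S, pi, A, B):
    alpha = [{q: pi[q] * B[q][x[0]] for q in S}]
    for t in range(1, len(x)):
        alpha.append({r: sum(alpha[t - 1][q] * A[q][r] for q in S) * B[r][x[t]]
                      for r in S})
    return alpha

def backward(x, S, A, B):
    beta = [None] * len(x)
    beta[-1] = {q: 1.0 for q in S}
    for t in range(len(x) - 2, -1, -1):
        beta[t] = {q: sum(beta[t + 1][r] * A[q][r] * B[r][x[t + 1]] for r in S)
                   for q in S}
    return beta

def map_instance(x, S, pi, A, B):
    alpha, beta = forward(x, S, pi, A, B), backward(x, S, A, B)
    I = {q: alpha[0][q] * beta[0][q] for q in S}           # initial-state stripe
    O = {q: {} for q in S}                                 # emission stripes
    for t, sym in enumerate(x):
        for q in S:
            O[q][sym] = O[q].get(sym, 0.0) + alpha[t][q] * beta[t][q]
    T = {q: {r: 0.0 for r in S} for q in S}                # transition stripes
    for t in range(len(x) - 1):
        for q in S:
            for r in S:
                T[q][r] += alpha[t][q] * A[q][r] * B[r][x[t + 1]] * beta[t + 1][r]
    emitted = [('initial', I)]
    for q in S:
        emitted.append(('emit from ' + q, O[q]))
        emitted.append(('transit from ' + q, T[q]))
    return emitted    # 2|S| + 1 key-stripe pairs, one per M-step subproblem

S = ['H', 'C']
pi = {'H': 0.5, 'C': 0.5}
A = {'H': {'H': 0.7, 'C': 0.3}, 'C': {'H': 0.4, 'C': 0.6}}
B = {'H': {'1': 0.1, '2': 0.4, '3': 0.5}, 'C': {'1': 0.6, '2': 0.3, '3': 0.1}}
stripes = map_instance(['3', '1', '2'], S, pi, A, B)
```

Note that every instance emits the same 2 · |S| + 1 keys, so the work of the M-step partitions cleanly across reducers keyed on those strings.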

class Combiner
    method Combine(string t, stripes [C1 , C2 , . . .])
        Cf ← new AssociativeArray
        for all stripe C ∈ stripes [C1 , C2 , . . .] do
            Sum(Cf , C)
        Emit(string t, stripe Cf )

class Reducer
    method Reduce(string t, stripes [C1 , C2 , . . .])
        Cf ← new AssociativeArray
        for all stripe C ∈ stripes [C1 , C2 , . . .] do
            Sum(Cf , C)
        z←0
        for all ⟨k, v⟩ ∈ Cf do
            z ←z+v
        Pf ← new AssociativeArray                          ▷ Final parameters vector
        for all ⟨k, v⟩ ∈ Cf do
            Pf {k} ← v/z
        Emit(string t, stripe Pf )

Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM. The HMMs considered in this book are fully parameterized by multinomial distributions, so reducers do not require special logic to handle different types of model parameters (since they are all of the same type).
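A corresponding Python sketch of the reducer follows; the key and the count values are made up for illustration.

```python
# Sum the stripes received for one key element-wise, then normalize the totals
# into a multinomial parameter vector, as in the Figure 6.9 reducer.

def sum_stripes(stripes):
    combined = {}
    for stripe in stripes:
        for k, v in stripe.items():
            combined[k] = combined.get(k, 0.0) + v
    return combined

def reduce_key(key, stripes):
    cf = sum_stripes(stripes)                    # aggregated expected counts
    z = sum(cf.values())
    pf = {k: v / z for k, v in cf.items()}       # relative frequencies
    return key, pf

# e.g., transition expected counts for one state, from three different mappers:
key, params = reduce_key('transit from H',
                         [{'H': 0.6, 'C': 0.2}, {'H': 1.1}, {'C': 0.1}])
```

Because `sum_stripes` is associative and commutative, the same summation can serve as the optional combiner, leaving only the final normalization to the reducer.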

The outputs of the mapper are:

1. the probabilities that the unobserved Markov process begins in each state q, with a unique key designating that the values are initial state counts;

2. the expected number of times that state q generated each emission symbol o (the set of emission symbols included will be just those found in each training instance x), with a key indicating that the associated value is a set of emission counts from state q; and

3. the expected number of times state q transitions to each state r, with a key indicating that the associated value is a set of transition counts from state q.

Note that the outputs will be spread across 2 · |S| + 1 keys.

HMM training reducer. The reducer for one iteration of HMM training, shown together with an optional combiner in Figure 6.9, aggregates the count collections associated with each key by summing them. When the values for each key have been completely aggregated, the associative array contains all of the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameter settings for the following iteration are computed simply by taking the relative frequency of each event with respect to its expected count at the current iteration. The new computed parameters are emitted from the reducer and written to HDFS, representing initial state probabilities π, transition probabilities Aq for each state q, and emission probabilities Bq for each state q.

6.4 CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION

To illustrate the real-world benefits of expectation maximization algorithms using MapReduce, we turn to the problem of word alignment, an important task in statistical machine translation that is typically solved using models whose parameters are learned with EM. We begin by giving a brief introduction to statistical machine translation and the phrase-based translation approach; for a more comprehensive introduction, refer to [85, 97].

Fully-automated translation has been studied since the earliest days of electronic computers. After successes with code-breaking during World War II, there was considerable optimism that translation of human languages would be another soluble problem. In the early years, work on translation was dominated by manual attempts to encode linguistic knowledge into computers, another instance of the 'rule-based' approach we described in the introduction to this chapter. These early attempts failed to live up to the admittedly optimistic expectations, and for a number of years the idea of fully automated translation was viewed with skepticism. Not only was constructing a translation system labor intensive, but translation systems had to be developed independently for each language pair, meaning that improvements in a Russian-English translation system could not, for the most part, be leveraged to improve a French-English system.

After languishing for a number of years, the field was reinvigorated in the late 1980s when researchers at IBM pioneered the development of statistical machine translation (SMT), which took a data-driven approach to solving the problem of machine translation, attempting to improve the quality of translation while also reducing the cost of developing systems [29]. The core idea of SMT is to equip the computer to learn how to translate using example translations which are produced for other purposes, modeling the process as a statistical process with some parameters θ relating strings in a source language (typically denoted as f) to strings in a target language (typically denoted as e):

    e∗ = arg max_e Pr(e|f, θ)

With the statistical approach, translation systems can be developed cheaply and quickly for any language pair, as long as there is sufficient training data available. Furthermore, improvements in learning algorithms and statistical modeling, rather than being specific to individual language pairs, can yield benefits in many translation pairs at once. Thus, SMT, like many other topics we are considering in this book, is an attempt to leverage the vast quantities of textual data that are available to solve problems that would otherwise require considerable manual effort to encode specialized knowledge. Since the advent of statistical approaches to translation, the field has grown tremendously and numerous statistical models of translation have been developed, with many incorporating quite specialized knowledge about the behavior of natural language as biases in their learning algorithms.

6.4.1 STATISTICAL PHRASE-BASED TRANSLATION

One approach to statistical translation that is simple yet powerful is called phrase-based translation [86]. We provide a rough outline of the process since it is representative of most state-of-the-art statistical translation systems, such as the one used inside Google Translate.11 Phrase-based translation works by learning how strings of words, called phrases, translate between languages.12 Example phrase pairs for Spanish-English translation might include ⟨los estudiantes, the students⟩, ⟨los estudiantes, some students⟩, and ⟨soy, i am⟩. From a few hundred thousand sentences of example translations, many millions of such phrase pairs may be automatically learned. The starting point is typically a parallel corpus (also called bitext), which contains pairs of sentences in two languages that are translations of each other.

11 http://translate.google.com
12 Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any linguistic theory.

. is a string in the target language. . By using these word alignments as a skeleton. phrases can be extracted from the sentence that is likely to preserve the meaning relationships represented by the word alignment. By the chain rule of probability. . .8) k=1 Due to the extremely large number of parameters involved in estimating such a model directly. each phrase pair is associated with a number of scores which. That is. . . it is customary to make the Markov assumption.11) n The probabilities used in computing Pr(w1 ) based on an n-gram language model are generally estimated from a monolingual corpus of target language text. The parallel corpus is then annotated with word alignments. Pr(wn |w1 ) = k−1 Pr(wk |w1 ) (6. are used to compute the phrase translation probability. wn . The translation model attempts to preserve the meaning of the source language during the translation process. and text generated by the United Nations in many diﬀerent languages. After phrase extraction. The phrase-based translation process is summarized in Figure 6. taken together. which we show below how to compute with EM. While an explanation of the process is not necessary here. EM ALGORITHMS FOR TEXT PROCESSING are frequently generated as the byproduct of an organization’s eﬀort to disseminate information in multiple languages.140 CHAPTER 6. we mention it as a motivation for learning word alignments. A language model gives the probability that a string of words w = n w1 . We brieﬂy note that although EM could be utilized to learn the phrase translation probabilities. w2 . phrase-based translation depends on a language model. which indicate which words in one language correspond to words in the other. an n-gram language model is equivalent to k−1 a (n − 1)th-order Markov model.10. Thus.10) (6.9) (6. The collection of phrase pairs and their scores are referred to as the translation model. 
Parallel corpora are frequently generated as the byproduct of an organization's effort to disseminate information in multiple languages: for example, proceedings of the Canadian Parliament in French and English, and text generated by the United Nations in many different languages. The parallel corpus is then annotated with word alignments, which indicate which words in one language correspond to words in the other. By using these word alignments as a skeleton, phrases can be extracted from the sentence that are likely to preserve the meaning relationships represented by the word alignment. While an explanation of the process is not necessary here, we mention it as a motivation for learning word alignments, which we show below how to compute with EM. After phrase extraction, each phrase pair is associated with a number of scores which, taken together, are used to compute the phrase translation probability, a conditional probability that reflects how likely the source phrase translates into the target phrase. We briefly note that although EM could be utilized to learn the phrase translation probabilities, this is not typically done in practice since the maximum likelihood solution turns out to be quite bad for this problem. The collection of phrase pairs and their scores are referred to as the translation model. The phrase-based translation process is summarized in Figure 6.10.

In addition to the translation model, phrase-based translation depends on a language model, which gives the probability of a string in the target language. The translation model attempts to preserve the meaning of the source language during the translation process, while the language model ensures that the output is fluent and grammatical in the target language. A language model gives the probability that a string of words w = ⟨w1, w2, . . . , wn⟩, written as w1^n for short, is a string in the target language. By the chain rule of probability, we get:

    Pr(w1^n) = Pr(w1) Pr(w2|w1) Pr(w3|w1^2) · · · Pr(wn|w1^{n−1}) = Π_{k=1}^{n} Pr(wk|w1^{k−1})    (6.8)

Due to the extremely large number of parameters involved in estimating such a model directly, it is customary to make the Markov assumption, that the sequence histories only depend on prior local context. That is, an n-gram language model is equivalent to an (n−1)th-order Markov model. Thus, we can approximate Pr(wk|w1^{k−1}) as follows:

    bigrams:  Pr(wk|w1^{k−1}) ≈ Pr(wk|wk−1)              (6.9)
    trigrams: Pr(wk|w1^{k−1}) ≈ Pr(wk|wk−2 wk−1)         (6.10)
    n-grams:  Pr(wk|w1^{k−1}) ≈ Pr(wk|w_{k−n+1}^{k−1})   (6.11)

The probabilities used in computing Pr(w1^n) based on an n-gram language model are generally estimated from a monolingual corpus of target language text.
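As a concrete illustration of the bigram case of the Markov assumption, the following Python sketch estimates bigram probabilities by relative frequency from a tiny invented corpus and scores a string under Equation 6.9. Note how a single unseen bigram zeroes out the whole string, which is precisely the sparseness problem that the smoothing techniques discussed below address.

```python
from collections import Counter

# Estimate Pr(w_k | w_{k-1}) by relative frequency from a tiny, made-up corpus.
corpus = [['<s>', 'the', 'students', 'sat', '</s>'],
          ['<s>', 'the', 'students', 'slept', '</s>'],
          ['<s>', 'the', 'witch', 'sat', '</s>']]

bigram_counts = Counter()
unigram_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, cur):
    # Maximum likelihood estimate; unseen bigrams get probability 0.
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

def sentence_prob(sent):
    p = 1.0
    for prev, cur in zip(sent, sent[1:]):
        p *= bigram_prob(prev, cur)
    return p

p = sentence_prob(['<s>', 'the', 'witch', 'slept', '</s>'])
```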

[Figure: architecture diagram showing Training Data (parallel sentences such as "i saw the small table" / "vi la mesa pequeña") flowing through Word Alignment and Phrase Extraction into phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table); target-language text (e.g., "he sat at the table", "the service was good") feeding the Language Model; and the Translation Model and Language Model both feeding the Decoder, which maps the foreign input sentence "maria no daba una bofetada a la bruja verde" to the English output sentence "mary did not slap the green witch".]

Figure 6.10: The standard phrase-based machine translation architecture. The translation model is constructed with phrases extracted from a word-aligned parallel corpus. The language model is estimated from a monolingual corpus. Both serve as input to the decoder, which performs the actual translation.

[Figure: matrix of candidate phrase translations for the source sentence "Maria no dio una bofetada a la bruja verde", including options such as Mary, did not, no, give, a slap, slap, to the, by, the witch, green witch.]

Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde by a phrase-based model. The best possible translation path is indicated with a dashed line.

Since only target language text is necessary (without any additional annotation), language modeling has been well served by large-data approaches that take advantage of the vast quantities of text available on the web.

To translate an input sentence f, the phrase-based decoder creates a matrix of all translation possibilities of all substrings in the input string, as the example in Figure 6.11 illustrates. A sequence of phrase pairs is selected such that each word in f is translated exactly once.13 Being able to vary the order of the phrases used is necessary, since languages may express the same ideas using different word orders. The decoder seeks to find the translation that maximizes the product of the translation probabilities of the phrases used and the language model probability of the resulting string in the target language. Because the phrase translation probabilities are independent of each other, and because of the Markov assumption made in the language model, this may be done efficiently using dynamic programming. For a detailed introduction to phrase-based decoding, we refer the reader to a recent textbook by Koehn [85].

6.4.2 BRIEF DIGRESSION: LANGUAGE MODELING WITH MAPREDUCE

Statistical machine translation provides the context for a brief digression on distributed parameter estimation for language models using MapReduce. Even after making the Markov assumption, training n-gram language models still requires estimating an enormous number of parameters: potentially V^n, where V is the number of words in the vocabulary. For higher-order models (e.g., 5-grams) used in real-world applications, the number of parameters can easily exceed the number of words from which to estimate those parameters; in fact, most n-grams will never be observed in a corpus, no matter how large. To cope with this sparseness, researchers have developed a number of smoothing techniques [102], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. For many applications, a state-of-the-art approach is known as Kneser-Ney smoothing [35].

In 2007, Brants et al. [25] reported experimental results that answered an interesting question: given the availability of large corpora (i.e., the web), could a simpler smoothing strategy, applied to more text, beat Kneser-Ney in a machine translation task? It should come as no surprise that the answer is yes. Brants et al. introduced a technique known as "stupid backoff" that was exceedingly simple and so naive that the resulting model didn't even define a valid probability distribution (it assigned arbitrary scores as opposed to probabilities). The simplicity, however, afforded an extremely scalable implementation in MapReduce. With smaller corpora, stupid backoff didn't work as well as Kneser-Ney in generating accurate and fluent translations. However, as the amount of data increased, the gap between stupid backoff and Kneser-Ney narrowed. We briefly touched upon this work in Chapter 1; it provides another example illustrating the effectiveness of data-driven approaches in general.

13 The phrases may not necessarily be selected in a strict left-to-right order.
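Stupid backoff can be sketched in a few lines. The scoring rule follows the published description: use the relative frequency of the longest matching n-gram, and otherwise back off to the next-shorter history with a fixed multiplicative penalty (conventionally 0.4). The counts below are invented stand-ins for web-scale counts, and the result is a score, not a probability.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff penalty used by stupid backoff

# Invented trigram/bigram/unigram counts for illustration.
counts = Counter({
    ('the', 'green', 'witch'): 2, ('green', 'witch'): 3, ('witch',): 5,
    ('the', 'green'): 4, ('green',): 6, ('the',): 40,
})
total_words = 100  # size of the (hypothetical) corpus

def score(word, context):
    """Stupid backoff score S(word | context); context is a tuple of prior words."""
    if not context:
        return counts[(word,)] / total_words
    ngram = context + (word,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]
    return ALPHA * score(word, context[1:])   # back off to a shorter history
```

Because the scores for different histories need not sum to one, no expensive normalization or discount estimation is required, which is what makes the approach so easy to scale in MapReduce.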

With sufficient data, the gap eventually disappeared entirely. Furthermore, with stupid backoff it was possible to train a language model on more data than was feasible with Kneser-Ney smoothing. Applying this language model to a machine translation task yielded better results than a (smaller) language model trained with Kneser-Ney smoothing. The role of the language model in statistical machine translation is to select fluent, grammatical translations from a large hypothesis space: the more training data a language model has access to, the better its description of relevant language phenomena and hence its ability to select good translations. Once again, large data triumphs! For more information about estimating language models using MapReduce, we refer the reader to a forthcoming book from Morgan & Claypool [26].

6.4.3 WORD ALIGNMENT

Word alignments, which are necessary for building phrase-based translation models (as well as many other more sophisticated translation models), can be learned automatically using EM. In this section, we introduce a popular alignment model based on HMMs.

In the statistical model of word alignment considered here, the observable variables are the words in the source and target sentences (conventionally written using the variables f and e, respectively), and their alignment is a latent variable. To make this model tractable, we assume that words are translated independently of one another.14 While this independence assumption is problematic in many ways, it results in a simple model structure that admits efficient inference yet produces reasonable alignments. Alignment models that make this assumption generate a string e in the target language by selecting words in the source language according to a lexical translation distribution, which means that the model's parameters include the probability of any word in the source language translating to any word in the target language. The indices of the words in f used to generate each word in e are stored in an alignment variable, a, with |a| = |e|; that is, the variable ai indicates the source word position of the ith target word generated. Using these assumptions, the probability of an alignment and translation can be written as follows:

    Pr(e, a|f) = Pr(a|f, e) × Π_{i=1}^{|e|} Pr(ei|fai)
                 [alignment probability]  [lexical probability]

Since we have parallel corpora consisting of only ⟨f, e⟩ pairs, we can learn the parameters for this model using EM, treating a as a latent variable.

14 In the original presentation of statistical lexical translation models, a special null word is added to the source sentences, which permits words to be inserted 'out of nowhere'. Since this does not change any of the important details of training, we omit it from our presentation for simplicity.

However, to combat data sparsity in the alignment probability, we must make some further simplifying assumptions. By letting the probability of an alignment depend only on the position of the previous aligned word, we capture a valuable insight (namely, that words that are nearby in the source language will tend to be nearby in the target language), and our model acquires the structure of an HMM [150]:

    Pr(e, a|f) = Π_{i=1}^{|e|} Pr(ai|ai−1) × Pr(ei|fai)
                 [transition probability]  [emission probability]

This model can be trained using the forward-backward algorithm described in the previous section, summing over all settings of a, and the best alignment for a sentence pair can be found using the Viterbi algorithm.

To properly initialize this HMM, it is conventional to further simplify the alignment probability model. The favored simplification is to assert that all alignments are uniformly probable:

    Pr(e, a|f) = 1/|f|^{|e|} × Π_{i=1}^{|e|} Pr(ei|fai)

This model is known as IBM Model 1. It is attractive for initialization because it is convex everywhere, and therefore EM will learn the same solution regardless of initialization. Furthermore, while the forward-backward algorithm could be used to compute the expected counts necessary for training this model (by setting Aq(r) to be a constant value for all q and r), the uniformity assumption means that the expected emission counts can be estimated in time O(|e| · |f|), rather than the time O(|e| · |f|^2) required by the forward-backward algorithm. Finally, one trains Model 1 and uses this simpler model to learn initial lexical translation (emission) parameters for the HMM.

6.4.4 EXPERIMENTS

How well does a MapReduce word aligner for statistical machine translation perform? We describe previously-published results [54] that compared a Java-based Hadoop implementation against a highly optimized word aligner called Giza++ [112], which was written in C++ and designed to run efficiently on a single core. We compared the training time of Giza++ and our aligner on a Hadoop cluster with 19 slave nodes, each with two single-core processors and two disks (38 cores total).

Figure 6.12 shows the performance of Giza++ in terms of the running time of a single EM iteration for both Model 1 and the HMM alignment model as a function of the number of training pairs. Both axes in the figure are on a log scale, but the ticks on the y-axis are aligned with 'meaningful' time intervals rather than exact orders of magnitude.
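An EM iteration for Model 1 can be sketched compactly, because the uniform-alignment assumption lets the posterior over alignments factor independently per target word. The two toy sentence pairs below are invented; `t[f][e]` plays the role of the lexical translation probability Pr(e|f).

```python
from collections import defaultdict

# EM for IBM Model 1 on a toy parallel corpus (invented data).
corpus = [(['la', 'casa'], ['the', 'house']),
          (['la', 'mesa'], ['the', 'table'])]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {f: {e: 1.0 / len(e_vocab) for e in e_vocab} for f in f_vocab}  # uniform init

def em_iteration(t):
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:                            # E-step: posterior per target word
            z = sum(t[f][e] for f in fs)        # normalizer over source positions
            for f in fs:
                p = t[f][e] / z                 # expected count of f aligning to e
                count[f][e] += p
                total[f] += p
    # M-step: relative frequencies of the expected counts
    return {f: {e: count[f][e] / total[f] for e in count[f]} for f in count}

for _ in range(30):
    t = em_iteration(t)
```

Because the objective is convex, the run converges to the same lexicon regardless of the starting point; after a few dozen iterations, `t` concentrates on the intuitively correct word translations.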

There are three things to note. First, the running time scales linearly with the size of the training data. Second, the HMM is a constant factor slower than Model 1. Third, the alignment process is quite slow as the size of the training data grows: at one million sentences, a single iteration takes over three hours to complete! Five iterations are generally necessary to train the models, which means that full training takes the better part of a day.

In Figure 6.13 we plot the running time of our MapReduce implementation running on the 38-core cluster described above. For reference, we plot points indicating what 1/38 of the running time of the Giza++ iterations would be at each data size, which gives a rough indication of what an 'ideal' parallelization could achieve, assuming that there was no overhead associated with distributing computation across these machines. Three things may be observed in the results. First, Model 1, which we observe is light on computation, has almost the same running time as the HMM alignment algorithm. We conclude that the overhead associated with distributing and aggregating data is significant compared to the Model 1 computations. Second, as the amount of data increases, the relative cost of the overhead associated with distributing data, marshaling and aggregating counts, decreases: at large data sizes, the HMM alignment iterations begin to approach optimal runtime efficiency, although Model 1 does not approach the theoretical performance of an ideal parallelization. Third, a comparison with Figure 6.12 indicates that, at least once a certain training data size is reached, there is a significant advantage to using the distributed implementation: the MapReduce implementation is substantially faster than the single-core implementation.

Although these results do confound several variables (Java vs. C++ performance, memory usage patterns), it is reasonable to expect that the confounds would tend to make the single-core system's performance appear relatively better than the MapReduce system (which is, of course, the opposite pattern from what we actually observe). These results show that when computation is distributed over a cluster of many machines, even an unsophisticated implementation of the HMM aligner could compete favorably with a highly optimized single-core system whose performance is well-known to many people in the MT research community.

Why are these results important? Perhaps the most significant reason is that the quantity of parallel data that is available to train statistical machine translation models is ever increasing, and as is the case with so many problems we have encountered, more data leads to improvements in translation quality [54].

[Figure: log-log plot of average iteration latency (3 s to 3 hrs) versus corpus size (10,000 to 1e+06 sentences), with series for Model 1 and HMM.]

Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM training iterations at various corpus sizes.

[Figure: log-log plot of iteration time versus corpus size, with series Optimal Model 1 (Giza/38), Optimal HMM (Giza/38), MapReduce Model 1 (38 M/R), and MapReduce HMM (38 M/R).]

Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training iterations at various corpus sizes. For reference, 1/38 running times of the Giza++ models are shown.

Recently a corpus of one billion words of French-English data was mined automatically from the web and released publicly [33].15 Single-core solutions to model construction simply cannot keep pace with the amount of translated data that is constantly being produced. Fortunately, several independent researchers have shown that existing modeling algorithms can be expressed naturally and effectively using MapReduce, which means that we can take advantage of this data. Furthermore, the results presented here show that even at data sizes that may be tractable on single machines, significant performance improvements are attainable using MapReduce implementations. This improvement reduces experimental turnaround times, which allows researchers to more quickly explore the solution space, which will, we hope, lead to rapid new developments in statistical machine translation.

For the reader interested in statistical machine translation, there is an open source Hadoop-based MapReduce implementation of a training pipeline for phrase-based translation that includes word alignment, phrase extraction, and phrase scoring [56].

6.5 EM-LIKE ALGORITHMS

This chapter has focused on expectation maximization algorithms and their implementation in the MapReduce programming framework. These important algorithms are indispensable for learning models with latent structure from unannotated data. We now explore some related learning algorithms that are similar to EM but can be used to solve more general problems, and discuss their implementation. In this section we focus on gradient-based optimization, which refers to a class of techniques used to optimize any objective function, provided it is differentiable with respect to the parameters being optimized. Obviously, these algorithms are only applicable in cases where a useful objective exists, is differentiable, and its derivatives can be efficiently evaluated. Fortunately, this is the case for many important problems of interest in text processing.

Gradient-based optimization is particularly useful in the learning of maximum entropy (maxent) models [110] and conditional random fields (CRF) [87] that have an exponential form and are trained to maximize conditional likelihood. In addition to being widely used supervised classification models in text processing (meaning that during training, both the data and their annotations must be observable), these models have gradients that take the form of expectations, and so some of the previously-introduced techniques are also applicable for optimizing them; they can be implemented quite naturally in MapReduce.

6.5.1 GRADIENT-BASED OPTIMIZATION AND LOG-LINEAR MODELS

Gradient-based optimization refers to a class of iterative optimization algorithms that use the derivatives of a function to find the parameters that yield a minimal or maximal value of that function. For the purposes of this discussion, we will give examples in terms of minimizing functions.

15 http://www.statmt.org/wmt10/translation-task.html
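Before developing the method formally, the basic idea can be previewed with a toy minimization in Python; the objective, starting point, and learning rate are all invented for illustration.

```python
# Minimize F(theta) = (theta - 3)^2 with the basic gradient update
# theta <- theta - eta * F'(theta), where F'(theta) = 2 * (theta - 3).

def grad_F(theta):
    return 2.0 * (theta - 3.0)

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad_F(theta)
# theta is now very close to the minimizer 3.0, where the gradient vanishes
```

The same loop structure, with the gradient evaluation distributed over the training data, is what the MapReduce formulation below parallelizes.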

Assume that we have some real-valued function F(θ) where θ is a k-dimensional vector and that F is differentiable with respect to θ. Its gradient is defined as:

∇F(θ) = ⟨∂F/∂θ1 (θ), ∂F/∂θ2 (θ), . . . , ∂F/∂θk (θ)⟩

The gradient has two crucial properties that are exploited in gradient-based optimization. First, the gradient ∇F is a vector field that points in the direction of the greatest increase of F and whose magnitude indicates the rate of increase. Second, if θ* is a (local) minimum of F, then the following is true:

∇F(θ*) = 0

An extremely simple gradient-based minimization algorithm produces a series of parameter estimates θ(1), θ(2), . . . , by starting with some initial parameter settings θ(1) and updating parameters through successive iterations according to the following rule:

θ(i+1) = θ(i) − η(i) ∇F(θ(i))    (6.12)

The parameter η(i) > 0 is a learning rate which indicates how quickly the algorithm moves along the gradient during iteration i. Provided this value is small enough that F decreases, this strategy will find a local minimum of F. However, while simple, this update strategy may converge slowly, and proper selection of η is non-trivial. More sophisticated algorithms perform updates that are informed by approximations of the second derivative, which are estimated by successive evaluations of ∇F(θ), and can converge much more rapidly [96].
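To make the update rule concrete, here is a minimal sketch (our illustration, not code from the book) that minimizes a simple differentiable function F(θ) = Σj (θj − cj)², whose unique minimum is at θ = c, using a fixed learning rate η:

```python
# Gradient-based minimization following the update rule
#   theta(i+1) = theta(i) - eta * grad F(theta(i))
# Here F(theta) = sum_j (theta_j - c_j)^2, which is minimized at theta = c.

def grad_F(theta, c):
    # dF/dtheta_j = 2 * (theta_j - c_j)
    return [2.0 * (t - cj) for t, cj in zip(theta, c)]

def minimize(theta, c, eta=0.1, iterations=100):
    for _ in range(iterations):
        g = grad_F(theta, c)
        # Move against the gradient, scaled by the learning rate eta
        theta = [t - eta * gj for t, gj in zip(theta, g)]
    return theta

theta = minimize([0.0, 0.0], c=[3.0, -1.0])
# theta is now very close to the minimizer [3.0, -1.0]
```

For this quadratic, the per-coordinate error shrinks by a factor of |1 − 2η| per iteration, so any 0 < η < 1 converges while η > 1 diverges, echoing the point that proper selection of η is non-trivial.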

Gradient-based optimization in MapReduce. Gradient-based optimization algorithms can often be implemented effectively in MapReduce. Like EM, where the structure of the model determines the specifics of the realization, the details of the function being optimized determine how it should best be implemented, and not every function optimization problem will be a good fit for MapReduce. Nevertheless, MapReduce implementations of gradient-based optimization tend to have the following characteristics:

• Each optimization iteration is one MapReduce job.

• The objective should decompose linearly across training instances. This implies that the gradient also decomposes linearly, and therefore mappers can process input data in parallel. The values they emit are pairs ⟨F(θ), ∇F(θ)⟩, which are linear components of the objective and gradient.

• Evaluation of the function and its gradient is often computationally expensive because they require processing lots of data; this makes parallelization with MapReduce worthwhile.

• Reducer(s) sum the component objective/gradient pairs, compute the total objective and gradient, run the optimization algorithm, and emit θ(i+1).

• Whether more than one reducer can run in parallel depends on the specific optimization algorithm being used. Some, like the trivial algorithm of Equation 6.12, treat the dimensions of θ independently, whereas many are sensitive to global properties of ∇F(θ). In the latter case, parallelization across multiple reducers is non-trivial.

• Many optimization algorithms are stateful and must persist their state between optimization iterations. This may either be emitted together with θ(i+1) or written to the distributed file system as a side effect of the reducer. Such external side effects must be handled carefully; refer to Section 2.2 for a discussion.

Parameter learning for log-linear models. Gradient-based optimization techniques can be quite effectively used to learn the parameters of probabilistic models with a log-linear parameterization [100]. Log-linear models are particularly useful for supervised learning (unlike the unsupervised models learned with EM), where an annotation y ∈ Y is available for every x ∈ X in the training data. Such models are used extensively in text processing applications. While a comprehensive introduction to these models is beyond the scope of this book, we include a brief summary of their form and their training using gradient-based optimization. In this case, it is possible to directly model the conditional distribution of label given input:

Pr(y|x; θ) = exp(Σi θi · Hi(x, y)) / Σy′ exp(Σi θi · Hi(x, y′))

In this expression, the Hi are real-valued functions sensitive to features of the input and labeling. The parameters of the model are selected so as to minimize the negative conditional log likelihood of a set of training instances ⟨x1, y1⟩, ⟨x2, y2⟩, . . . , which we assume to be i.i.d.:

F(θ) = Σ⟨x,y⟩ −log Pr(y|x; θ)    (6.13)

θ* = arg minθ F(θ)    (6.14)
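A single iteration of the job structure in the list above can be simulated in-process. The sketch below is illustrative (the function names are ours, and a binary log-linear model, i.e., logistic regression, stands in for the objective of Equation 6.13): each "mapper" call processes one training instance and emits a partial ⟨objective, gradient⟩ pair, and a single "reducer" sums the components and applies the update of Equation 6.12.

```python
import math

# One gradient-descent iteration structured as a MapReduce job, simulated
# in-process, with a logistic-regression (binary log-linear) objective.

def mapper(theta, instance):
    # Process one training instance; emit a partial <objective, gradient> pair.
    x, y = instance                      # feature vector, label in {0, 1}
    score = sum(t * xi for t, xi in zip(theta, x))
    p = 1.0 / (1.0 + math.exp(-score))   # Pr(y = 1 | x; theta)
    loss = -math.log(p if y == 1 else 1.0 - p)
    grad = [(p - y) * xi for xi in x]    # d(-log Pr)/d theta_i
    return loss, grad

def reducer(partials, theta, eta):
    # Sum the linear components of objective and gradient, then run one
    # step of the optimization algorithm and emit theta(i+1).
    total_loss = sum(loss for loss, _ in partials)
    total_grad = [sum(g[i] for _, g in partials) for i in range(len(theta))]
    new_theta = [t - eta * g for t, g in zip(theta, total_grad)]
    return total_loss, new_theta

data = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 3.0], 1)]
theta = [0.0, 0.0]
for _ in range(50):                      # each pass = one MapReduce job
    partials = [mapper(theta, inst) for inst in data]
    loss, theta = reducer(partials, theta, eta=0.5)
```

In a real Hadoop implementation the driver would submit one job per iteration, with θ distributed to mappers via the configuration or distributed cache.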

The derivative of F with respect to θi can be shown to have the following form [141]:16

∂F/∂θi (θ) = Σ⟨x,y⟩ ( EPr(y′|x;θ)[Hi(x, y′)] − Hi(x, y) )

As Equation 6.13 makes clear, the objective decomposes linearly across training instances, meaning it can be optimized quite well in MapReduce. The expectation in the second part of the gradient's expression can be computed using a variety of techniques. However, as we saw with EM, when very large event spaces are being modeled, as is the case with sequence labeling, enumerating all possible values y′ can become computationally intractable. As was the case with HMMs, independence assumptions can be used to enable efficient computation using dynamic programming. In fact, the forward-backward algorithm introduced in Section 6.4 can, with only minimal modification, be used to compute the expectation EPr(y′|x;θ)[Hi(x, y′)] needed in CRF sequence models, as long as the feature functions respect the same Markov assumption that is made in HMMs. For more information about inference in CRFs using the forward-backward algorithm, we refer the reader to Sha et al. [140]. As we saw in the previous section, MapReduce offers significant speedups when training iterations require running the forward-backward algorithm. The same pattern of results holds when training linear CRFs.

16 This assumes that when ⟨x, y⟩ is present, the model is fully observed (i.e., there are no additional latent variables).

6.6 SUMMARY AND ADDITIONAL READINGS

This chapter focused on learning the parameters of statistical models from data, using expectation maximization algorithms or gradient-based optimization techniques. We focused especially on EM algorithms for three reasons. First, these algorithms can be expressed naturally in the MapReduce programming model, making them a good example of how to express a commonly-used algorithm in this new framework. Second, many models, such as the widely-used hidden Markov model (HMM) trained using EM, make independence assumptions that permit a high degree of parallelism in both the E- and M-steps. Thus, they are particularly well-positioned to take advantage of large clusters. Finally, EM algorithms are unsupervised learning algorithms, which means that they have access to far more training data than comparable supervised approaches. This is quite important, given that the manual effort required to generate annotated data remains a bottleneck in many supervised approaches. Data acquisition for unsupervised algorithms is often as simple as crawling specific web sources, given the enormous quantities of data available "for free". In Chapter 1, when we hailed large data as the "rising tide that lifts all boats" to yield more effective algorithms, we were mostly referring to unsupervised approaches. This, combined with the ability of MapReduce to process

6. The discussion demonstrates that not only does MapReduce provide a means for coping with ever-increasing amounts of data. Because of its ability to leverage large amounts of training data. machine learning is an attractive problem for MapReduce and an area of active research. 17 http://lucene. with certain similarities to EM. this led us to consider how related supervised learning models (which typically have much less training data available). [37] presented general formulations of a variety of machine learning problems. even for small amounts of data. SUMMARY AND ADDITIONAL READINGS 151 large datasets in parallel. have been explored by Wang et al.apache.org/mahout/ . Since EM algorithms are relatively computationally expensive. but it is also useful for parallelizing expensive computations. provides researchers with an eﬀective strategy for developing increasingly-eﬀective applications. can also be implemented in MapReduce.6. Although MapReduce has been designed with mostly dataintensive applications in mind. Issues associated with a MapReduce implementation of latent Dirichlet allocation (LDA). the ability to leverage clusters of commodity hardware to parallelize computationally-expensive algorithms is an important use case. focusing on a normal form for expressing a variety of machine learning algorithms in MapReduce. The Apache Mahout project is an open-source implementation of these and other learning algorithms. [151]. which is another important unsupervised learning technique. Additional Readings.17 and it is also the subject of a forthcoming book [116]. Chu et al.

CHAPTER 7

Closing Remarks

The need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence—driven by the ability to gather data from a dizzying array of sources—promises to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages. In the natural and physical sciences, the ability to analyze massive amounts of data may provide the key to unlocking the secrets of the cosmos or the mysteries of life.

In the preceding chapters, we have shown how MapReduce can be exploited to solve a variety of problems related to text processing at scales that would have been unthinkable a few years ago. For engineers building information processing tools and applications, larger datasets lead to more effective algorithms for a wide range of tasks, from machine translation to spam detection. However, no tool—no matter how powerful or flexible—can be perfectly adapted to every task, so it is only fair to discuss the limitations of the MapReduce programming model and survey alternatives. Section 7.1 covers online learning algorithms and Monte Carlo simulations, which are examples of algorithms that require maintaining global state. As we have seen, this is difficult to accomplish in MapReduce. Section 7.2 discusses alternative programming models, and the book concludes in Section 7.3.

7.1 LIMITATIONS OF MAPREDUCE

As we have seen throughout this book, solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).

The first example is online learning. Recall from Chapter 6 the concept of learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms we described are instances of what are known as batch learning algorithms. This simply means that the full "batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable: updates are not made until the full evidence of the training data has been weighed against the model. An earlier update would seem, in some sense, to be hasty.

However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of number of training instances processed), even if those updates are made by considering less data [24]. Thinking in terms of gradient optimization (see Section 6.5), online learning algorithms can be understood as computing an approximation of the true gradient, using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out errors that are made. In the limit, updates can be made after every training instance.

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. However, with online learning, these updates must occur after processing smaller numbers of instances. This means that the framework must be altered to support faster processing of smaller datasets, which goes against the design choices of most existing MapReduce implementations. Since MapReduce was specifically optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances, this cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable. An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data). A final solution is then arrived at by merging individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset [52].

A related difficulty occurs when running what are called Monte Carlo simulations, which are used to perform inference in probabilistic models where evaluating or representing the model exactly is impossible. The basic idea is quite simple: samples are drawn from the random variables in the model to simulate its behavior, and then simple frequency statistics are computed over the samples. This sort of inference is particularly useful when dealing with so-called nonparametric models, which are models whose structure is not specified in advance, but is rather inferred from training data. For an illustration, imagine learning a hidden Markov model, but inferring the number of states, rather than having them specified. Being able to parallelize Monte Carlo simulations would be tremendously valuable, particularly for unsupervised learning applications, where they have been found to be far more effective than EM-based learning (which requires specifying the model).
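The sampling idea can be sketched in a few lines (an illustrative toy model of ours, not an example from the book): a latent variable selects one of two dice, and a marginal probability is estimated from frequency statistics over the samples rather than by evaluating the model exactly.

```python
import random

# Monte Carlo inference by simulation: draw samples from the random
# variables in a small generative model, then compute frequency
# statistics over the samples. A biased coin (the latent variable)
# selects one of two dice; we estimate Pr(roll >= 5) from samples.

def sample(rng):
    biased = rng.random() < 0.3                      # latent variable
    die = [4, 5, 6, 4, 5, 6] if biased else [1, 2, 3, 4, 5, 6]
    return rng.choice(die)

rng = random.Random(42)
samples = [sample(rng) for _ in range(100_000)]
estimate = sum(1 for s in samples if s >= 5) / len(samples)
# Exact value for comparison: 0.7 * (2/6) + 0.3 * (4/6) = 0.4333...
```

Each mapper in a cluster could draw such samples independently, but samplers whose draws depend on statistics of other samples (e.g., Gibbs samplers for nonparametric models) reintroduce exactly the shared-state problem discussed here.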

Although recent work [10] has shown that the delays in synchronizing sample statistics due to parallel implementations do not necessarily damage the inference, MapReduce offers no natural mechanism for managing the global shared state that would be required for such an implementation.

The problem of global state is sufficiently pervasive that there has been substantial work on solutions. One approach is to build a distributed datastore capable of maintaining the global state. However, such a system would need to be highly scalable to be used in conjunction with MapReduce. Google's BigTable [34], which is a sparse, distributed, persistent multidimensional sorted map built on top of GFS, fits the bill, and has been used in exactly this manner. Amazon's Dynamo [48], which is a distributed key-value store (with a very different architecture), might also be useful in this respect, although it wasn't originally designed with such an application in mind. Unfortunately, it is unclear if the open-source implementations of these two systems (HBase and Cassandra, respectively) are sufficiently mature to handle the low-latency and high-throughput demands of maintaining global state in the context of massively distributed processing (but recent benchmarks are encouraging [40]).

7.2 ALTERNATIVE COMPUTING PARADIGMS

Streaming algorithms [3] represent an alternative programming model for dealing with large volumes of data with limited computational and storage resources. This model assumes that data are presented to the algorithm as one or more streams of inputs that are processed in order, and only once. The model is agnostic with respect to the source of these streams, which could be files in a distributed file system, data from an "external" source, or some other data gathering device. Stream processing is very attractive for working with time-series data (news feeds, tweets, sensor readings, etc.), which is difficult in MapReduce (once again, given its batch-oriented design). Furthermore, since streaming algorithms are comparatively simple (because there is only so much that can be done with a particular training instance), they can often take advantage of modern GPUs, which have a large number of (relatively simple) functional units [104]. In the context of text processing, streaming algorithms have been applied to language modeling [90], translation modeling [89], and detecting the first mention of a news event in a stream [121].
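The one-pass, bounded-memory discipline of streaming algorithms can be illustrated with reservoir sampling, a standard streaming technique (our choice of example, not one covered in the book): it maintains a uniform random sample of fixed size k over a stream of unknown length.

```python
import random

# Reservoir sampling: process a stream in order, and only once, while
# holding just k items in memory. After n items have been seen, each
# item has been retained with probability k/n.

def reservoir_sample(stream, k, rng):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:                    # replace with probability k/n
                reservoir[j] = item
    return reservoir

rng = random.Random(7)
sample = reservoir_sample(range(1_000_000), k=10, rng=rng)
# 'sample' holds 10 items drawn uniformly from the million-item stream
```

Note that memory usage is O(k) regardless of stream length, which is exactly the property that makes such algorithms attractive for unbounded time-series data.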

The idea of stream processing has been generalized in the Dryad framework as arbitrary dataflow graphs [75, 159]. A Dryad job is a directed acyclic graph where each vertex represents developer-specified computations and edges represent data channels that capture dependencies. The dataflow graph is a logical computation graph that is automatically mapped onto physical resources by the framework. At runtime, channels are used to transport partial results between vertices, and can be realized using files, TCP pipes, or shared memory.

Another system worth mentioning is Pregel [98], which implements a programming model inspired by Valiant's Bulk Synchronous Parallel (BSP) model [148]. Pregel was specifically designed for large-scale graph algorithms, but unfortunately there are few published details at present; a longer description is anticipated in a forthcoming paper [99].

What is the significance of these developments? The power of MapReduce derives from providing an abstraction that allows developers to harness the power of large clusters. As anyone who has taken an introductory computer science course would know, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. This process makes certain tasks easier, but others more difficult, if not impossible. MapReduce is certainly no exception to this generalization, and one of the goals of this book has been to give the reader a better understanding of what's easy to do in MapReduce and what its limitations are. But of course, MapReduce is not the end, and perhaps not even the best. It is merely the first of many approaches to harness large-scaled distributed computing resources. This begs the obvious question: What other abstractions are available in the massively-distributed datacenter environment? Are there more appropriate computational models that would allow us to tackle classes of problems that are difficult for MapReduce? Dryad and Pregel are alternative answers to these questions. They share in providing an abstraction for large-scale distributed computations, separating the what from the how of computation and isolating the developer from the details of concurrent programming. They differ, however, in how distributed computations are conceptualized: functional-style programming, arbitrary dataflows, or BSP. These conceptions represent different tradeoffs between simplicity and expressivity: for example, Dryad is more flexible than MapReduce, and in fact, MapReduce can be trivially implemented in Dryad. However, it remains unclear, at least at present, which approach is more appropriate for different classes of applications.

Looking forward, we can certainly expect the development of new models and a better understanding of existing ones. Even within the Hadoop/MapReduce ecosystem, we have already observed the development of alternative approaches for expressing distributed computations. For example, there is a proposal to add a third merge phase after map and reduce to better support relational operations [36]. Pig [114], which was inspired by Google's Sawzall [122], can be described as a data analytics platform that provides a lightweight scripting language for manipulating large datasets. Although Pig scripts (in a language called Pig Latin) are ultimately converted into Hadoop jobs by Pig's execution engine, constructs in the language allow developers to specify data transformations (filtering, joining, grouping, etc.) at a much higher level. Similarly, Hive [68], another open-source project, provides an abstraction on top of Hadoop that allows users to issue SQL queries against large relational datasets stored in HDFS.

Hive queries (in HiveQL) "compile down" to Hadoop jobs by the Hive query engine. Therefore, the system provides a data analysis tool for users who are already comfortable with relational databases, while simultaneously taking advantage of Hadoop's data processing capabilities.

7.3 MAPREDUCE AND BEYOND

The capabilities necessary to tackle large-data problems are already within reach by many and will continue to become more accessible over time. By scaling "out" with commodity servers, we have been able to economically bring large clusters of machines to bear on problems of interest. But this has only been possible with corresponding innovations in software and how computations are organized on a massive scale. Important ideas include: moving processing to the data, as opposed to the other way around; emphasizing throughput over latency for batch tasks by sequential scans through data, avoiding random seeks. Most important of all, however, is the development of new abstractions that hide system-level details from the application developer. These abstractions are at the level of entire datacenters, and provide a model using which programmers can reason about computations at a massive scale without being distracted by fine-grained concurrency management, fault tolerance, error recovery, and a host of other issues in distributed computing. None of these points are new or particularly earth-shattering—computer scientists have known about these principles for decades. However, MapReduce is unique in that, for the first time, all these ideas came together and were demonstrated on practical problems at scales unseen before. The engineers at Google deserve a tremendous amount of credit for that, and also for sharing their insights with the rest of the world. Furthermore, the engineers and executives at Yahoo deserve a lot of credit for starting the open-source Hadoop project, which has made MapReduce accessible to everyone and created the vibrant software ecosystem that flourishes today. Add to that the advent of utility computing, which eliminates capital investments associated with cluster infrastructure, and large-data processing capabilities are now available "to the masses" with a relatively low barrier to entry. This, in turn, paves the way for innovations in scalable algorithms that can run on petabyte-scale datasets. The golden age of massively distributed computing is finally upon us, both in terms of computational resources and the impact on the daily lives of millions.

Bibliography

[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the 35th International Conference on Very Large Data Bases (VLDB 2009), pages 922–933, Lyon, France, 2009.

[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002.

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC '96), pages 20–29, Philadelphia, Pennsylvania, 1996.

[4] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell C. Sears. BOOM: Data-centric programming in the datacenter. Technical Report UCB/EECS-2009-98, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.

[5] Gene Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, pages 483–485, 1967.

[6] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do we really need to reinvent the storage stack? In Proceedings of the 2009 Workshop on Hot Topics in Cloud Computing (HotCloud 09), San Diego, California, 2009.

[7] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, and Randolph Wang. Serverless network file systems. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 109–126, Copper Mountain Resort, Colorado, 1995.

[8] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1):151–166, 2005.

[9] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.

[10] Arthur Asuncion, Padhraic Smyth, and Max Welling. Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages 81–88, Vancouver, British Columbia, Canada, 2008.

[11] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and Fabrizio Silvestri. Challenges on distributed web retrieval. In Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE 2007), pages 6–20, Istanbul, Turkey, 2007.

[12] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), pages 183–190, Amsterdam, The Netherlands, 2007.

[13] Ricardo Baeza-Yates, Carlos Castillo, and Vicente López. PageRank increase under different collusion topologies. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005), pages 17–24, Chiba, Japan, 2005.

[14] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France, 2001.

[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages 164–177, Bolton Landing, New York, 2003.

[16] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28, 2003.

[17] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, 2007.

[18] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009.

[19] Jacek Becla, Andrew Hanushevsky, Sergei Nikolaev, Ghaleb Abdulla, Alex Szalay, Maria Nieto-Santisteban, Ani Thakar, and Jim Gray. Designing a multi-petabyte database for LSST. SLAC Publications SLAC-PUB-12292, Stanford Linear Accelerator Center, May 2006.

[20] Jacek Becla and Daniel L. Wang. Lessons learned from managing a petabyte. In Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, 2005.

[21] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science, 323(5919):1297–1298, 2009.

[22] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions on Internet Technology, 5(1):92–128, 2005.

[23] Jorge Luis Borges. Collected Fictions (translated by Andrew Hurley). Penguin, 1999.

[24] Léon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors, Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.

[25] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867, Prague, Czech Republic, 2007.

[26] Thorsten Brants and Peng Xu. Distributed Language Models. Morgan & Claypool Publishers, 2010.

[27] Eric Brill, Susan Dumais, and Michele Banko. Data-intensive question answering. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001), pages 393–400, Gaithersburg, Maryland, 2001.

[28] Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition. Addison-Wesley, Reading, Massachusetts, 1995.

[29] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.

[30] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge, Massachusetts, 2010.

[31] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616, 2009.

[32] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computer Systems, 4(4):405–436, 1991.

[33] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. Findings of the 2009 workshop on statistical machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (StatMT '09), pages 1–28, Athens, Greece, 2009.

[34] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating System Design and Implementation (OSDI 2006), pages 205–218, Seattle, Washington, 2006.

[35] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996), pages 310–318, Santa Cruz, California, 1996.

[36] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. MapReduce-Merge: Simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1029–1040, Beijing, China, 2007.

[37] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, and Kunle Olukotun. Map-Reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 19 (NIPS 2006), pages 281–288, Vancouver, British Columbia, Canada, 2006.

[38] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

[39] Jonathan Cohen. Graph twiddling in a MapReduce world. Computing in Science and Engineering, 11(4):29–41, 2009.

[40] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the First ACM Symposium on Cloud Computing (ACM SOCC 2010), Indianapolis, Indiana, 2010.

[41] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, Massachusetts, 1990.

[42] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley, Reading, Massachusetts, 2009.

[43] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Notices, 28(7):1–12, 1993.

[44] Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pages 133–140, Trento, Italy, 1992.

[45] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150, San Francisco, California, 2004.

[46] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[47] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool. Communications of the ACM, 53(1):72–77, 2010.

[48] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP 2007), pages 205–220, Stevenson, Washington, 2007.

[49] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[50] David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85–98, 1992.

[51] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David Wood. Implementation techniques for main memory database systems. ACM SIGMOD Record, 14(2):1–8, 1984.

[52] Mark Dredze, Alex Kulesza, and Koby Crammer. Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79:123–149, 2010.

. Alex Mont. pages 39–47. 1:201–233.162 CHAPTER 7. 93:37–46. Granovetter. Firth. 2010. [58] Seth Gilbert and Nancy Lynch. available. Training phrase-based machine translation models on the cloud: Open source machine translation toolkit Chaski. Newman. Jimmy Lin. 78(6):1360–1380. 2003. A synopsis of linguistic theory 1930–55. The Prague Bulletin of Mathematical Linguistics. Tampere. The strength of weak ties. Michele Banko. Granovetter. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003). pages 29–43. Bolton Landing. Community structure in social and biological networks. Special Volume of the Philological Society. pages 199–207. Columbus. 99(12):7821– 7826. New York. Howard Gobioﬀ. [59] Michelle Girvan and Mark E. pages 291–298. [61] Mark S. [54] Chris Dyer. [60] Ananth Grama. Proceedings of the National Academy of Science. Introduction to Parallel Computing. Blackwell. CLOSING REMARKS [53] Susan Dumais. 2002. pages 1–32. Ohio. Anshul Gupta. 33(2):51–59. 2003. Addison-Wesley. Reading. 2002. Brewer’s Conjecture and the feasibility of consistent. 2005. George Karypis. Eric Brill. and cheap: Construction of statistical machine translation models with MapReduce. J. Finland. In Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008. [55] John R. The strength of weak ties: A network theory revisited. 2002. ACM SIGACT News. and Shun-Tak Leung. [63] Zolt´n Gy¨ngyi and Hector Garcia-Molina. and Andrew Ng. Fast. 1973. and Vipin Kumar. 1983. 1957. easy. Sociological Theory. Japan. [57] Sanjay Ghemawat. partition-tolerant web services. [56] Qin Gao and Stephan Vogel. Chiba. Oxford. The Google File System. Web question answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). [62] Mark S. and Jimmy Lin. Web spam taxonomy. 2008. Aaron Cordova. In Studies in Linguistic Analysis. 
In Proceedings a o of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Massachusetts. The American Journal of Sociology.

[64] Per Hage and Frank Harary. Island Networks: Communication, Kinship, and Classification Structures in Oceania. Cambridge University Press, Cambridge, England, 1996.
[65] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[66] James Hamilton. On designing and deploying Internet-scale services. In Proceedings of the 21st Large Installation System Administration Conference (LISA '07), pages 233–244, Dallas, Texas, 2007.
[67] James Hamilton. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost, low power servers for Internet-scale services. In Proceedings of the Fourth Biennial Conference on Innovative Data Systems Research (CIDR 2009), Asilomar, California, 2009.
[68] Jeff Hammerbacher. Information platforms and the rise of the data scientist. In Toby Segaran and Jeff Hammerbacher, editors, Beautiful Data, pages 73–84. O'Reilly, Sebastopol, California, 2009.
[69] Zellig S. Harris. Mathematical Structures of Language. Wiley, New York, 1968.
[70] Md. Rafiul Hassan and Baikunth Nath. Stock market forecasting using hidden Markov models: A new approach. In Proceedings of the 5th International Conference on Intelligent Systems Design and Applications (ISDA '05), pages 192–196, Wroclaw, Poland, 2005.
[71] Bingsheng He, Wenbin Fang, Naga K. Govindaraju, Qiong Luo, and Tuyong Wang. Mars: A MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT 2008), pages 260–269, Toronto, Ontario, Canada, 2008.
[72] Tony Hey, Stewart Tansley, and Kristin Tolle. Jim Gray on eScience: A transformed scientific method. In Tony Hey, Stewart Tansley, and Kristin Tolle, editors, The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.
[73] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.
[74] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan, Robert Sidebotham, and Michael West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51–81, 1988.

[75] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys 2007), pages 59–72, Lisbon, Portugal, 2007.
[76] Adam Jacobs. The pathologies of big data. ACM Queue, 7(6), 2009.
[77] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, Massachusetts, 1992.
[78] Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, Massachusetts, 1997.
[79] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson, Upper Saddle River, New Jersey, 2009.
[80] U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. HADI: Fast diameter estimation and mining in massive graphs with Hadoop. Technical Report CMU-ML-08-117, School of Computer Science, Carnegie Mellon University, 2008.
[81] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. PEGASUS: A peta-scale graph mining system—implementation and observations. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining (ICDM 2009), pages 229–238, Miami, Florida, 2009.
[82] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2010), Austin, Texas, 2010.
[83] Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. Cluster computing for Web-scale data processing. In Proceedings of the 39th ACM Technical Symposium on Computer Science Education (SIGCSE 2008), pages 116–120, Portland, Oregon, 2008.
[84] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[85] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, Cambridge, England, 2010.
[86] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003), pages 48–54, Edmonton, Alberta, Canada, 2003.

[87] John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289, San Francisco, California, 2001.
[88] Ronny Lempel and Shlomo Moran. SALSA: The Stochastic Approach for Link-Structure Analysis. ACM Transactions on Information Systems, 19(2):131–160, 2001.
[89] Abby Levenberg, Chris Callison-Burch, and Miles Osborne. Stream-based translation models for statistical machine translation. In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles, California, 2010.
[90] Abby Levenberg and Miles Osborne. Stream-based randomised language models for SMT. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 756–764, Singapore, 2009.
[91] Adam Leventhal. Triple-parity RAID and beyond. ACM Queue, 7(11), 2009.
[92] Jimmy Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 27(2):1–55, 2007.
[93] Jimmy Lin. Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 419–428, Honolulu, Hawaii, 2008.
[94] Jimmy Lin. Exploring large-data issues in the curriculum: A case study with MapReduce. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008, pages 54–61, Columbus, Ohio, 2008.
[95] Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-latency, high-throughput access to static global resources within the Hadoop framework. Technical Report HCIL-2009-01, University of Maryland, College Park, Maryland, January 2009.
[96] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
[97] Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49, 2008.

[98] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC 2009), page 6, Calgary, Alberta, Canada, 2009.
[99] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, 2010.
[100] Robert Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49–55, Taipei, Taiwan, 2002.
[101] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2008.
[102] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1999.
[103] Elaine R. Mardis. The impact of next-generation sequencing technology on genetics. Trends in Genetics, 24(3):133–141, 2008.
[104] Michael D. McCool. Scalable programming models for massively multicore processors. Proceedings of the IEEE, 96(5):816–831, 2008.
[105] Marshall K. McKusick and Sean Quinlan. GFS: Evolution on fast-forward. ACM Queue, 7(7), 2009.
[106] John Mellor-Crummey, David Whalley, and Ken Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29(3):217–247, 2001.
[107] Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy. Building enriched document representations using aggregated anchor text. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 219–226, 2009.
[108] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), pages 214–221, Berkeley, California, 1999.

[109] Alistair Moffat, William Webber, and Justin Zobel. Load balancing for term-distributed parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 348–355, Seattle, Washington, 2006.
[110] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, Stockholm, Sweden, 1999.
[111] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 124–131, 2009.
[112] Franz J. Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[113] Christopher Olston and Marc Najork. Web crawling. Foundations and Trends in Information Retrieval, 4(3):175–246, 2010.
[114] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110, Vancouver, British Columbia, Canada, 2008.
[115] Kunle Olukotun and Lance Hammond. The future of microprocessors. ACM Queue, 3(7):27–34, 2005.
[116] Sean Owen and Robin Anil. Mahout in Action. Manning Publications Co., Greenwich, Connecticut, 2010.
[117] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Working Paper SIDL-WP-1999-0120, Stanford University, 1999.
[118] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
[119] David A. Patterson. The data center is the computer. Communications of the ACM, 52(1):105, 2008.
[120] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th ACM SIGMOD International Conference on Management of Data, pages 165–178, Providence, Rhode Island, 2009.
[121] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles, California, 2010.
[122] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, 13(4):277–298, 2005.
[123] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007), San Jose, California, 2007.
[124] Xiaoguang Qi and Brian D. Davison. Web page classification: Features and algorithms. ACM Computing Surveys, 41(2), 2009.
[125] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers, San Francisco, California, 1990.
[126] M. Mustafa Rafique, Benjamin Rose, Ali R. Butt, and Dimitrios S. Nikolopoulos. Supporting MapReduce on large-scale asymmetric multi-core clusters. ACM Operating Systems Review, 43(2):25–34, 2009.
[127] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA 2007), Phoenix, Arizona, 2007.
[128] Delip Rao and David Yarowsky. Ranking and semi-supervised classification on large scale graphs using Map-Reduce. In Proceedings of the ACL/IJCNLP 2009 Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-4), Singapore, 2009.
[129] Michael A. Rappa. The utility business model and the future of computing services. IBM Systems Journal, 43(1):32–42, 2004.
[130] Sheldon M. Ross. Stochastic Processes. Wiley, New York, 1996.

[131] Thomas Sandholm and Kevin Lai. MapReduce optimization using regulated dynamic prioritization. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '09), pages 299–310, Seattle, Washington, 2009.
[132] Michael Schatz. High Performance Computing for DNA Sequence Alignment and Assembly. PhD thesis, University of Maryland, College Park, Maryland, 2010.
[133] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages 231–244, Monterey, California, 2002.
[134] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, pages 110–121, Portland, Oregon, 1989.
[135] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors in the wild: A large-scale field study. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '09), pages 193–204, Seattle, Washington, 2009.
[136] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123, 1998.
[137] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing and Management, 33(3):307–318, 1997.
[138] Satoshi Sekine and Elisabete Ranchhod. Named Entities: Recognition, Classification and Use. John Benjamins, Amsterdam, The Netherlands, 2009.
[139] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 37–42, Orlando, Florida, 1999.
[140] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003), pages 134–141, Edmonton, Alberta, Canada, 2003.
[141] Noah Smith. Log-linear models. http://www.cs.cmu.edu/~nasmith/papers/smith.tut04.pdf, 2004.

[142] Christopher Southan and Graham Cameron. Beyond the tsunami: Developing the infrastructure to deal with life sciences data. In Tony Hey, Stewart Tansley, and Kristin Tolle, editors, The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.
[143] Mario Stanke and Stephan Waack. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19 Suppl 2:ii215–225, October 2003.
[144] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 53(1):64–71, 2010.
[145] Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, and Robert J. Brunner. Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey. SIGMOD Record, 29(2):451–462, 2000.
[146] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. Data-intensive file systems for Internet services: A rose by any other name... Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, 2008.
[147] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable distributed file system. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP 1997), pages 224–237, Saint-Malo, France, 1997.
[148] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[149] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A break in the clouds: Towards a cloud definition. ACM SIGCOMM Computer Communication Review, 39(1):50–55, 2009.
[150] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th International Conference on Computational Linguistics (COLING 1996), pages 836–841, Copenhagen, Denmark, 1996.
[151] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Proceedings of the Fifth International Conference on Algorithmic Aspects in Information and Management (AAIM 2009), pages 301–314, San Francisco, California, 2009.
[152] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.

California. Andy Konwinski. and Timothy C. Communications in Pure and Applied Mathematics. O’Reilly. 2006. Witten. California. Khaled Elmeleegy. DryadLINQ: A system for generalpurpose distributed data-parallel computing using a high-level language. San Diego. Bell. Survey of clustering algorithms. Dennis Fetterly. University of California at Berkeley. 2008. Ischia. [155] Eugene Wigner. Job scheduling for multi-user MapReduce clusters. Hadoop: The Deﬁnitive Guide. California. 2009. Mihai Budiu. [156] Ian H. pages 55–66. Anthony D. Ulfar Erlingsson.7. and Ion Stoica. Managing Gigabytes: Compressing and Indexing Documents and Images. ACM Computing Surveys. In Proceedings of the 5th Conference on Computing Frontiers. and Ion Stoica. 13(1):1–14. 2008. [160] Matei Zaharia. [161] Matei Zaharia. Randy Katz. Joydeep Sen Sarma. Technical Report UCB/EECS-2009-55. 1999. San Diego.3. Alistair Moﬀat. and Jon Currey. 2008. FPGA-based prototype of a PRAM-On-Chip processor. [158] Rui Xu and Donald Wunsch II. Dhruba Borthakur. California. Pradeep Kumar Gunda. 1998. 16(1):61–81. The unreasonable eﬀectiveness of mathematics in the natural sciences. . In Proceedings of the 8th Symposium on Operating System Design and Implementation (OSDI 2008). 16(3):645–678. Corpus-based stemming using cooccurrence of word variants. Morgan Kaufmann Publishing. MAPREDUCE AND BEYOND 171 [153] Xingzhi Wen and Uzi Vishkin. pages 29–42. Bruce Croft. ´ [159] Yuan Yu. Scott Shenker. Italy. Inverted ﬁles for text search engines. In Proceedings of the 8th Symposium on Operating System Design and Implementation (OSDI 2008). Improving MapReduce performance in heterogeneous environments. Electrical Engineering and Computer Sciences. ACM Transactions on Information Systems. [157] Jinxi Xu and W. pages 1–14. Michael Isard. IEEE Transactions on Neural Networks. Sebastopol. Joseph. 2005. [162] Justin Zobel and Alistair Moﬀat. San Francisco. 2009. 38(6):1–56. 1960. [154] Tom White.